\documentclass{sigplanconf}

% The following \documentclass options may be useful:
% 10pt To set in 10-point type instead of 9-point.
% 11pt To set in 11-point type instead of 9-point.
% authoryear To obtain author/year citation style instead of numeric.

\conferenceinfo{PGAS '11}{October 15, Galveston.}
\copyrightdata{[to be supplied]}

\titlebanner{preprint} % These are ignored unless
\preprintfooter{preprint} % 'preprint' option specified.

\title{Automatic Parallelization of Numerical Python Applications Using the
Global Arrays Toolkit}
%\subtitle{Subtitle Text, if any}

\authorinfo{Jeff Daily}
{Pacific Northwest National Laboratory}
\authorinfo{Robert R Lewis}
{Washington State University}
{bobl@tricity.wsu.edu}
\begin{document}

\maketitle

\begin{abstract}
Global Arrays is a software system from Pacific Northwest National Laboratory
that enables an efficient, portable, and parallel shared-memory programming
interface to manipulate distributed dense arrays. The NumPy module is the de
facto standard for numerical calculation in the Python programming language, a
language whose use is growing rapidly in the scientific and engineering
communities. NumPy provides a powerful N-dimensional array class as well as
other scientific computing capabilities. However, like the majority of the
core Python modules, NumPy is inherently serial. Using a combination of Global
Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in
replacement called Global Arrays in NumPy (GAiN). Serial NumPy applications
can become parallel, scalable GAiN applications with only minor source code
changes. GAiN's design follows the owner-computes rule, utilizing both the
data locality and one-sided communication capabilities of Global Arrays.
Scalability studies of several different GAiN applications will be presented
showing the utility of developing serial NumPy codes which can later run on
more capable clusters or supercomputers.
\end{abstract}
%\category{CR-number}{subcategory}{third-level}

%Global Arrays, PGAS, NumPy, Python, GAiN
\keywords{Global Arrays, PGAS, NumPy, Python, GAiN}

\section{Introduction}
Scientific computing with Python typically involves using the NumPy package.
NumPy provides an efficient multi-dimensional array and array processing
routines. Unfortunately, like many Python programs, NumPy is serial in nature.
This limits both the size of the arrays and the speed with which they can be
processed to the resources available on a single compute node.

For the most part, NumPy programs are written, debugged, and run in
single-threaded environments. This may be sufficient for certain problem
domains. However, NumPy may also be used to develop prototype software. Such
software is usually ported to a different, compiled language and/or explicitly
parallelized to take advantage of additional hardware.

Global Arrays in NumPy (GAiN) is an extension to Python that provides
parallel, distributed processing of arrays. It implements a subset of the
NumPy API so that some programs can take advantage of parallel processing
transparently, simply by importing GAiN in place of NumPy. Other programs may
require slight modification. This allows those programs to take advantage of
the additional cores available on single compute nodes and to increase problem
sizes by distributing across clustered environments.

This work builds on \cite{Dai09} by increasing the scalability of the approach
from 16 cores to 2K cores. This manuscript also builds on \cite{Dai11} by
expanding both the design discussion and background such that this paper may
stand on its own. The previous manuscript was prepared for the Python
community and as such relevant background material on the Python language and
related modules was omitted. An alternative parallelization strategy is also
implemented and discussed, and further parallelization strategies are proposed
in the discussion of future work. Lastly, the difficulty of scaling the Python
interpreter has been added to the evaluation.
Like any complex piece of software, GAiN builds on many other foundational
ideas and implementations. This background is not intended to be a complete
reference, but rather covers only what is necessary to understand the design
and implementation of GAiN. Further details may be found by examining the
references or as otherwise noted.
Python \cite{Lun01,Pyt11a} is a machine-independent, bytecode-interpreted,
object-oriented programming (OOP) language. It can be used for rapid
application development, shell scripting, or scientific computing, to name a
few uses. It gives programmers and end users the ability to extend the
language with new features or functionality. Such extensions can be written in
C, C++, FORTRAN, or Python. It is also a highly introspective language,
allowing code to examine various features of the Python interpreter at
run-time and adapt as needed. Although Python v3.2 exists, the following
sections explain specific features of Python 2.7 as it is the version most
commonly used by the scientific Python community \cite{Pyt11b}.
\subsubsection{Operator Overloading}

User-defined classes may implement special methods that are then invoked by
built-in Python functions. This allows any user-defined class to behave like
Python's built-in objects. For example, if a class defines \verb=__len__()= it
will be called by the built-in \verb=len()=. There are special methods for
object creation, deletion, attribute access, calling objects as functions, and
making objects act like Python sequences, maps, or numbers. Classes need only
implement the appropriate overloaded operators.
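As a brief illustration, the following hypothetical class implements two such
special methods so that its instances respond to the built-in \verb=len()= and
to bracket notation; it is only a minimal sketch of the mechanism, not code
from GAiN.
\begin{minted}[]{python}
class Deck(object):
    """A toy sequence-like class (hypothetical example)."""
    def __init__(self, cards):
        self._cards = list(cards)
    def __len__(self):
        # invoked by the built-in len()
        return len(self._cards)
    def __getitem__(self, key):
        # invoked by bracket notation, e.g. deck[0] or deck[0:2]
        return self._cards[key]

deck = Deck(["ace", "king", "queen"])
print len(deck)   # 3
print deck[0]     # 'ace'
\end{minted}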
\subsubsection{Object Construction}

User-defined classes control their instance creation using either the
overloaded \verb=__new__()=, \verb=__init__()=, or both. \verb=__new__()= is a
special-cased static method that takes the class of which an instance was
requested as its first argument. It must return a class instance, but it need
not be an instance of the class that was requested. If \verb=__new__()=
returns an instance of the requested class, then the instance's
\verb=__init__()= is called. If some other class instance is returned, then
\verb=__init__()= is not called on that instance. \verb=__new__()= was
designed to allow for subclasses of Python's immutable types but can be used
for other purposes as well.
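The distinction matters in practice. The following minimal sketch (hypothetical
classes, not GAiN code) shows that \verb=__init__()= runs only when
\verb=__new__()= returns an instance of the requested class.
\begin{minted}[]{python}
class Base(object):
    def __new__(cls):
        print "Base.__new__"
        return super(Base, cls).__new__(cls)
    def __init__(self):
        # runs because __new__ returned a Base instance
        print "Base.__init__"

class Other(object):
    pass

class Factory(object):
    def __new__(cls):
        # returns an instance of a different class,
        # so Factory.__init__ is never invoked
        return Other()
    def __init__(self):
        print "never reached"

b = Base()     # prints Base.__new__ then Base.__init__
f = Factory()  # prints nothing; f is an Other instance
\end{minted}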
\subsubsection{Slicing}

Python supports two types of built-in sequences, immutable and mutable. The
immutable sequences are the \texttt{string}s, \texttt{unicode}s, and
\texttt{tuple}s while the only mutable type is the \texttt{list}. Python
sequences generally support access to their items via bracket ``[]'' notation
and accept either an integer $k$ where $0 \le k < N$ and $N$ is the length of
the sequence, or a \texttt{slice} object. Out-of-bounds indices, as opposed to
out-of-bounds slices, will result in an error. However, negative indices and
slices are supported by the built-in Python sequences by calculating offsets
from the end of the sequence. It is up to the implementing class whether to
support negative indices and slices when overloading the sequence operators.

\texttt{slice} objects are used for describing \emph{extended slice syntax}.
\texttt{slice} objects describe the beginning, end, and increment of a
subsequence via the start, stop, and step attributes, respectively. When
\texttt{slice}s are used within the bracket notation, they can be represented
by using the \texttt{slice()} constructor or as colon-separated values. The
start value is inclusive of its index while the stop value is not. If the
start value is omitted it defaults to 0. Similarly, stop defaults to the
length of the sequence and step defaults to 1. The \texttt{slice} can be used
in lieu of an index for accessing a subsequence of the sequence object.
Slicing a built-in Python sequence always returns a copy of the selected
subsequence. User-defined classes are free to abandon that convention.
Further, \texttt{slice}s that are out-of-bounds will silently return an
in-bounds sequence, which is different behavior than simple single-item
access.

To illustrate both index and \texttt{slice} access to Python sequences, assume
we have the \texttt{list} of \texttt{int}s \texttt{A=[0,1,2,3,4,5,6,7,8,9]}.
An example of single-item access would be \texttt{A[1]=1}. Using slice syntax
would look like \texttt{A[1:2]=[1]}. A negative single-item index looks like
\texttt{A[-1]=9}. Finally, an example of slice syntax that includes a step is
\texttt{A[2:9:3]=[2,5,8]}. Note in these examples how single-item access
returns an \texttt{int} whereas slicing notation returns a \texttt{list}.
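These behaviors can be verified interactively; the snippet below (plain
Python 2, independent of GAiN) also shows the \texttt{slice()} constructor and
the silent clipping of out-of-bounds slices described above.
\begin{minted}[]{python}
A = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print A[1]               # 1, single-item access returns an int
print A[1:2]             # [1], slicing returns a list
print A[-1]              # 9, negative index counts from the end
print A[2:9:3]           # [2, 5, 8]
print A[slice(2, 9, 3)]  # [2, 5, 8], same subsequence via the constructor
print A[5:100]           # [5, 6, 7, 8, 9], out-of-bounds slice is clipped
# A[100] would raise an IndexError, unlike the clipped slice above
\end{minted}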
\subsubsection{Object Serialization}

Many languages have the need for object serialization and de-serialization.
Serialization is the process by which a class instance is transformed into
another, often succinct, representation (like a byte stream) which can
persist or be transmitted. It can then be de-serialized back into its original
object. This is also commonly known as marshalling and unmarshalling. In
Python this is also called pickling and unpickling.

The \texttt{pickle} module is part of the Python Standard Library \cite{Lun01}
and can serialize nearly everything in Python. It is designed to be backwards
compatible with earlier versions of Python. Certain Python objects, such as
modules or functions, are serialized by name only rather than by their state.
They can only be de-serialized so long as the same module or function exists
within the scope where the de-serialization occurs. There is no guarantee that
the same function will result from the de-serialization or that the function
exists in the target namespace.
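A short example of pickling, assuming nothing beyond the standard library;
note that the function is serialized by name only and is looked up again in its
defining module when unpickled.
\begin{minted}[]{python}
import pickle

def scale(x):
    return 2 * x

# an ordinary data structure round-trips through a byte string
data = {"name": "gain", "shape": (3, 4), "values": [1.0, 2.5]}
buf = pickle.dumps(data)
assert pickle.loads(buf) == data

# functions are pickled by name; unpickling resolves the name in the
# defining module, so the identical object comes back
buf = pickle.dumps(scale)
assert pickle.loads(buf) is scale
\end{minted}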
NumPy \cite{Oli06} is a Python extension module which adds a powerful
multidimensional array class \verb=ndarray= to the Python language. NumPy also
provides scientific computing capabilities such as basic linear algebra and
Fourier transform support. NumPy is the de facto standard for scientific
computing in Python and the successor of the other numerical Python packages
Numarray \cite{Dub96} and Numeric \cite{Asc99}.
\subsubsection{NumPy's ndarray}

The primary class defined by NumPy is \verb=ndarray=. The \verb=ndarray= is
implemented as a contiguous memory segment. Internally, all \verb=ndarray=
instances have a pointer to the location of the first element as well as the
attributes \verb=shape=, \verb=ndim=, and \verb=strides=. \verb=ndim=
describes the number of dimensions in the array, \verb=shape= describes the
number of elements in each dimension, and \verb=strides= describes the number
of bytes between consecutive elements per dimension. The \verb=ndarray= can be
either FORTRAN- or C-ordered. Recall that in FORTRAN, the first dimension has
a stride of one while it is the opposite (last) dimension in C. \verb=shape=
can be modified while \verb=ndim= and \verb=strides= are read-only and used
internally, although their exposure to the programmer may help in developing
efficient algorithms.
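The relationship between these attributes can be seen directly; a small
example assuming a standard NumPy installation:
\begin{minted}[]{python}
import numpy as np

a = np.zeros((3, 4), dtype=np.float64)    # C-ordered by default
print a.ndim      # 2
print a.shape     # (3, 4)
print a.strides   # (32, 8): 8-byte elements, last axis has a one-element stride

f = np.zeros((3, 4), dtype=np.float64, order='F')
print f.strides   # (8, 24): the first axis varies fastest in FORTRAN order

a.shape = (4, 3)  # shape may be reassigned; ndim and strides are managed internally
\end{minted}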
The creation of \verb=ndarray= instances is complicated by the various ways in
which it can be done such as explicit constructor calls, view casting, or
creating new instances from template instances (e.g. slicing). To this end,
the \verb=ndarray= does not implement Python's \verb=__init__()= object
constructor. Instead, \verb=ndarrays= use the \verb=__new__()=
\verb=classmethod=. Recall that \verb=__new__()= is Python's hook for
subclassing its built-in objects. If \verb=__new__()= returns an instance of
the class on which it is defined, then the class's \verb=__init__()= method is
also called. Otherwise, the \verb=__init__()= method is not called. Given the
various ways that \verb=ndarray= instances can be created, the
\verb=__new__()= \verb=classmethod= might not always get called to properly
initialize the instance. \verb=__array_finalize__()= is called instead of
\verb=__init__()= for \verb=ndarray= subclasses to avoid this limitation.
\subsubsection{Slicing}

Unlike slicing with built-in Python sequences, slicing in NumPy is performed
per axis. Each sliced axis is separated by commas within the usual bracket
notation. Further, slicing in NumPy produces ``views'' rather than copies of
the original \texttt{ndarray}. If a copy of the result is truly desired, it
can be explicitly requested. This allows operations on subarrays without
unnecessary copying of data. To the programmer, \texttt{ndarray}s behave the
same whether or not they are the result of a slicing operation. Views have an
additional attribute, \texttt{base}, assigned that points to the
\texttt{ndarray} that owns the data. The original \texttt{ndarray}'s
\texttt{base} is \texttt{None} (effectively a null pointer). When an
\texttt{ndarray} is sliced, the resulting \texttt{ndarray} may have
appropriately different \texttt{shape}, \texttt{strides}, and \texttt{ndim}
attributes. There is no restriction on taking slices of already sliced
\texttt{ndarray}s, either.
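For example, with a plain NumPy installation:
\begin{minted}[]{python}
import numpy as np

a = np.zeros((3, 4))
v = a[1:, ::2]            # per-axis slicing produces a view, not a copy

print a.base is None      # True: 'a' owns its data
print v.base is a         # True: the view points back to the owning array
print v.shape, v.strides  # differ appropriately from a.shape and a.strides

v[0, 0] = 99.0            # writing through the view also changes 'a'
print a[1, 0]             # 99.0

c = a[1:, ::2].copy()     # an explicit copy when one is truly desired
\end{minted}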
\subsubsection{NumPy's Universal Functions}

The element-wise operators in NumPy are known as Universal Functions, or
ufuncs. Many of the methods of \verb=ndarray= simply invoke the corresponding
ufunc. For example, the operator \verb=+= calls \verb=ndarray.__add__()= which
invokes the ufunc \verb=add=. Ufuncs are either unary or binary, taking either
one or two arrays as input, respectively. Ufuncs always return the result of
the operation as an \verb=ndarray= or \verb=ndarray= subclass. Optionally, an
additional output parameter may be specified to receive the results of the
operation. Specifying this output parameter to the ufunc avoids the sometimes
unnecessary creation of a new \verb=ndarray=.

Ufuncs are more than just callable functions. They also have some special
methods such as \verb=reduce()= and \verb=accumulate()=. \verb=reduce()= is
similar to Python's built-in function of the same name that repeatedly applies
a callable object to its last result and the next item of the sequence. This
effectively reduces a sequence to a single value. When applied to arrays the
reduction occurs along the first axis by default, but other axes may be
specified. Each ufunc defines the function that is used for the reduction. For
example, \verb=add.reduce()= will sum the values along an axis while
\verb=multiply.reduce()= will generate their product.
\verb=accumulate()= is similar to \verb=reduce()=, but it returns the
intermediate results of the reduction.
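A few concrete calls, again assuming plain NumPy, illustrate the ufunc methods
and the optional output parameter:
\begin{minted}[]{python}
import numpy as np

a = np.arange(1, 7).reshape(2, 3)    # [[1 2 3], [4 5 6]]

print np.add(a, a)                   # the ufunc behind the + operator
print np.add.reduce(a)               # [5 7 9], summed along the first axis
print np.add.reduce(a, axis=1)       # [ 6 15]
print np.multiply.reduce(a, axis=1)  # [  6 120], the product along axis 1
print np.add.accumulate(a, axis=1)   # [[ 1  3  6], [ 4  9 15]]

out = np.empty_like(a)
np.add(a, a, out)                    # reuse 'out' rather than allocating a result
\end{minted}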
Ufuncs can operate on \verb=ndarray= subclasses or array-like objects. In
order for subclasses of the \verb=ndarray= or array-like objects to utilize
the ufuncs, they may define three methods and one attribute:
\verb=__array_prepare__()=, \verb=__array_wrap__()=, \verb=__array__()=, and
\verb=__array_priority__=. The\linebreak
\verb=__array_prepare__()= and \verb=__array_wrap__()= methods will be called
on either the output, if specified, or the input with the highest
\verb=__array_priority__=. \verb=__array_prepare__()= is called on the way
into the ufunc after the output array is created but before any computation
has been performed and \verb=__array_wrap__()= is called on the way out of the
ufunc. Those two functions exist so that \verb=ndarray= subclasses can
properly modify any attributes or properties specific to their subclass.
Lastly, if an output is specified which defines an \verb=__array__()= method,
results will be written to the object returned by calling \verb=__array__()=.

\subsubsection{Broadcasting}

NumPy introduces the powerful feature of allowing otherwise incompatible
arrays to be used as operands in element-wise operations. If the numbers of
dimensions of two arrays do not match, 1's are repeatedly prepended to the
shape of the array with the fewer dimensions until their \verb=ndim= values
match. Arrays are then broadcast-compatible (also \emph{broadcastable}) if for
each of their dimensions their shapes either match or one of them is equal to
1. For example, the shapes \verb=(3, 4, 5)= and \verb=(2, 3, 4, 1)= are
broadcastable. In this way, scalars can be used as operands in element-wise
array operations since they will be broadcast to match any other array.
Broadcasting relies on the \texttt{strides} attribute of the \texttt{ndarray}.
A stride of 0 effectively causes the data for that dimension to repeat, which
is precisely what happens when broadcasting occurs in element-wise array
operations.
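The shape example above can be checked directly, and
\verb=numpy.broadcast_arrays()= exposes the zero-stride mechanism:
\begin{minted}[]{python}
import numpy as np

a = np.ones((3, 4, 5))
b = np.ones((2, 3, 4, 1))

# a is treated as shape (1, 3, 4, 5); every dimension then matches or is 1
print (a + b).shape      # (2, 3, 4, 5)
print (a * 2.0).shape    # (3, 4, 5), a scalar broadcasts against any array

x, y = np.broadcast_arrays(a, b)
print y.strides          # (96, 32, 8, 0): the size-1 axis repeats via a 0 stride
\end{minted}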
\subsection{Parallel Programming Paradigms}

Parallel applications can be classified into a few well-defined programming
paradigms. Each paradigm is a class of algorithms that have the same control
structure. The literature differs in how these paradigms are classified and
the boundaries between paradigms can sometimes be fuzzy or intentionally
blended into hybrid models \cite{Buy99}.

\subsubsection{Master/Slave}

The master/slave paradigm, also known as task-farming, is where a single
master process farms out tasks to multiple slave processes. The control is
always maintained by the master, dispatching commands to the slaves. Usually,
the communication takes place only between the master and the slaves. This
model may use either static or dynamic load-balancing. The former allocates
tasks when the computation begins, whereas the latter allows the application
to adjust to changing conditions within the computation. Dynamic
load-balancing may involve recovering after the failure of a subset of slave
processes or handling the case where the number of tasks is not known at the
start of the application.
\subsubsection{Single Program, Multiple Data}

With SPMD, each process executes essentially the same code but on a different
part of the data. The communication pattern is highly structured and
predictable. Occasionally, a global synchronization may be needed. The
efficiency of these types of programs depends on the decomposition of the data
and the degree to which the data is independent of its neighbors. These
programs are also highly susceptible to process failure. If any single
process fails, it generally causes deadlock since global synchronizations
thereafter would fail.
\subsection{Message Passing Interface (MPI)}

Message passing libraries allow efficient parallel programs to be written for
distributed memory systems. MPI \cite{Gro99a}, also known as MPI-1, is a
library specification for message-passing that was standardized in May 1994 by
the MPI Forum. It is designed for high performance on both massively parallel
machines and on workstation clusters. An optimized MPI implementation exists
on nearly all modern parallel systems and there are a number of freely
available, portable implementations for all other systems \cite{Buy99}. As
such, MPI is the de facto standard for writing massively parallel application
codes in either FORTRAN, C, or C++.

The MPI-2 standard \cite{Gro99b} was first completed in 1997 and added a
number of important additions to MPI including, but not limited to, one-sided
communication and the C++ language binding. Before MPI-2, all communication
required explicit handshaking between the sender and receiver via
\verb=MPI_Send()= and \verb=MPI_Recv()= or their non-blocking variants.
MPI-2's one-sided communication model allows reads, writes, and accumulates of
remote memory without the explicit cooperation of the process owning the
memory. If synchronization is required at a later time, it can be requested
via \verb=MPI_Barrier()=. Otherwise, there is no strict guarantee that a
one-sided operation will complete before the data segment it accessed is used
by another process.
mpi4py is a Python wrapper around MPI. It is written to mimic the C++ language
bindings. It supports point-to-point communication, one-sided communication,
as well as the collective communication models. Typical communication of
arbitrary objects in the FORTRAN or C bindings of MPI requires the programmer
to define new MPI datatypes. These datatypes describe the number and order of
the bytes to be communicated. On the other hand, strings could be sent without
defining a new datatype so long as the length of the string was understood by
the recipient. mpi4py is able to communicate any serializable Python object
since serialized objects are just byte streams. mpi4py also has special
enhancements to efficiently communicate any object implementing Python's
buffer protocol, such as NumPy arrays. It also supports dynamic process
management and parallel I/O \cite{Dal05,Dal08}.
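A minimal mpi4py sketch of both styles of communication (the lowercase methods
pickle arbitrary objects, while the uppercase methods use the buffer protocol):
\begin{minted}[]{python}
# run with: mpiexec -np 2 python send_recv.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    comm.send({"step": 1, "note": "hello"}, dest=1, tag=0)   # pickled object
    comm.Send(np.arange(10, dtype='d'), dest=1, tag=1)       # buffer-based send
elif rank == 1:
    obj = comm.recv(source=0, tag=0)
    buf = np.empty(10, dtype='d')
    comm.Recv(buf, source=0, tag=1)
    print rank, obj, buf.sum()
\end{minted}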
\subsection{Global Arrays and Aggregate Remote Memory Copy Interface}

The GA toolkit \cite{Nie06,Nie10,Pnl11} is a software system from Pacific
Northwest National Laboratory that enables an efficient, portable, and
parallel shared-memory programming interface to manipulate physically
distributed dense multidimensional arrays, without the need for explicit
cooperation by other processes. GA complements the message-passing programming
model and is compatible with MPI so that the programmer can use both in the
same program. GA has supported Python bindings since version 5.0. Arrays are
created by calling one of the creation routines such as \verb=ga.create()=,
returning an integer handle which is passed to subsequent operations. The GA
library handles the distribution of arrays across processes and recognizes
that accessing local memory is faster than accessing remote memory. However,
the library provides access mechanisms for any part of the entire distributed
array regardless of where its data is located. Local memory is acquired via
\verb=ga.access()= returning a pointer to the data on the local process, while
remote memory is retrieved via \verb=ga.get()= filling an already allocated
array buffer. Individual, discontiguous sets of array elements can be updated
or retrieved using \verb=ga.scatter()= or \verb=ga.gather()=, respectively.
GA has been leveraged in several large computational chemistry codes and has
been shown to scale well \cite{Apr09}.
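A hypothetical sketch of using the \verb=ga= bindings directly is shown below;
the exact signatures and type constants are assumptions on our part and may
differ from the released API.
\begin{minted}[]{python}
from mpi4py import MPI   # assumed: MPI is initialized before GA is used
import ga

g_a = ga.create(ga.C_DBL, [1000, 1000])  # collective creation; returns a handle
lo, hi = ga.distribution(g_a)            # bounds of the block owned locally

if ga.nodeid() == 0:
    # one-sided fetch of a patch that may live on another process
    patch = ga.get(g_a, [0, 0], [10, 10])

local = ga.access(g_a)                   # NumPy view of the locally owned block
ga.sync()                                # collective synchronization
ga.destroy(g_a)
\end{minted}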
The Aggregate Remote Memory Copy Interface (ARMCI) provides general-purpose,
efficient, and widely portable remote memory access (RMA) operations
(one-sided communication). ARMCI operations are optimized for contiguous and
non-contiguous (strided, scatter/gather, I/O vector) data transfers. It also
exploits native network communication interfaces and system resources such as
shared memory \cite{Nie00}. ARMCI provides simpler progress rules and a less
synchronous model of RMA than MPI-2. ARMCI has been used to implement the
Global Arrays library, GPSHMEM (a portable version of the Cray SHMEM library),
and the portable Co-Array FORTRAN compiler from Rice University \cite{Dot04}.
Cython \cite{Beh11} is both a language which closely resembles Python and a
compiler which generates C code based on Python's C API. The Cython language
additionally supports calling C functions as well as static typing. This makes
writing C extensions or wrapping external C libraries for the Python language
nearly as easy as writing Python itself.
\section{Related Work}

GAiN is similar in many ways to other parallel computation software packages.
It attempts to leverage the best ideas for transparent, parallel processing
found in current systems. The following packages provided insight into how
GAiN was to be developed.
MITMatlab \cite{Hus98}, which was later rebranded as Star-P \cite{Ede07},
provides a client-server model for interactive, large-scale scientific
computation. It provides a transparently parallel front end through the
popular MATLAB \cite{Pal07} numerical package and sends the parallel
computations to its Parallel Problem Server. Star-P briefly had a Python
interface. Separating the interactive, serial nature of MATLAB from the
parallel computation server allows the user to leverage the strengths of both.
This also allows much larger arrays to be operated on than a single compute
node would allow.
\subsection{Global Arrays Meets MATLAB}

Global Arrays Meets MATLAB (GAMMA) \cite{Pan06} provides a MATLAB binding to
the GA toolkit, thus allowing for larger problem sizes and parallel
computation. GAMMA can be viewed as a GA implementation of MITMatlab and was
shown to scale well even within an interpreted environment like MATLAB.
IPython \cite{Per07} provides an enhanced interactive Python shell as well as
an architecture for interactive parallel computing. IPython supports
practically all models of parallelism but, more importantly, in an interactive
way. For instance, a single interactive Python shell could be controlling a
parallel program running on a supercomputer. This is done by having a Python
engine running on a remote machine which is able to receive Python commands.

\subsection{IPython's distarray}

distarray \cite{Per09} is an experimental package for the IPython project.
distarray uses IPython's architecture as well as MPI extensively in order to
look and feel like NumPy \verb=ndarray= instances. Only the SPMD model of
parallel computation is supported, unlike other parallel models supported
directly by IPython. Further, distarray is a proof-of-concept and is not
production ready.
\subsection{DistNumPy}

DistNumPy \cite{Kri10} extends the NumPy module with a new, distributed
backend for NumPy arrays. The distributed nature of DistNumPy is almost fully
transparent to the user, much like the work described in this paper. It is
able to parallelize any NumPy application by having the master process
dispatch parallel NumPy operations to the slaves while the master is able to
execute any serial user code. However, certain features limit its utility.
DistNumPy cannot be used in conjunction with a standard NumPy installation
because it implements the same \texttt{numpy} namespace. DistNumPy is a fork
of v1.3 of the NumPy codebase and it is unclear how much effort would be
involved in bringing it up-to-date with the enhancements found in the latest
NumPy v1.6 release. The block cyclic distribution used by DistNumPy works well
for some packages, but it is not ideal for all problem types. The user is not
allowed to alter the distribution scheme used by DistNumPy. Lastly, DistNumPy
does not adequately handle problems that involve non-aligned array views.
A Graphics Processing Unit (GPU) is a powerful parallel processor that is
capable of more floating point calculations per second than a traditional CPU.
However, GPUs are more difficult to program and require other special
considerations such as copying data from main memory to the GPU's on-board
memory in order for it to be processed, then copying the results back. The
GpuPy \cite{Eit07} Python extension package was developed to lessen these
burdens by providing a NumPy-like interface for the GPU. Preliminary results
demonstrate considerable speedups for certain single-precision floating point
operations.
A subset of the Global Arrays toolkit was wrapped in Python for the 3.x series
of GA by Robert Harrison \cite{Har99}. It illustrated some important concepts
such as the benefits of integration with NumPy -- the local or remote portions
of the global arrays were retrieved as NumPy arrays at which point they could
be used as inputs to NumPy functions like the ufuncs.
\subsection{Co-Array Python}

Co-Array Python \cite{Ras04} is modeled after the Co-Array FORTRAN extensions
to FORTRAN 95. It allows the programmer to access data elements on non-local
processors via an extra array dimension, called the co-dimension. The
\verb=CoArray= module provided a local data structure existing on all
processors executing in an SPMD fashion. The CoArray was designed as an
extension to Numeric Python \cite{Asc99}.
The need for parallel programming and running these programs on parallel
architectures is obvious; however, efficiently programming for a parallel
environment can be a daunting task. One area of research is to automatically
parallelize otherwise serial programs and to do so with the least amount of
user intervention \cite{Buy99}. GAiN attempts to do this for certain Python
programs utilizing the NumPy module. It will be shown that some NumPy programs
can be parallelized in a nearly transparent way with GAiN.

There are a few assumptions which govern the design of GAiN. First, all
documented GAiN functions are collective. Since Python and NumPy were designed
to run serially on workstations, it naturally follows that GAiN, running in an
SPMD fashion, will execute every documented function collectively. Second,
only certain arrays should be distributed. In general, it is inefficient to
distribute arrays which are relatively small and/or easy to compute. It
follows, then, that GAiN operations should allow mixed inputs of both
distributed and local array-like objects. Further, NumPy represents an
extensive, useful, and hardened API. Every effort to reuse NumPy should be
made. Lastly, GA has its own strengths to offer such as processor groups and
custom data distributions. In order to maximize scalability of this
implementation, we should enable the use of processor groups \cite{Nie05}.
A distributed array representation must acknowledge the duality of a global
array and the physically distributed memory of the array. Array attributes
such as \verb=shape= should return the global, coalesced representation of the
array which hides the fact that the array is distributed. But when operations
such as \verb=add()= are requested, the corresponding pieces of the input
arrays must be operated over. Figure \ref{fig:1} helps illustrate this. Each
local piece of the array has its own shape (in parentheses) and knows its
portion of the distribution (in square brackets). Each local piece also knows
the global shape.

\includegraphics[width=0.47\textwidth]{image1_crop.eps}

Each local piece of the gain.ndarray has its own shape (in parentheses)
and knows its portion of the distribution (in square brackets). Each local
piece also knows the global shape.

\label{fig:1}
A fundamental design decision was whether to subclass \verb=ndarray= or to
provide a work-alike replacement for the entire \verb=numpy= module. The NumPy
documentation states that \verb=ndarray= implements \verb=__new__()= in order
to control array creation via constructor calls, view casting, and slicing.
Subclasses implement \verb=__new__()= for when the constructor is called
directly, and \linebreak\verb=__array_finalize__()= in order to set additional
attributes or further modify the object from which a view has been taken. One
can imagine an \verb=ndarray= subclass called \verb=gainarray= circumventing
the usual \verb=ndarray= base class memory allocation and instead allocating a
smaller \verb=ndarray= per process while retaining the global \verb=shape=.
One problem occurs with view casting -- with this approach the other
\verb=ndarray= subclasses know nothing of the distributed nature of the memory
within the \verb=gainarray=. NumPy itself is not designed to handle
distributed arrays. By design, ufuncs create an output array when one is not
specified. The first hook which NumPy provides is \verb=__array_prepare__()=
which is called \emph{after the output array has been created}. This means any
ufunc operation on one or more \verb=gainarray= instances without a specified
output would automatically allocate the entire output array on each process.
For this reason alone, we opted to reimplement the entire \verb=numpy= module,
controlling all aspects of array creation and manipulation to take into
account distributed arrays.
We present a new Python module, \verb=gain=, developed as part of the main
Global Arrays software distribution. The release of GA v5.0 contained Python
bindings based on the complete GA C API, available in the extension module
\verb=ga=. The GA bindings as well as the \verb=gain= module were developed
using Cython. With the upcoming release of GA v5.1, the module \verb=ga.gain=
is available as a drop-in replacement for NumPy. The goal of the
implementation is to allow users to write
\begin{minted}[]{python}
import ga.gain as numpy
\end{minted}
and then to execute their code using the MPI process manager
\begin{minted}[]{bash}
mpiexec -np 4 python script.py
\end{minted}

In order to succeed as a drop-in replacement, all attributes, functions,
modules, and classes which exist in \verb=numpy= must also exist within
\verb=gain=. Efforts were made to reuse as much of \verb=numpy= as possible,
such as its type system. As of GA v5.1, arrays of arbitrary fixed-size element
types and sizes can be created and individual fields of C \verb=struct= data
types accessed directly. GAiN is able to use the \verb=numpy= types when
creating the GA instances which back the \verb=gain.ndarray= instances. GAiN
is able to use an unmodified NumPy installation and will work with any
sufficiently recent NumPy release.
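For example, a small script of the following form could run serially under
NumPy and, assuming the functions it uses are among those implemented by
\verb=gain=, unchanged under \verb=mpiexec= after swapping the import:
\begin{minted}[]{python}
# example.py -- unchanged except for the import statement
import ga.gain as numpy

a = numpy.arange(1000000.0)
b = numpy.sqrt(a) + 2.0 * a
print b.sum()
\end{minted}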
GAiN follows the owner-computes rule \cite{Zim88}. The rule assigns each
computation to the processor that owns the data being computed. Figures
\ref{fig:2} and \ref{fig:3} illustrate the concept. For any array computation,
GAiN bases the computation on the output array. The processes owning portions
of the output array will acquire the corresponding pieces of the input
array(s) and then perform the computation locally, \emph{calling the original
NumPy routine} on the corresponding array portions. In some cases, for example
if the output array is a view created by a slicing operation, certain
processors will have no computation to perform.
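The following schematic captures the rule for an element-wise addition. It is
illustrative only, not the GAiN source; \verb=owned_region()= is a hypothetical
helper, while \verb=access()= and \verb=get()= stand in for the
\verb=gain.ndarray= methods of the same names.
\begin{minted}[]{python}
import numpy

def add(x, y, out):
    # owner-computes: only processes owning part of 'out' do any work
    local_out = out.access()            # local block of the output, if any
    if local_out is not None:
        region = out.owned_region()     # hypothetical: global indices owned here
        local_x = x.get(region)         # one-sided fetch of matching input pieces
        local_y = y.get(region)
        numpy.add(local_x, local_y, local_out)   # call the original NumPy routine
    out.sync()                          # hypothetical collective completion
\end{minted}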
\includegraphics[width=0.47\textwidth]{image3_crop.eps}

Add two arrays with the same data distribution. There are eight processors for
this computation. Following the owner-computes rule, each process owning a
piece of the output array (far right) retrieves the corresponding pieces from
the input arrays (left and middle). For example, the corresponding gold
elements will be computed locally on the owning process. Note that for this
computation, the data distribution is the same for both input arrays as well
as the output array such that communication can be avoided by using local data
access.

\includegraphics[width=0.47\textwidth]{image2_crop.eps}

Add two sliced arrays. There are eight processors for this computation. The
elements in blue were removed by a slice operation. Following the
owner-computes rule, each process owning a piece of the output array (far
right) retrieves the corresponding pieces from the sliced input arrays (left
and middle). For example, the corresponding gold elements will be computed
locally on the owning process. Similarly for the copper elements. Note that
for this computation, the data for each array is not equivalently distributed,
which will result in communication.
The GAiN implementation of the \verb=ndarray= implements a few important
concepts including the dual nature of a global array and its individual
distributed pieces, slice arithmetic, and separating collective operations
from one-sided operations. When a \verb=gain.ndarray= is created, it creates a
Global Array of the same shape and type and stores the GA integer handle. The
distribution on a given process can be queried using \verb=ga.distribution()=.
The other important attribute of the \verb=gain.ndarray= is the
\verb=global_slice=. The \verb=global_slice= begins as a list of \verb=slice=
objects based on the original \verb=shape= of the array.

\begin{minted}[]{python}
self.global_slice = [slice(0,x,1) for x in shape]
\end{minted}
Slicing a \verb=gain.ndarray= must return a view just like slicing a
\verb=numpy.ndarray= returns a view. The approach taken is to apply the
\verb=key= of the \verb=__getitem__(key)= request to the \verb=global_slice=
and store the new \verb=global_slice= on the newly created view. We call this
type of operation slice arithmetic. First, the \verb=key= is canonicalized,
meaning an \verb=Ellipsis= is replaced with \verb=slice(0,dim_max,1)= for each
dimension it represents, all \verb=slice= instances are replaced with the
results of calling \verb=slice.indices()=, and all negative index values are
replaced with their positive equivalents. This step ensures that the length of
the \verb=key= is compatible with and based on the current shape of the array.
This enables consistent slice arithmetic on the canonicalized keys. Slice
arithmetic effectively produces a new \verb=key= which, when applied to the
same original array, produces the same result as if the original sequence of
keys had been applied in order. Figures \ref{fig:slice1}
and \ref{fig:figslice2} illustrate this concept.
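A sketch of canonicalizing a single-axis key against the current dimension
length, using \verb=slice.indices()= as described above (the multi-dimensional
\verb=Ellipsis= expansion is simplified to one axis here):
\begin{minted}[]{python}
def canonicalize(key, dim_max):
    if key is Ellipsis:
        return slice(0, dim_max, 1)
    if isinstance(key, slice):
        return slice(*key.indices(dim_max))   # resolve defaults and negatives
    if key < 0:
        key += dim_max                        # negative single index
    return key

print canonicalize(slice(None, None, 2), 10)   # slice(0, 10, 2)
print canonicalize(slice(-3, None, None), 10)  # slice(7, 10, 1)
print canonicalize(-1, 10)                     # 9
\end{minted}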
\includegraphics[width=0.47\textwidth]{image4a_crop.eps}

Slice arithmetic example 1. Array $b$ could be created either using the
standard notation (top middle) or using the canonicalized form (bottom
middle). Array $c$ could be created by applying the standard notation (top
right) or by applying the equivalent canonical form (bottom right) to the
original array.

\label{fig:slice1}
\includegraphics[width=0.47\textwidth]{image4b_crop.eps}

Slice arithmetic example 2. See the caption of Figure \ref{fig:slice1} for
details.

\label{fig:figslice2}
When performing calculations on a \verb=gain.ndarray=, the current
\verb=global_slice= is queried when accessing the local data or fetching
remote data such that an appropriate \verb=ndarray= data block is returned.
Accessing local data and fetching remote data is performed by the
\verb=gain.ndarray.access()= and \verb=gain.ndarray.get()= methods,
respectively. Figure \ref{fig:accessget} illustrates how \verb=access()= and
\verb=get()= are used. The \verb=ga.access()= function on which\linebreak
\verb=gain.ndarray.access()= is based will always return the entire block
owned by the calling process. The returned piece must be further sliced to
appropriately match the current \verb=global_slice=. The
\verb=ga.strided_get()= function on which the \verb=gain.ndarray.get()= method
is based will fetch data from other processes without the remote processes'
cooperation, i.e. using one-sided communication. The calling process specifies
the region to fetch based on the current view's \verb=shape= of the array. The
\verb=global_slice= is adjusted to match the requested region using slice
arithmetic and then transformed into a \verb=ga.strided_get()= request based
on the global, original shape of the array.
\includegraphics[width=0.47\textwidth]{image5_crop.eps}

\texttt{access()} and \texttt{get()} examples. The current
\texttt{global\_slice}, indicated by blue array elements, is respected in
either case. A process can access its local data block for a given array (red
highlight). Note that \texttt{access()} returns the entire block, including
the sliced elements. Any process can fetch any other process's data using
\texttt{get()} with respect to the current \texttt{shape} of the array (blue
highlight). Note that the fetched block will not contain the sliced elements,
reducing the amount of data communicated.

\label{fig:accessget}
Recall that GA allows the contiguous, process-local data to be accessed using
\verb=ga.access()= which returns a C-contiguous \verb=ndarray=. However, if
the \verb=gain.ndarray= is a view created by a slice, the data which is
accessed will be contiguous while the view is not. Based on the distribution
of the process-local data, a new slice object is created from the
\verb=global_slice= and applied to the accessed \verb=ndarray=, effectively
having applied first the \verb=global_slice= on the global representation of
the distributed array followed by a slice representing the process-local
portion.
After process-local data has been accessed and sliced as needed, the process
must then fetch the remote data. This is again done using \verb=ga.get()= or
\verb=ga.strided_get()= as above. Recall that one-sided communication, as
opposed to two-sided communication, does not require the cooperation of the
remote process(es). The local process simply fetches the corresponding array
section by performing a similar transformation to the target array's
\verb=global_slice= as was done to access the local data, and then translates
the modified \verb=global_slice= into the proper arguments for \verb=ga.get()=
if the \verb=global_slice= does not contain any \verb=step= values greater
than one, or \verb=ga.strided_get()= if the \verb=global_slice= contains
\verb=step= values greater than one.
One limitation of using GA is that GA does not allow negative stride values
corresponding to the negative \verb=step= values allowed for Python sequences
and NumPy arrays. Supporting negative \verb=step= values for GAiN required
special care -- when a negative \verb=step= is encountered during a slice
operation, the slice is applied as usual. However, prior to accessing or
fetching data, the slice is inverted from a negative \verb=step= to a positive
\verb=step= and the \verb=start= and \verb=stop= values are updated
appropriately. The \verb=ndarray= which results from accessing or fetching
based on the inverted slice is then re-inverted, creating the correct view of
the requested data.
Another limitation of using GA is that the data distribution cannot be changed
once an array is created. This complicates such useful functionality as
\verb=numpy.reshape()=. Currently, GAiN must make a copy of the array instead
of a view when altering the shape of an array.
Translating the \verb=numpy.flatiter= class, which assumes a single address
space while translating an N-dimensional array into a 1D array, into a
distributed form was made simpler by the use of \verb=ga.gather()= and
\verb=ga.scatter()=. These two routines allow individual data elements within
a GA to be fetched or updated. Flattening a distributed N-dimensional array
which had been distributed in blocked fashion will cause the blocks to become
discontiguous. Figure \ref{fig:flatten} shows how a $6 \times 6$ array
might be distributed and flattened. The \verb=ga.get()= operation assumes the
requested patch has the same number of dimensions as the array from which the
patch is requested. Reshaping, in general, is made difficult by GA and its
lack of a redistribute capability. However, in this case, we can use
\verb=ga.gather()= and \verb=ga.scatter()= to fetch and update, respectively,
any array elements in any order. \verb=ga.gather()= takes a 1D array-like of
indices to fetch and returns a 1D \verb=ndarray= of values. Similarly,
\verb=ga.scatter()= takes a 1D array-like of indices to update and a 1D
array-like buffer containing the values to use for the update. If a
\verb=gain.flatiter= is used as the output of an operation, following the
owner-computes rule is difficult. Instead, pseudo-owners are assigned to
contiguous slices of the 1D view. These pseudo-owners gather their own
elements as well as the corresponding elements of the other inputs, compute
the result, and scatter the result back to their own elements. This results in
additional communication which is otherwise avoided by true adherence to the
owner-computes rule. To avoid this inefficiency, there are some cases where
operating over \verb=gain.flatiter= instances can be optimized. For example,
with \verb=gain.dot()=, if the same \verb=flatiter= is passed as both inputs,
the \verb=base= of the \verb=flatiter= is instead multiplied together
element-wise and then \verb=gain.sum()= is taken of the resulting array.
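The index translation behind such a flattened view can be sketched with plain
NumPy: a pseudo-owner's contiguous 1D range maps to discontiguous
N-dimensional indices, which could then be handed to \verb=ga.gather()= or
\verb=ga.scatter()= (that call is only indicated in a comment, since its exact
signature is assumed).
\begin{minted}[]{python}
import numpy as np

shape = (6, 6)
flat_lo, flat_hi = 9, 18              # this pseudo-owner's contiguous 1D slice

flat = np.arange(flat_lo, flat_hi)
nd_indices = np.column_stack(np.unravel_index(flat, shape))
print nd_indices[:3]                  # [[1 3] [1 4] [1 5]]: discontiguous in 2D
# values = ga.gather(g_a, nd_indices)   # assumed call to perform the actual fetch
\end{minted}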
\includegraphics[width=0.47\textwidth]{image6_crop.eps}

Flattening a 2D distributed array. The block owned by a process becomes
discontiguous when representing the 2D array in one dimension.

\label{fig:flatten}
Not all NumPy programs are suitable for SPMD execution. If the NumPy program
has made assumptions about a single process environment, there must be a
separation between Python running as a single process and a parallel back-end
where the parallel computation is performed. For example, the program might
open a connection to a database. Doing so with multiple processes, and in
addition having each process make multiple updates to that database, would be
disastrous. Explicit knowledge on the user's part of the parallel nature of
the execution environment would be needed to mitigate database access.

NumPy programs unsuitable for SPMD execution can be handled by utilizing
master/slave parallelism. The original program assumes the role of master
while any expression involving a \texttt{gain.ndarray} is serialized and sent
to an SPMD parallel back-end. Unless the data being operated on already exists
on the slaves, both the data and the function to operate on the data must be
sent. To accomplish this master/slave parallelism, we use the process
management features of the MPI-2 specification as well as custom Python code
to serialize the objects.
The success of GAiN hinges on its ability to enable distributed array
processing in NumPy, to transparently enable this processing, and most
importantly to efficiently accomplish those goals. Performance Python
\cite{Ram08} ``perfpy'' was conceived to demonstrate the ways Python can be
used for high performance computing. It evaluates NumPy and the relative
performance of various Python extensions to NumPy. It represents an important
benchmark by which any additional high performance numerical Python module
should be measured. The original program \verb=laplace.py= was modified by
importing \verb=ga.gain= in place of \verb=numpy= and then stripping the
additional test codes so that only the \verb=gain= (\verb=numpy=) test
remained. The latter modification has no impact on the timing results since
all tests are run independently but was necessary because \verb=gain= is run
on multiple processes while the original test suite is serial. The program
was run on the Chinook supercomputer at the Environmental Molecular Sciences
Laboratory, part of Pacific Northwest National Laboratory. Chinook consists
of 2310 HP DL185 nodes with dual-socket, 64-bit, quad-core AMD 2.2 GHz Opteron
processors. Each node has 32 Gbytes of memory, or 4 Gbytes per core. Fast
communication between the nodes is obtained using a single rail InfiniBand
interconnect from Voltaire (switches) and Mellanox (NICs). The system runs a
version of Linux based on Red Hat Linux Advanced Server. GAiN utilized up to
512 nodes of the cluster, using 4 cores per node.
In Figure \ref{fig:laplace}, GAiN is shown to scale up to 2K cores on a modest
problem size. However, the startup cost of the Python interpreter became
prohibitively large such that a 4K core test could not be run in a reasonable
amount of time. It is encouraging that the implementation of GAiN allows the
algorithm to scale so long as the Python interpreter can be sufficiently
modified to scale, as well.

GAiN is also able to run on problems which are not feasible on workstations.
For example, to store one $100{,}000 \times 100{,}000$ matrix of
double-precision numbers requires approximately 75 GB.
\includegraphics[width=0.47\textwidth]{laplace.eps}

\texttt{laplace.py} for N=10,000 and N=100,000. For N=10,000, one matrix
of double-precision numbers is approximately 0.75 GB. For this problem, GAiN
scales up to 2K cores in terms of the solver algorithm; however, the Python
interpreter fails to scale beyond a few hundred cores in this case. For
N=100,000, one matrix of double-precision numbers is approximately 75 GB. In
addition to handling this large-scale problem, GAiN continues to scale up to
2K cores. Again, the startup cost of the Python interpreter prohibited the 4K
run from starting in a reasonable amount of time.

\label{fig:laplace}
During the evaluation, it was noted that a lot of time was spent within global
synchronization calls, e.g. \verb=ga.sync()=. The source of the calls was
traced to, among other places, the vast number of temporary arrays getting
created. Using GA statistics reporting, the original \verb=laplace.py= code
created 912 arrays and destroyed 910. Given this staggering figure, an array
cache was created. The cache is based on a Python \verb=dict= using the shape
and type of the arrays as the keys and stores discarded GA instances
represented by the GA integer handle. The number of GA handles stored per
shape and type is referred to as the cache depth. The \verb=gain.ndarray=
instances are discarded as usual. Utilizing the cache spares the GA memory
from many allocations and deallocations but, more importantly, avoids many
synchronization calls. Three cache depths were tested, as shown in Table
\ref{tab:cache}. The trade-off of using this cache is that if the arrays used
by an application vary wildly in size or type, this cache will consume too
much memory. Other heuristics could be developed to keep the cache from using
too much memory, e.g. capping the total size of the cache, removing the least
used arrays, or removing the least recently used arrays. Based on the success
of the GA cache, it is currently used by GAiN.
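A simplified sketch of such a cache (not the actual GAiN code; the
\verb=ga.create()= and \verb=ga.destroy()= signatures are assumptions), keyed
by shape and type with a fixed depth per key:
\begin{minted}[]{python}
class GACache(object):
    def __init__(self, depth=3):
        self.depth = depth
        self.cache = {}                   # (shape, typecode) -> list of GA handles

    def acquire(self, shape, typecode):
        handles = self.cache.get((shape, typecode), [])
        if handles:
            return handles.pop()          # reuse a discarded global array
        return ga.create(typecode, shape) # otherwise create one (collective)

    def release(self, handle, shape, typecode):
        handles = self.cache.setdefault((shape, typecode), [])
        if len(handles) < self.depth:
            handles.append(handle)        # keep for later reuse
        else:
            ga.destroy(handle)            # evict (collective)
\end{minted}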
%\begin{tabular}{|l|l|l|}
\begin{tabular}{l|l|l}
 & Create & Destroy \\
\hline
No Cache & 912 & 910 \\
Depth-1 Cache & 311 & 306 \\
Depth-2 Cache & 110 & 102 \\
Depth-3 Cache & 11 & 1 \\
\end{tabular}

How array caching affects GA array creation/destruction counts when running
\texttt{laplace.py} for 100 iterations. The smaller numbers indicate better
reuse of GA memory and avoidance of global synchronization calls, at the
expense of using additional memory.

\label{tab:cache}
GAiN succeeds in its ability to grow problem sizes beyond a single compute
node. The performance of the perfpy code and the ability to drop in GAiN
without modification of the core implementation demonstrate its utility. As
described previously, GAiN allows certain classes of existing NumPy programs
to run using GAiN with sometimes as little effort as changing the import
statement, immediately taking advantage of the ability to run in a cluster
environment. Once a smaller-sized program has been developed and tested on a
desktop computer, it can then be run on a cluster with very little effort.
GAiN provides the groundwork for large distributed multidimensional arrays
within NumPy programs.
\section{Future Work}

GAiN is not a complete implementation of the NumPy API nor does it represent
the only way in which distributed arrays can be achieved for NumPy.
Alternative parallelization strategies besides the owner-computes rule should
be explored. GA allows for the get-compute-put model of computation where
ownership of data is largely ignored while data movement costs are increased.
However, the get-compute-put model of computation may allow an algorithm to
mix communication and computation, improving overall performance as is done
with the SRUMMA dense matrix multiplication algorithm \cite{Kri04}.

Task parallelism could also be explored if load balancing becomes an issue.
The Scioto framework \cite{Din08} interoperates with MPI and GA and provides
locality-aware dynamic load balancing via management of shared collections of
task objects. Certain NumPy functionality may better lend itself to task
parallelism instead of owner-computes. Task parallelism decouples data
ownership from the computation and is more akin to the get-compute-put model.
Further, GAiN does not support block cyclic data distributions. Block cyclic
distributions may alleviate load imbalance but may also incur additional
communication.

The GA cache should be exposed as a tunable parameter since the current
heuristic of reusing arrays with the same type and shape does not apply to
dynamic applications. For example, while not a true application, the current
test suite for GAiN creates many arrays of different shapes and types. Most of
those arrays are not reused. Applications of this nature may wish to disable
GA caching. Alternative temporary array creation strategies and heuristics
could also be developed.
Although the GA cache is meant to solve the problem of temporary array
creation which occurs through normal use of the NumPy API, it does not solve
the problem of combining arithmetic operations. Packages such as
\emph{numexpr} \cite{Coo11} and \emph{Metagraph} \cite{Wan11} evaluate
multiple-operator NumPy expressions many times faster than NumPy. numexpr
works by combining arithmetic operations, reducing temporary array creation,
and utilizing vectorizing libraries. A similar approach could be utilized by
GAiN. The approach taken by numexpr is similar to the approach used by GpuPy
discussed earlier. GpuPy utilizes lazy evaluation to store NumPy functions as
an abstract syntax tree to combine arithmetic operations on a GPU.
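For reference, the serial numexpr usage pattern looks like the following; a
distributed analogue in GAiN would need to perform the same expression fusion
over local blocks.
\begin{minted}[]{python}
import numpy as np
import numexpr as ne

a = np.arange(1e6)
b = np.arange(1e6)

c1 = 2.0*a + a*b                   # NumPy: several passes and temporaries
c2 = ne.evaluate("2.0*a + a*b")    # numexpr: one fused, blocked evaluation
assert np.allclose(c1, c2)
\end{minted}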
Although the master/slave capabilities of GAiN allow for most NumPy
applications to parallelize their NumPy portions alone, utilizing the MPI-2
specification may not be the preferred approach. The distributed computing
capabilities of the IPython project may provide another means of separating
the serial from the parallel. The approach taken by DistNumPy does not
involve MPI-2 dynamic process management but rather has all processes besides
the master process block for task messages during the module import. There may
be other ways to separate the serial from the parallel which don't involve
dynamic process management, such as processor groups or careful use of
branching within the code, effectively emulating a master/slave situation like
the one used by DistNumPy.

The scalability of the Python interpreter remains an open problem. This is the
single greatest hurdle to running Python applications at scale. There is a
potential solution for diskless systems such as BlueGene/P machines provided
by \cite{Scu11}. Clusters with local disks mounted to their compute nodes may
find other approaches valuable, as well.
%\section{Appendix Title}

%This is the text of the appendix, if you need one.

A portion of the research was performed using the Molecular Science Computing
(MSC) capability at EMSL, a national scientific user facility sponsored by the
Department of Energy's Office of Biological and Environmental Research and
located at Pacific Northwest National Laboratory (PNNL). PNNL is operated by
Battelle for the U.S. Department of Energy under contract DE-AC05-76RL01830.
% We recommend abbrvnat bibliography style.
%\bibliographystyle{abbrvnat}
%\bibliography{bibliography}

\begin{thebibliography}{36}
\providecommand{\natexlab}[1]{#1}
\providecommand{\url}[1]{\texttt{#1}}
\expandafter\ifx\csname urlstyle\endcsname\relax
\providecommand{\doi}[1]{doi: #1}\else
\providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi

\bibitem[Pnl(2011)]{Pnl11}
Global arrays webpage.
\newblock http://www.emsl.pnl.gov/docs/global/, 2011.

\bibitem[Apr\'{a} et~al.(2009)Apr\'{a}, Rendell, Harrison, Tipparaju, deJong,
  and Xantheas]{Apr09}
E.~Apr\'{a}, A.~P. Rendell, R.~J. Harrison, V.~Tipparaju, W.~A. deJong, and
  S.~S. Xantheas.
\newblock Liquid water: obtaining the right answer for the right reasons.
\newblock In \emph{SC}. ACM, 2009.
\newblock ISBN 978-1-60558-744-8.

\bibitem[Ascher et~al.(1999)Ascher, Dubois, Hinsen, Hugunin, and
  Oliphant]{Asc99}
D.~Ascher, P.~F. Dubois, K.~Hinsen, J.~Hugunin, and T.~Oliphant.
\newblock Numerical python.
\newblock Technical Report UCRL-MA-128569, Lawrence Livermore National
  Laboratory, 1999.

\bibitem[Behnel et~al.(2011)Behnel, Bradshaw, Citro, Dalcin, Seljebotn, and
  Smith]{Beh11}
S.~Behnel, R.~Bradshaw, C.~Citro, L.~Dalcin, D.~S. Seljebotn, and K.~Smith.
\newblock Cython: The best of both worlds.
\newblock \emph{Computing in Science Engineering}, 13\penalty0 (2):\penalty0
  31--39, 2011.

\bibitem[Buyya(1999)]{Buy99}
R.~Buyya.
\newblock \emph{High Performance Cluster Computing: Architectures and Systems}.
\newblock Prentice Hall PTR, Upper Saddle River, NJ, USA, 1999.
\newblock ISBN 0130137847.

\bibitem[Cooke and Hochberg()]{Coo11}
D.~Cooke and T.~Hochberg.
\newblock numexpr: Fast numerical expression evaluator for numpy.
\newblock \url{http://code.google.com/p/numexpr/}.

\bibitem[Daily(2009)]{Dai09}
J.~Daily.
\newblock GAiN: Distributed array computation with python.
\newblock Technical Report PNNL-18355, Pacific Northwest National Laboratory,
  2009.

\bibitem[Daily and Lewis(2011)]{Dai11}
J.~Daily and R.~R. Lewis.
\newblock Using the global arrays toolkit to reimplement numpy for distributed
  computation.
\newblock In \emph{Proceedings of the 10th Python in Science Conference}, 2011.

\bibitem[Dalc\'{\i}n et~al.(2005)Dalc\'{\i}n, Paz, and Storti]{Dal05}
L.~Dalc\'{\i}n, R.~Paz, and M.~Storti.
\newblock Mpi for python.
\newblock \emph{J. Parallel Distrib. Comput.}, 65:\penalty0 1108--1115,
  September 2005.
\newblock ISSN 0743-7315.
\newblock \doi{10.1016/j.jpdc.2005.03.010}.

\bibitem[Dalc\'{\i}n et~al.(2008)Dalc\'{\i}n, Paz, Storti, and
  D'El\'{\i}a]{Dal08}
L.~Dalc\'{\i}n, R.~Paz, M.~Storti, and J.~D'El\'{\i}a.
\newblock Mpi for python: Performance improvements and mpi-2 extensions.
\newblock \emph{J. Parallel Distrib. Comput.}, 68:\penalty0 655--662, May 2008.
\newblock ISSN 0743-7315.
\newblock \doi{10.1016/j.jpdc.2007.09.005}.

\bibitem[Dinan et~al.(2008)Dinan, Krishnamoorthy, Larkins, Nieplocha, and
  Sadayappan]{Din08}
J.~Dinan, S.~Krishnamoorthy, D.~B. Larkins, J.~Nieplocha, and P.~Sadayappan.
\newblock Scioto: A framework for global-view task parallelism.
\newblock In \emph{ICPP}, pages 586--593. IEEE Computer Society, 2008.

\bibitem[Dotsenko et~al.(2004)Dotsenko, Coarfa, and Mellor-Crummey]{Dot04}
Y.~Dotsenko, C.~Coarfa, and J.~Mellor-Crummey.
\newblock A multi-platform co-array fortran compiler.
\newblock In \emph{Proceedings of the 13th International Conference on Parallel
  Architectures and Compilation Techniques}, PACT '04, pages 29--40,
  Washington, DC, USA, 2004. IEEE Computer Society.
\newblock ISBN 0-7695-2229-7.
\newblock \doi{10.1109/PACT.2004.3}.

\bibitem[Dubois et~al.(1996)Dubois, Hinsen, and Hugunin]{Dub96}
P.~F. Dubois, K.~Hinsen, and J.~Hugunin.
\newblock Numerical python.
\newblock \emph{Computers in Physics}, 10, May/June 1996.

\bibitem[Edelman(2007)]{Ede07}
A.~Edelman.
\newblock The star-p high performance computing platform.
\newblock In \emph{Acoustics, Speech and Signal Processing, 2007. ICASSP 2007.
  IEEE International Conference on}, volume~4, pages IV--1197--IV--1200, April
  2007.
\newblock \doi{10.1109/ICASSP.2007.367290}.

\bibitem[Eitzen(2007)]{Eit07}
B.~Eitzen.
\newblock Gpupy: Efficiently using a gpu with python.
\newblock Master's thesis, Washington State University, 2007.

\bibitem[Gropp et~al.(1999{\natexlab{a}})Gropp, Lusk, and Skjellum]{Gro99a}
W.~Gropp, E.~Lusk, and A.~Skjellum.
\newblock \emph{Using MPI: Portable Parallel Programming with the
  Message-Passing Interface}.
\newblock MIT Press, 2 edition, November 1999{\natexlab{a}}.
\newblock ISBN 978-0-262-57132-6.

\bibitem[Gropp et~al.(1999{\natexlab{b}})Gropp, Lusk, and Thakur]{Gro99b}
W.~Gropp, E.~Lusk, and R.~Thakur.
\newblock \emph{Using MPI-2: Advanced Features of the Message-Passing
  Interface}.
\newblock MIT Press, November 1999{\natexlab{b}}.
\newblock ISBN 978-0-262-57133-3.

\bibitem[Harrison(1999)]{Har99}
R.~J. Harrison.
\newblock Global arrays python interface.
\newblock \url{http://www.emsl.pnl.gov/docs/global/old/pyGA}, December 1999.

\bibitem[Husbands and Isbell(1999)]{Hus98}
P.~Husbands and C.~Isbell.
\newblock The parallel problems server: A client-server model for interactive
  large scale scientific computation.
\newblock In \emph{Selected Papers and Invited Talks from the Third
  International Conference on Vector and Parallel Processing}, VECPAR '98,
  pages 156--169, London, UK, 1999. Springer-Verlag.
\newblock ISBN 3-540-66228-6.

\bibitem[Palm(2007)]{Pal07}
W.~J. Palm~III.
\newblock \emph{A Concise Introduction to Matlab}.
\newblock McGraw-Hill Science/Engineering/Math, October 2007.
\newblock ISBN 978-0073385839.

\bibitem[Krishnan and Nieplocha(2004)]{Kri04}
M.~Krishnan and J.~Nieplocha.
\newblock Srumma: a matrix multiplication algorithm suitable for clusters and
  scalable shared memory systems.
\newblock In \emph{Proceedings of the Parallel and Distributed Processing
  Symposium}, 2004.

\bibitem[Kristensen and Vinter(2010)]{Kri10}
M.~R.~B. Kristensen and B.~Vinter.
\newblock {Numerical Python for Scalable Architectures}.
\newblock In \emph{Fourth Conference on Partitioned Global Address Space
  Programming Model, PGAS{'}10}. ACM, 2010.

\bibitem[Lundh(2001)]{Lun01}
F.~Lundh.
\newblock \emph{Python Standard Library}.
\newblock O'Reilly Media, May 2001.
\newblock ISBN 978-0-596-00096-7.

\bibitem[Nieplocha et~al.(2000)Nieplocha, Ju, and Straatsma]{Nie00}
J.~Nieplocha, J.~Ju, and T.~P. Straatsma.
\newblock A multiprotocol communication support for the global address space
  programming model on the ibm sp.
\newblock In \emph{Proceedings from the 6th International Euro-Par Conference
  on Parallel Processing}, Euro-Par '00, pages 718--728, London, UK, 2000.
  Springer-Verlag.
\newblock ISBN 3-540-67956-1.
\newblock URL \url{http://portal.acm.org/citation.cfm?id=646665.699092}.

\bibitem[Nieplocha et~al.(2005{\natexlab{a}})Nieplocha, Krishnan, Palmer,
  Tipparaju, and Zhang]{Nie05}
J.~Nieplocha, M.~Krishnan, B.~Palmer, V.~Tipparaju, and Y.~Zhang.
\newblock Exploiting processor groups to extend scalability of the ga shared
  memory programming model.
\newblock In \emph{Proceedings of the 2nd conference on Computing frontiers},
  CF '05, pages 262--272, New York, NY, USA, 2005{\natexlab{a}}. ACM.
\newblock ISBN 1-59593-019-1.
\newblock \doi{10.1145/1062261.1062305}.
\newblock URL \url{http://doi.acm.org/10.1145/1062261.1062305}.

\bibitem[Nieplocha et~al.(2005{\natexlab{b}})Nieplocha, Palmer, Krishnan,
  Trease, and Apr\'{a}]{Nie06}
J.~Nieplocha, B.~Palmer, M.~Krishnan, H.~Trease, and E.~Apr\'{a}.
\newblock Advances, applications and performance of the global arrays shared
  memory programming toolkit.
\newblock \emph{Intern. J. High Perf. Comp. Applications}, 20,
  2005{\natexlab{b}}.

\bibitem[Nieplocha et~al.(2010)Nieplocha, Krishnan, Palmer, Tipparaju, and
  Ju]{Nie10}
J.~Nieplocha, M.~Krishnan, B.~Palmer, V.~Tipparaju, and J.~Ju.
\newblock The global arrays user's manual.
\newblock \url{http://www.emsl.pnl.gov/docs/global/um/UG-PDF/GA-UserManual-Main.pdf},
  2010.

\bibitem[Oliphant(2006)]{Oli06}
T.~E. Oliphant.
\newblock Guide to numpy.
\newblock \url{http://www.tramy.us/}, 2006.

\bibitem[Panuganti et~al.(2006)Panuganti, Baskaran, Hudak, Krishnamurthy,
  Nieplocha, Rountev, and Sadayappan]{Pan06}
R.~Panuganti, M.~M. Baskaran, D.~E. Hudak, A.~Krishnamurthy, J.~Nieplocha,
  A.~Rountev, and P.~Sadayappan.
\newblock Gamma: Global arrays meets matlab.
\newblock Technical Report OSU-CISRC-1/06-TR15, Ohio State University, 2006.

\bibitem[Perez et~al.(2009)Perez, Langtangen, and LeVeque]{Per09}
F.~Perez, H.~P. Langtangen, and R.~LeVeque.
\newblock {CSE 2009: Python for Scientific Computing at CSE 2009}.
\newblock SIAM News, 2009.
\newblock URL \url{http://www.siam.org/news/news.php?id=1595}.

\bibitem[P\'{e}rez and Granger(2007)]{Per07}
F.~P\'{e}rez and B.~E. Granger.
\newblock Ipython: A system for interactive scientific computing.
\newblock \emph{Computing in Science and Engineering}, 9:\penalty0 21--29,
  2007.
\newblock ISSN 1521-9615.
\newblock \doi{10.1109/MCSE.2007.53}.

\bibitem[Ramachandran(2008)]{Ram08}
P.~Ramachandran.
\newblock Performance python.
\newblock \url{http://www.scipy.org/PerformancePython}, May 2008.

\bibitem[Rasmussen et~al.(2004)Rasmussen, Sottile, Nieplocha, Numrich, and
  Jones]{Ras04}
C.~E. Rasmussen, M.~J. Sottile, J.~Nieplocha, R.~W. Numrich, and E.~Jones.
\newblock Co-array python: A parallel extension to the python language.
\newblock In \emph{Euro-Par 2004}, pages 632--637. Springer-Verlag Berlin
  Heidelberg, 2004.

\bibitem[Scullin and Ahmadia()]{Scu11}
W.~Scullin and A.~Ahmadia.
\newblock Walla.
\newblock \url{https://bitbucket.org/wscullin/walla/overview}.

\bibitem[Wang(2011)]{Wan11}
P.~Wang.
\newblock Metagraph: A declarative graphing system for python.
\newblock In \emph{Proceedings of the 10th Python in Science Conference}.
\newblock To appear.

\bibitem[Zima et~al.(1988)Zima, Bast, and Gerndt]{Zim88}
H.~P. Zima, H.-J. Bast, and M.~Gerndt.
\newblock Superb: A tool for semi-automatic mimd/simd parallelization.
\newblock \emph{Parallel Computing}, 6:\penalty0 1--18, 1988.
\newblock \doi{10.1016/0167-8191(88)90002-6}.

\end{thebibliography}

% The bibliography should be embedded for final submission.