\documentclass{sig-alt-release2}

% --- Author Metadata here ---
\conferenceinfo{SC'11 Companion,} {November 12--18, 2011, Seattle, Washington, USA.}
\crdata{978-1-4503-1030-7/11/11}
% --- End of Author Metadata ---
\title{Poster: Automatic Parallelization of Numerical Python Applications Using the Global Arrays Toolkit}
\auwidth=1.12\textwidth
\affaddr{Pacific Northwest National Laboratory}\\
\affaddr{P.O. Box 999}\\
\affaddr{Richland, WA}\\
\email{jeff.daily@pnnl.gov}
\affaddr{School of Electrical Engineering and Computer Science}\\
\affaddr{Washington State University Tricities}\\
\affaddr{2710 Crimson Way}\\
\affaddr{Richland, WA 99354}\\
\email{bobl@tricity.wsu.edu}
%\and % use '\and' if you need 'another row' of author names
% citations aren't typically found in an abstract, right?
\begin{abstract}
Global Arrays is a software system from Pacific Northwest National Laboratory
that provides an efficient, portable, and parallel shared-memory programming
interface for manipulating distributed dense arrays. The NumPy module is the
de facto standard for numerical calculation in the Python programming
language, a language whose use is growing rapidly in the scientific and
engineering communities. NumPy provides a powerful N-dimensional array class
as well as other scientific computing capabilities. However, like the majority
of the core Python modules, NumPy is inherently serial. Using a combination of
Global Arrays and NumPy, we have reimplemented NumPy as a distributed drop-in
replacement called Global Arrays in NumPy (GAiN). Serial NumPy applications
can become parallel, scalable GAiN applications with only minor source code
changes. We present scalability studies of several GAiN applications,
demonstrating the utility of developing serial NumPy codes that can later run
on more capable clusters or supercomputers.
\end{abstract}
\category{D.1.3}{Programming Techniques}{Concurrent Programming}
\category{D.2.2}{Software Engineering}{Design Tools and Techniques}[software libraries, user interfaces]
%\category{D.2.8}{Software Engineering}{Metrics}
\category{E.1}{Data Structures}{arrays, distributed data structures}

\terms{Design, Performance}

\keywords{Global Arrays, PGAS, Python, NumPy, GAiN}
\section{Introduction}
Scientific computing with Python\cite{Ros90} typically involves using the
NumPy\cite{Oli06} package. NumPy provides an efficient multi-dimensional
array class and array processing routines. Unfortunately, like many Python
programs, NumPy is serial in nature. This limits both the size of the arrays
and the speed with which they can be processed to the resources available on
a single compute node.
%For the most part, NumPy programs are written, debugged, and run in
%singly-threaded environments. This may be sufficient for certain problem
%domains. However, NumPy may also be used to develop prototype software. Such
%software is usually ported to a different, compiled language and/or explicitly
%parallelized to take advantage of additional hardware.
The Global Arrays (GA) toolkit \cite{Nie06,Pnl11} is a software system
from Pacific Northwest National Laboratory (PNNL) that enables an efficient,
portable, and parallel shared-memory programming interface to manipulate
physically distributed dense multidimensional arrays, without the need for
explicit cooperation by other processes. GA complements the message-passing
programming model and is compatible with MPI\cite{Gro99a}, so the
programmer can use both in the same program. GA has supported Python bindings
since version 5.0. GA has been leveraged in several large computational
chemistry codes and has been shown to scale well \cite{Apr09}.
Global Arrays in NumPy (GAiN)\cite{Dai09,Dai11} is an extension to Python that
provides parallel, distributed processing of arrays. It implements a subset of
the NumPy API, so some programs can take advantage of parallel processing
transparently simply by importing GAiN in place of NumPy; other programs may
require slight modification. This allows those programs to take advantage of
the additional cores available on single compute nodes and to increase
problem sizes by distributing arrays across clustered environments.
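The drop-in usage can be sketched in a few lines. This is an illustration, not
code from the paper: the \texttt{ga.gain} module name comes from the text,
while the serial fallback and the toy computation are our own.

```python
# Sketch of the drop-in replacement described above.  The ga.gain module
# name follows the text; the try/except fallback is illustrative only.
try:
    import ga.gain as np   # parallel, distributed arrays via GAiN
except ImportError:
    import numpy as np     # serial NumPy when GAiN is unavailable

# The rest of the program is unchanged NumPy code.
a = np.ones((512, 512))
b = 2.0 * a + 1.0
print(b.sum())
```

Under GAiN the same slicing and arithmetic operate on arrays distributed
across processes; under NumPy they run serially on one node.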
The success of GAiN hinges on its ability to enable distributed array
processing in NumPy, to enable this processing transparently, and most
importantly to accomplish those goals efficiently. Performance Python
\cite{Ram08} ``perfpy'' was conceived to demonstrate the ways Python can be
used for high performance computing. It evaluates NumPy and the relative
performance of various Python extensions to NumPy. It represents an important
benchmark by which any additional high performance numerical Python module
should be measured. The original program \texttt{laplace.py} was modified by
importing \texttt{ga.gain} in place of \texttt{numpy} and then stripping the
additional test codes so that only the \texttt{gain} (\texttt{numpy}) test
remained.
%The latter modification makes no impact on the timing results since all tests
%are run independently but was necessary because \texttt{gain} is run on
%multiple processes while the original test suite is serial.
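The kernel being timed is, in essence, a finite-difference relaxation over
array slices. The following is a simplified NumPy sketch of such a solver
loop, not the exact perfpy code; grid size, iteration count, and boundary
values are illustrative. The same slicing code runs under GAiN when
\texttt{ga.gain} is imported in place of \texttt{numpy}.

```python
import numpy as np

def solve(grid, iterations=100):
    """Jacobi-style relaxation of Laplace's equation via array slicing.
    Each interior point is replaced by the average of its four neighbors."""
    for _ in range(iterations):
        grid[1:-1, 1:-1] = 0.25 * (grid[0:-2, 1:-1] + grid[2:, 1:-1] +
                                   grid[1:-1, 0:-2] + grid[1:-1, 2:])
    return grid

g = np.zeros((100, 100))
g[0, :] = 1.0          # fixed boundary condition on one edge
solve(g)
```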
The program was run on the chinook supercomputer at the Environmental
Molecular Sciences Laboratory (EMSL), part of Pacific Northwest National
Laboratory (PNNL). Chinook consists of 2310 HP DL185 nodes with dual-socket,
64-bit, quad-core AMD 2.2~GHz Opteron processors. Each node has 32~Gbytes of
memory (4~Gbytes per core) and a local scratch disk.
%Fast communication between the nodes is obtained using a single rail
%Infiniband interconnect from Voltaire (switches) and Melanox (NICs). The
%system runs a version of Linux based on Red Hat Linux Advanced Server.
GAiN utilized up to 512 nodes of the cluster, using 4 cores per node.
We noted during initial evaluations of \texttt{laplace.py} that the wall clock
time did not compare favorably with the timer for the solver. Figure
\ref{fig:laplace} quantifies this behavior using a strong scaling test of the
two timers. Although the solver is shown to scale up to 2K cores on a modest
problem size, the overall time to run the application exhibits inverse scaling.
Previous research \cite{Scu11,Man11} confirms our results and indicates the
problem is due to Python's excessive access of the file system during the
import of modules. In a serial environment, this is reasonable behavior. On a
cluster with a shared filesystem, however, every Python interpreter thrashes
the same filesystem simultaneously during module import.
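The gap between the two timers can be observed directly by timing the import
phase separately from the compute phase. A minimal sketch (the workload stands
in for the solver and is our own illustration):

```python
import time

t0 = time.time()
import numpy as np          # on a shared filesystem at scale, module import
t_import = time.time() - t0 # can dominate the total wall clock time

t0 = time.time()
a = np.zeros((1000, 1000))  # stand-in for the solver workload
for _ in range(10):
    a += 1.0
t_compute = time.time() - t0

print(f"import: {t_import:.3f}s  compute: {t_compute:.3f}s")
```

On a single workstation both times are small; the scaling problem appears only
when thousands of interpreters import modules from one shared filesystem.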
\begin{figure}
\includegraphics[width=0.48\textwidth]{laplace.eps}
%\epsfig{file=laplace,width=0.48\textwidth}
\caption{laplace.py Strong Scaling for 10K$^2$ Matrix}
\label{fig:laplace}
\end{figure}
Our solution to the problem is unique to clusters whose compute nodes have
local scratch disks, such as EMSL's chinook system. The Python interpreter and
its associated modules and shared libraries are bundled and broadcast from node
zero to the remaining nodes; the master process on each node unbundles the
files, and the Python interpreter is run locally from each compute node. This
limits disk access to the local scratch disks, where only eight processes
compete for the local filesystem. Our tests execute the simple Python program
``\texttt{import numpy}''. Figure \ref{fig:python} compares the cost of running
a Python interpreter shared over the global filesystem, the individual cost of
the broadcast, the individual cost of running a node-local Python interpreter,
and the cumulative cost of the proposed solution. The proposed solution
performs better than the global filesystem case.
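The bundle/unbundle step can be sketched as follows. This is an illustration
of the scheme, not the scripts used on chinook: paths, function names, and the
use of mpi4py for the broadcast are all assumptions.

```python
# Sketch of the bundle-and-broadcast scheme described above.
import io
import os
import tarfile

def bundle(python_prefix):
    """Node zero: pack the Python installation into an in-memory tarball."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        tar.add(python_prefix, arcname="python")
    return buf.getvalue()

def unbundle(blob, scratch_dir):
    """Each node's master process: unpack the tarball onto local scratch."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:gz") as tar:
        tar.extractall(scratch_dir)
    return os.path.join(scratch_dir, "python")

# In the real setting the tarball is broadcast once, e.g. with mpi4py:
#   blob = comm.bcast(bundle(prefix) if comm.rank == 0 else None, root=0)
# after which every node runs the interpreter from its local scratch disk,
# so module imports never touch the shared filesystem.
```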
\begin{figure}
\includegraphics[width=0.48\textwidth]{python.eps}
%\epsfig{file=python,width=0.48\textwidth}
\caption{`import numpy' Strong Scaling}
\label{fig:python}
\end{figure}
\section{Conclusions}
GAiN succeeds in its ability to grow problem sizes beyond a single compute
node. The performance of the perfpy code and the ability to drop in GAiN
without modification of the core implementation demonstrate its utility.
However, the scalability of the Python interpreter on a cluster with a shared
filesystem limits the utility of Python. Our proposed solution, like the
related solutions \cite{Scu11,Man11}, alleviates the cost of using Python at
scale until a more robust solution can be found.
\section{Acknowledgments}
A portion of the research was performed using the Molecular Science Computing
(MSC) capability at EMSL, a national scientific user facility sponsored by the
Department of Energy's Office of Biological and Environmental Research and
located at Pacific Northwest National Laboratory (PNNL). %PNNL is operated by
%Battelle for the U.S. Department of Energy under contract DE-AC05-76RL01830.
175
% The following two commands are all you need in the
176
% initial runs of your .tex file to
177
% produce the bibliography for the citations in your paper.
178
\bibliographystyle{abbrv}
179
\bibliography{post174-daily}
180
% You must have a proper ".bib" file
181
% and remember to run:
182
% latex bibtex latex latex
183
% to resolve all references
185
% ACM needs 'a single self-contained file'!