<!--#include virtual="header.txt"-->

<p>MPI use depends upon the type of MPI being used.
There are three fundamentally different modes of operation used
by these various MPI implementations.</p>
<ol>
<li>SLURM directly launches the tasks and performs initialization
of communications (Quadrics MPI, MPICH2, MPICH-GM, MPICH-MX,
MVAPICH, MVAPICH2, some MPICH1 modes, and future versions of Open MPI).</li>
<li>SLURM creates a resource allocation for the job and then
mpirun launches tasks using SLURM's infrastructure (Open MPI,
LAM/MPI and HP-MPI).</li>
<li>SLURM creates a resource allocation for the job and then
mpirun launches tasks using some mechanism other than SLURM,
such as SSH or RSH (BlueGene MPI and some MPICH1 modes).
These tasks are initiated outside of SLURM's monitoring
or control. SLURM's epilog should be configured to purge
these tasks when the job's allocation is relinquished
(see the configuration sketch below this list).</li>
</ol>
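<p>As a rough sketch of that third case, the cleanup can be hooked in through the
<i>Epilog</i> parameter in <i>slurm.conf</i>. The script path below is illustrative
and would point to a site-provided script that kills any processes left behind by
the job:</p>
<pre>
# slurm.conf (illustrative excerpt)
# Run a site-provided cleanup script on each node when a job's allocation ends.
Epilog=/etc/slurm/slurm.epilog.clean
</pre>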
<p>Links to instructions for using several varieties of MPI
with SLURM are provided below.</p>
<ul>
<li><a href="#bluegene_mpi">BlueGene MPI</a></li>
<li><a href="#hp_mpi">HP-MPI</a></li>
<li><a href="#lam_mpi">LAM/MPI</a></li>
<li><a href="#mpich1">MPICH1</a></li>
<li><a href="#mpich2">MPICH2</a></li>
<li><a href="#mpich_gm">MPICH-GM</a></li>
<li><a href="#mpich_mx">MPICH-MX</a></li>
<li><a href="#mvapich">MVAPICH</a></li>
<li><a href="#mvapich2">MVAPICH2</a></li>
<li><a href="#open_mpi">Open MPI</a></li>
<li><a href="#quadrics_mpi">Quadrics MPI</a></li>
</ul>
<hr size=4 width="100%">

<h2><a name="open_mpi" href="http://www.open-mpi.org/"><b>Open MPI</b></a></h2>

<p>Open MPI relies upon
SLURM to allocate resources for the job and then mpirun to initiate the
tasks. When using the <span class="commandline">salloc</span> command,
<span class="commandline">mpirun</span>'s -nolocal option is recommended.
For example:</p>
<pre>
$ salloc -n4 sh    # allocates 4 processors
                   # and spawns shell for job
> mpirun -np 4 -nolocal a.out
> exit             # exits shell spawned by
                   # initial salloc command
</pre>
<p>Note that any direct use of <span class="commandline">srun</span>
will only launch one task per node when the LAM/MPI plugin is used.
To launch more than one task per node using the
<span class="commandline">srun</span> command, the <i>--mpi=none</i>
option will be required to explicitly disable the LAM/MPI plugin.</p>
<p>There is work underway in both SLURM and Open MPI to support task launch
using the <span class="commandline">srun</span> command.
We expect this mode of operation to be supported late in 2009.
It may differ slightly from the description below.
It relies upon SLURM version 2.0 (or higher) managing
reservations of communication ports for Open MPI's use.
The system administrator must specify the range of ports to be reserved
in the <i>slurm.conf</i> file using the <i>MpiParams</i> parameter.
For example:<br>
<i>MpiParams=ports=12000-12999</i></p>
<p>Launch tasks using the <span class="commandline">srun</span> command
plus the option <i>--resv-ports</i>.
The ports reserved on every allocated node will be identified in an
environment variable available to the tasks as shown here:<br>
<i>SLURM_STEP_RESV_PORTS=12000-12015</i></p>
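<p>A minimal launch might then look like this (the task count and program name
are illustrative):</p>
<pre>
$ srun --resv-ports -n4 a.out
</pre>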
<p>If the ports reserved for a job step are found by the Open MPI library
to be in use, a message of this form will be printed and the job step
will be re-launched:<br>
<i>srun: error: sun000: task 0 unable to claim reserved port, retrying</i><br>
After three failed attempts, the job step will be aborted.
Repeated failures should be reported to your system administrator in
order to rectify the problem by cancelling the processes holding those
ports.</p>
<hr size=4 width="100%">

<h2><a name="quadrics_mpi" href="http://www.quadrics.com/"><b>Quadrics MPI</b></a></h2>

<p>Quadrics MPI relies upon SLURM to
allocate resources for the job and <span class="commandline">srun</span>
to initiate the tasks. One would build the MPI program in the normal manner
then initiate it using a command line of this sort:</p>
<pre>
$ srun [options] &lt;program&gt; [program args]
</pre>

<hr size=4 width="100%">
<h2><a name="lam_mpi" href="http://www.lam-mpi.org/"><b>LAM/MPI</b></a></h2>

<p>LAM/MPI relies upon the SLURM
<span class="commandline">salloc</span> or <span class="commandline">sbatch</span>
command to allocate resources. In either case, specify
the maximum number of tasks required for the job. Then execute the
<span class="commandline">lamboot</span> command to start lamd daemons.
<span class="commandline">lamboot</span> utilizes SLURM's
<span class="commandline">srun</span> command to launch these daemons.
Do not directly execute the <span class="commandline">srun</span> command
to launch LAM/MPI tasks. For example:</p>
<pre>
$ salloc -n16 sh   # allocates 16 processors
                   # and spawns shell for job
> lamboot          # starts lamd daemons (uses srun internally)
> mpirun -np 16 foo args
1234 foo running on adev0 (o)
2345 foo running on adev1
> exit             # exits shell spawned by
                   # initial salloc command
</pre>

<p>Note that any direct use of <span class="commandline">srun</span>
will only launch one task per node when the LAM/MPI plugin is configured
as the default plugin. To launch more than one task per node using the
<span class="commandline">srun</span> command, the <i>--mpi=none</i>
option would be required to explicitly disable the LAM/MPI plugin
if that is the system default.</p>

<hr size=4 width="100%">
<h2><a name="hp_mpi" href="http://www.hp.com/go/mpi"><b>HP-MPI</b></a></h2>

<p>HP-MPI uses the
<span class="commandline">mpirun</span> command with the <b>-srun</b>
option to launch jobs. For example:</p>
<pre>
$MPI_ROOT/bin/mpirun -TCP -srun -N8 ./a.out
</pre>

<hr size=4 width="100%">
<h2><a name="mpich2" href="http://www.mcs.anl.gov/research/projects/mpich2/"><b>MPICH2</b></a></h2>

<p>MPICH2 jobs are launched using the <b>srun</b> command. Just link your program with
SLURM's implementation of the PMI library so that tasks can communicate
host and port information at startup. (The system administrator can add
these options to the mpicc and mpif77 commands directly, so the user will not
need to bother.) For example:</p>
<pre>
$ mpicc -L&lt;path_to_slurm_lib&gt; -lpmi ...
</pre>
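<p>Once the program is linked against SLURM's PMI library, it can be launched
directly with <span class="commandline">srun</span>; the task count below is
illustrative:</p>
<pre>
$ srun -n16 a.out
</pre>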
<ul>
<li>Some MPICH2 functions are not currently supported by the PMI
library integrated with SLURM.</li>
<li>Set the environment variable <b>PMI_DEBUG</b> to a numeric value
of 1 or higher for the PMI library to print debugging information.</li>
</ul>
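<p>For example, setting the variable on the command line is one simple way to
enable PMI debugging output for a single run (srun exports the caller's
environment to the tasks); the task count is illustrative:</p>
<pre>
$ PMI_DEBUG=1 srun -n16 a.out
</pre>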
<hr size=4 width="100%">

<h2><a name="mpich_gm" href="http://www.myri.com/scs/download-mpichgm.html"><b>MPICH-GM</b></a></h2>

<p>MPICH-GM jobs can be launched directly by the <b>srun</b> command.
SLURM's <i>mpichgm</i> MPI plugin must be used to establish communications
between the launched tasks. This can be accomplished either using the SLURM
configuration parameter <i>MpiDefault=mpichgm</i> in <b>slurm.conf</b>
or srun's <i>--mpi=mpichgm</i> option. For example:</p>
<pre>
$ srun -n16 --mpi=mpichgm a.out
</pre>
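<p>Alternatively, a minimal <i>slurm.conf</i> excerpt (illustrative) that makes
this plugin the system default, so the <i>--mpi</i> option can be omitted:</p>
<pre>
# slurm.conf (illustrative excerpt)
MpiDefault=mpichgm
</pre>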
<hr size=4 width="100%">
<h2><a name="mpich_mx" href="http://www.myri.com/scs/download-mpichmx.html"><b>MPICH-MX</b></a></h2>

<p>MPICH-MX jobs can be launched directly by the <b>srun</b> command.
SLURM's <i>mpichmx</i> MPI plugin must be used to establish communications
between the launched tasks. This can be accomplished either using the SLURM
configuration parameter <i>MpiDefault=mpichmx</i> in <b>slurm.conf</b>
or srun's <i>--mpi=mpichmx</i> option. For example:</p>
<pre>
$ srun -n16 --mpi=mpichmx a.out
</pre>

<hr size=4 width="100%">
<h2><a name="mvapich" href="http://mvapich.cse.ohio-state.edu/"><b>MVAPICH</b></a></h2>

<p>MVAPICH jobs can be launched directly by the <b>srun</b> command.
SLURM's <i>mvapich</i> MPI plugin must be used to establish communications
between the launched tasks. This can be accomplished either using the SLURM
configuration parameter <i>MpiDefault=mvapich</i> in <b>slurm.conf</b>
or srun's <i>--mpi=mvapich</i> option. For example:</p>
<pre>
$ srun -n16 --mpi=mvapich a.out
</pre>
<p><b>NOTE:</b> If MVAPICH is used in the shared memory model, with all tasks
running on a single node, then use the <i>mpich1_shmem</i> MPI plugin instead.<br>
<b>NOTE (for system administrators):</b> Configure
<i>PropagateResourceLimitsExcept=MEMLOCK</i> in <b>slurm.conf</b> and
start the <i>slurmd</i> daemons with an unlimited locked memory limit.
For more details, see the
<a href="http://mvapich.cse.ohio-state.edu/support/mvapich_user_guide.html#x1-420007.2.3">MVAPICH</a>
documentation for "CQ or QP Creation failure".</p>
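<p>As a rough sketch of that administrator configuration (the init-script
location varies by site; the key points are the <i>slurm.conf</i> parameter and
an unlimited locked-memory limit in the environment that starts
<i>slurmd</i>):</p>
<pre>
# slurm.conf (illustrative excerpt): do not propagate the submitting
# shell's MEMLOCK limit to the job's environment
PropagateResourceLimitsExcept=MEMLOCK

# in the script that starts the slurmd daemon:
ulimit -l unlimited
</pre>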
<hr size=4 width="100%">
<h2><a name="mvapich2" href="http://nowlab.cse.ohio-state.edu/projects/mpi-iba"><b>MVAPICH2</b></a></h2>

<p>MVAPICH2 jobs can be launched directly by the <b>srun</b> command.
SLURM's <i>none</i> MPI plugin must be used to establish communications
between the launched tasks. This can be accomplished either using the SLURM
configuration parameter <i>MpiDefault=none</i> in <b>slurm.conf</b>
or srun's <i>--mpi=none</i> option. The program must also be linked with
SLURM's implementation of the PMI library so that tasks can communicate
host and port information at startup. (The system administrator can add
these options to the mpicc and mpif77 commands directly, so the user will not
need to bother.) <b>Do not use SLURM's MVAPICH plugin for MVAPICH2.</b>
For example:</p>
<pre>
$ mpicc -L&lt;path_to_slurm_lib&gt; -lpmi ...
$ srun -n16 --mpi=none a.out
</pre>

<hr size=4 width="100%">
<h2><a name="bluegene_mpi" href="http://www.research.ibm.com/bluegene/"><b>BlueGene MPI</b></a></h2>

<p>BlueGene MPI relies upon SLURM to create the resource allocation and then
uses the native <span class="commandline">mpirun</span> command to launch tasks.
Build a job script containing one or more invocations of the
<span class="commandline">mpirun</span> command. Then submit
the script to SLURM using <span class="commandline">sbatch</span>.</p>
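<p>A minimal job script along those lines might look like the following; the
script name matches the <span class="commandline">sbatch</span> example below,
and the mpirun arguments are placeholders rather than a definitive
invocation:</p>
<pre>
#!/bin/bash
# my.script (illustrative): one or more native mpirun invocations
mpirun [options] ./a.out
mpirun [options] ./b.out
</pre>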
<pre>
$ sbatch -N512 my.script
</pre>

<p>Note that the node count specified with the <i>-N</i> option indicates
the base partition count.
See the <a href="bluegene.html">BlueGene User and Administrator Guide</a>
for more information.</p>

<hr size=4 width="100%">
<h2><a name="mpich1" href="http://www-unix.mcs.anl.gov/mpi/mpich1/"><b>MPICH1</b></a></h2>

<p>MPICH1 development ceased in 2005. It is recommended that you convert to
MPICH2 or some other MPI implementation.
If you still want to use MPICH1, note that it has several different
programming models. If you are using the shared memory model
(<i>DEFAULT_DEVICE=ch_shmem</i> in the mpirun script), then initiate
the tasks using the <span class="commandline">srun</span> command
with the <i>--mpi=mpich1_shmem</i> option. For example:</p>
<pre>
$ srun -n16 --mpi=mpich1_shmem a.out
</pre>
<p>If you are using MPICH P4 (<i>DEFAULT_DEVICE=ch_p4</i> in
the mpirun script) and SLURM version 1.2.11 or newer,
then it is recommended that you apply the patch in the SLURM
distribution's file <i>contribs/mpich1.slurm.patch</i>.
Follow directions within the file to rebuild MPICH.
Applications must be relinked with the new library.
Initiate tasks using the
<span class="commandline">srun</span> command with the
<i>--mpi=mpich1_p4</i> option. For example:</p>
<pre>
$ srun -n16 --mpi=mpich1_p4 a.out
</pre>
<p>Note that SLURM launches one task per node and the MPICH
library linked within your applications launches the other
tasks with shared memory used for communications between them.
The only real anomaly is that all output from all spawned tasks
on a node appears to SLURM as coming from the one task that it
launched. If the srun --label option is used, the task ID labels
will be misleading.</p>
<p>Other MPICH1 programming models currently rely upon the SLURM
<span class="commandline">salloc</span> or
<span class="commandline">sbatch</span> command to allocate resources.
In either case, specify the maximum number of tasks required for the job.
You may then need to build a list of hosts to be used and use that
as an argument to the mpirun command.
For example, a batch script (<i>mpich.sh</i>) might contain:</p>
<pre>
srun hostname -s | sort -u >slurm.hosts
mpirun [options] -machinefile slurm.hosts a.out
</pre>
<p>Submit the script to SLURM:</p>
<pre>
$ sbatch -n16 mpich.sh
sbatch: Submitted batch job 1234
</pre>
<p>Note that in this example, mpirun uses the rsh command to launch
tasks. These tasks are not managed by SLURM since they are launched
outside of its control.</p>

<p style="text-align:center;">Last modified 2 March 2009</p>

<!--#include virtual="footer.txt"-->