For more complete information on NetPIPE, visit the webpage at:

http://www.scl.ameslab.gov/Projects/NetPIPE/

NetPIPE was originally developed by Quinn Snell, Armin Mikler,
John Gustafson, and Guy Helmer.

It is currently being developed and maintained by Dave Turner with
help from several graduate students (Xuehua Chen, Adam Oline,
Brian Smith, Bogdan Vasiliu).
Release 3.6.2 mainly fixes some bugs. A number of portability issues
with 64-bit architectures were taken care of, especially in the Infiniband
module. A small typecasting error was fixed that caused segmentation faults
on Red Hat Enterprise and Fedora Core systems (and probably others...). The
bi-directional mode was also tested with the Infiniband module, and a subset
of the NetPIPE options is now supported.
Release 3.6.1 adds a bi-directional (-2) mode to allow data to be sent
in both directions simultaneously. This has been tested with the
TCP, MPI, MPI-2, and GM modules. You can also now test
synchronous MPI communications (MPI_Ssend) using -S.
A launch utility (nplaunch) allows you to launch NPtcp, NPgm,
NPib, and NPpvm from one side, using ssh to start the remote executable.
Version 3.6 adds the ability to test with and without cache effects,
and the ability to offset both the source and destination buffers.
A memcpy module has also been added.
Release 3.5 removes the CPU utilization measurements. Getrusage is
probably not very accurate, so a dummy workload will eventually be
used instead.
The streaming mode has also been fixed. When run at Gigabit speeds,
the TCP window size would collapse, limiting the performance of subsequent
data points. Now we reset the sockets between trials to prevent this.
We have also added a module to evaluate memory copy rates.
-n now sets a constant number of repeats for each trial.
-r resets the sockets between each trial (automatic for streaming).
Release 3.3 includes an Infiniband module for the Mellanox VAPI.
It also has an integrity check (-i), which is still being developed.
Version 3.2 includes additional modules to test
PVM, TCGMSG, SHMEM, and MPI-2, as well as the GM, GPSHMEM, ARMCI, and LAPI
software layers they run upon.

If you have problems or comments, please email netpipe@scl.ameslab.gov
____________________________________________________________________________

NetPIPE Network Protocol Independent Performance Evaluator, Release 2.3
Copyright 1997, 1998 Iowa State University Research Foundation, Inc.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation. You should have received a copy of the
GNU General Public License along with this program; if not, write to the
Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
____________________________________________________________________________
NetPIPE requires an ANSI C compiler. You are on your own for
installing the various libraries that NetPIPE can be used to test.

Review the provided makefile and change any necessary settings, such
as the CC compiler or CFLAGS flags, required extra libraries, and the PVM
library and include file pathnames if you have these communication
libraries. Alternatively, you can specify these changes on the
make command line; for example, you could compile the NPtcp module
using the icc compiler instead of the default cc compiler.
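A sketch of such a command, assuming the makefile uses the standard CC
variable so it can be overridden on the command line:

   make CC=icc tcp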
Compile NetPIPE with the desired communication interface by using:

make tcp        (basic TCP test; see the TCP section below)
make mpi        (this will use the default MPI on the system)
make pvm        (you may need to set some paths in the makefile)
make tcgmsg     (you will need to set some paths in the makefile)
make mpi2       (this will test 1-sided MPI_Put() functions)
make shmem      (1-sided library for Cray and SGI systems)
make gm         (for Myrinet cards, you will need to set some paths)
make gpshmem    (SHMEM interface for other machines)
make armci      (still under development)
make lapi       (for the IBM SP)
make ib         (for Mellanox Infiniband adapters, uses VAPI layer)
make memcpy     (uses memcpy to copy data between buffers in 1 process)
make MP_memcpy  (uses an optimized copy in MP_memcpy.c to copy data between
                 buffers. This requires icc or gcc 3.x.)
NetPIPE will dump its output to the screen by default and also
to the file np.out. The following parameters can be used to change how
NetPIPE is run, and are listed in order of their general usefulness.
An example command combining several of these options follows the list.
-b: specify send and receive TCP buffer sizes e.g. "-b 32768"
    This can make a huge difference for Gigabit Ethernet cards.
    You may need to tune the OS to set a larger maximum TCP
    buffer size for optimal performance.

-O: specify send and optionally receive buffer offsets, e.g. "-O 1,3"

-l: lower bound (start value for block size) e.g. "-l 1"

-u: upper bound (stop value for block size) e.g. "-u 1048576"

-o: specify output filename e.g. "-o output.txt"

-z: for MPI, receive messages using ANYSOURCE

-g: MPI-2: use MPI_Get() instead of MPI_Put()

-f: MPI-2: do not use a fence call (may not work for all packages)

-I: Invalidate cache: take measures to eliminate the effects cache
    has on performance (see "Interpreting the Results" below)

-a: asynchronous receive (a.k.a. pre-posted receive)
    May not have any effect, depending on your implementation

-B: burst all preposts before measuring performance
    Normally only one receive is preposted at a time with -a

-p: set perturbation offset of buffer size, e.g. "-p 3"

-i: Integrity check: check the integrity of the data transfer instead
    of measuring performance

-s: stream option (default mode is "ping pong")
    If this option is used, it must be specified on both
    the sending and receiving processes

-S: use synchronous sends/receives for MPI

-2: bi-directional communications; transmit in both directions
    simultaneously
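For example, a hypothetical TCP run with a 256 KB socket buffer, cache
invalidation, and a custom output file might look like this (hostname and
values are illustrative only):

   local_host> NPtcp -h remote_host -b 262144 -I -o tcp_results.txt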
TCP

Compile NetPIPE using 'make tcp'

remote_host> NPtcp [options]
local_host> NPtcp -h remote_host [options]

or

local_host> nplaunch NPtcp -h remote_host [options]
MPICH

Compile NetPIPE using 'make mpi'
Use a p4pg file (a sketch appears below) or edit the
mpich/util/mach/mach.{ARCH} file to specify the machines to run on.
mpirun [-nolocal] -np 2 NPmpi [options]
'setenv P4_SOCKBUFSIZE 256000' can make a huge difference for
MPICH on Unix systems.
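A minimal p4pg (procgroup) file sketch for two machines, assuming the NPmpi
executable lives in /path/to/NetPIPE (the first line names the local host and
starts 0 extra local processes):

   local_host  0 /path/to/NetPIPE/NPmpi
   remote_host 1 /path/to/NetPIPE/NPmpi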
LAM/MPI (comes on the RedHat Linux distributions now)

Compile NetPIPE using 'make mpi'
Put the machine names into a lamhosts file (a sketch appears below)
'lamboot -v -b lamhosts' to start the lamd daemons
mpirun -np 2 [-O] NPmpi [options]
The -O parameter avoids data translation for homogeneous systems.
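A lamhosts file is just a list of machine names, one per line; a hypothetical
two-node example:

   host1
   host2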
MPI/Pro (commercial version)

Compile NetPIPE using 'make mpi'
Put the machine names into /etc/machines or a local machine file
mpirun -np 2 NPmpi [options]
MP_Lite (A lightweight version of MPI)

Install MP_Lite (http://www.scl.ameslab.gov/Projects/MP_Lite/)
Compile NetPIPE using 'make MP_Lite'
mprun -np 2 -h {host1} {host2} NPmplite [options]
PVM

Install PVM (comes on the RedHat distributions now)
Set the PVM paths in the makefile if necessary.
Compile NetPIPE using 'make pvm'
Use the 'pvm' utility to start the pvmd daemons
   type 'pvm' to start it (this will also start pvmd on the local_host)
   pvm> help             --> lists all commands
   pvm> add remote_host  --> starts a pvmd on the machine called 'remote_host'
   pvm> quit             --> when you have all the pvmd machines started

remote_host> NPpvm [options]
local_host> NPpvm -h remote_host [options]

or

local_host> nplaunch NPpvm -h remote_host [options]

Changing PVMDATA in netpipe.h and PvmRouteDirect in pvm.c can
affect the performance greatly.
TCGMSG (unlikely that anyone who doesn't know TCGMSG well will try this)

Install the TCGMSG package
Set the TCGMSG paths in the makefile.
Compile NetPIPE using 'make tcgmsg'
Create an NPtcgmsg.p file with hosts and paths (see hosts/NPtcgmsg.p)
(no options can be passed into this version)
MPI-2

Install the MPI package
Compile NetPIPE using 'make mpi2'
Follow the directions above for running the MPI package
The MPI_Put() function will be tested with fence calls by default.
Use -g to test MPI_Get() instead, or -f to do MPI_Put() without
fence calls (will not work with LAM).
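For instance, to test MPI_Get() under MPICH (assuming the mpi2 build produces
an executable named NPmpi2; check the makefile for the actual name):

   mpirun -np 2 NPmpi2 -g [options]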
SHMEM

Must be run on a Cray or SGI system that supports SHMEM calls.
Compile NetPIPE using 'make shmem'
(Xuehua, fill out the rest)
GPSHMEM (a General Purpose SHMEM library) (gpshmem.c in development)

Ask Ricky or Krzysztof for help :).
GM (test the raw performance of GM on Myrinet cards)

Install the GM package and configure the Myrinet cards
Compile NetPIPE using 'make gm'

remote_host> NPgm [options]
local_host> NPgm -h remote_host [options]

or

local_host> nplaunch NPgm -h remote_host [options]
LAPI (IBM SP)

Log into the IBM SP machine at NERSC
Compile NetPIPE using 'make lapi'

To run interactively at NERSC:
   Set the environment variable MP_MSG_API to lapi
   e.g. 'setenv MP_MSG_API lapi', 'export MP_MSG_API=lapi'
   Run NPlapi with '-procs 2' to tell the parallel environment you
   want 2 nodes. Use any other options that are applicable to NetPIPE.

To submit a batch job at NERSC:
   Copy the file batchLapi from the 'hosts' directory to the directory
   you will run from.
   Edit the copy of batchLapi (a sketch of the relevant lines follows):
      job_name: Identifying name of the job, can be anything
      output: File to send stdout to
      error: File to send stderr to (most of NetPIPE's output goes here)
      tasks_per_node: Number of tasks to be run on each node
      node: Number of nodes to run on
      (Use a combination of the above two options to determine
      how NetPIPE runs. Use 1 task per node and 2 nodes to run the
      benchmark between nodes. Use 2 tasks per node and 1 node
      to run the benchmark on a single node)
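   A rough sketch of what the corresponding batchLapi lines might look like
   (LoadLeveler directive syntax; the actual file shipped in the 'hosts'
   directory may differ):

      # @ job_name       = netpipe_lapi
      # @ output         = netpipe.out
      # @ error          = netpipe.err
      # @ tasks_per_node = 1
      # @ node           = 2
      # @ queue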
   Use whatever command-line options are appropriate for NetPIPE
   Submit the job with the command 'llsubmit batchLapi'
   Check status of all your jobs with 'llqs -u <user>'
   You should receive an email when the job finishes. The resulting output
   files will then be available.
ARMCI

Install the ARMCI package
Compile NetPIPE using 'make armci'
Follow the directions above for running the MPI package
If running on interfaces other than the default, create a file
called armci_hosts containing two lines, one for each hostname
(a hypothetical example follows).
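For instance, an armci_hosts file selecting a specific interface on each of
two machines might look like this (hostnames are examples only):

   host1-eth1
   host2-eth1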
Infiniband (Mellanox VAPI)

This test will only work on machines connected via TCP/IP as well
as Infiniband.
Install the Mellanox Infiniband adapters and software
Make sure the adapters are up and running (e.g. check that the
Mellanox-supplied bandwidth/latency program, perf_main, works, if
it is available)
Compile NetPIPE using 'make ib' (The environment variable MTHOME needs
to be set to the directory containing the include and lib directories
for the Mellanox software).

remote_host> NPib [options]
local_host> NPib -h remote_host [options]

or

local_host> nplaunch NPib -h remote_host [options]

(remote_host should be the IP address or hostname of the other host)
Use -m to select the MTU size for the Infiniband adapter.
   Valid values are 256, 512, 1024, 2048, and 4096. Default is 1024.
Use -t to select the communications type.
   send_recv:           basic send and receive
   send_recv_with_imm:  send and receive with immediate data
   rdma_write:          one-sided remote DMA write
   rdma_write_with_imm: one-sided remote DMA write with immediate data
   Default is send_recv.
Use -c to select the message completion type.
   local_poll: poll on the last byte of the receive buffer
   vapi_poll:  use the VAPI polling mechanism
   event:      use the VAPI event completion mechanism
   Default is local_poll.
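As an illustration only, an NPib run using a 2048-byte MTU, RDMA writes, and
VAPI polling might be invoked as (hostname and values are examples):

   local_host> NPib -h remote_host -m 2048 -t rdma_write -c vapi_poll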
Interpreting the Results
------------------------

NetPIPE generates an np.out file by default, which can be renamed using the
-o option. This file contains 3 columns: the number of bytes, the
throughput in Mbps, and the round-trip time divided by two.
The first 2 columns can therefore be used to produce a throughput vs.
message size graph.
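One simple way to produce such a graph, assuming gnuplot is available (any
plotting tool will do):

   gnuplot -e "set logscale x; plot 'np.out' using 1:2 with linespoints; pause -1"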
The screen output contains this same information, plus the test number
and the number of ping-pongs involved in the test.
For example:

      1       0.136403       0.00005593
      2       0.274586       0.00005557
      3       0.402104       0.00005692
      4       0.545668       0.00005593
      6       0.805053       0.00005686
      8       1.039586       0.00005871
     12       1.598912       0.00005726
     13       1.700719       0.00005832
     16       2.098007       0.00005818
     19       2.340364       0.00006194
The -I switch can be used to reduce the effects cache has on performance.
Without the switch, NetPIPE tests the performance of communicating
n-byte blocks by reading from an n-byte buffer on one node, sending data
over the communications link, and writing to an n-byte buffer on the other
node. For each block size, this trial will be repeated x times, where x
typically starts out very large for small block sizes, and decreases as the
block size grows. The same buffers on each node are used repeatedly, so
after the first transfer the entire buffer will be in cache on each node,
given that the block size is less than the available cache. Thus each transfer
after the first will be read from cache on one end and written into cache on
the other. Depending on the cache architecture, a write to main memory may
not occur on the receiving end during the transfer loop.
While the performance measurements obtained from this method are certainly
useful, it is also interesting to use the -I switch to measure performance
when data is read from and written to main memory. In order to facilitate
this, large pools of memory are allocated at startup, and each n-byte transfer
comes from a region of the pool not in cache. Before each series of n-byte
transfers, every byte of a large dummy buffer is individually accessed in
order to flush the data for the transfer out of cache. After this step, the
first n-byte transfer comes from the beginning of the large pool, the second
comes from n bytes after the beginning of the pool, and so on (note that the
stride between n-byte transfers will depend on the buffer alignment setting).
In this way we make sure each read is coming from main memory.
On the receiving end, data is written into a large pool in the same fashion
that it was read on the transmitting end. Data will first be written into
cache. What happens next depends on the cache architecture, but one case is
that no transfer to main memory occurs yet. For moderately large block
sizes, however, a large number of transfer iterations will cause reuse of
cache memory. As this occurs, data in the cache locations to be replaced must
be written back to main memory, so we incur a performance penalty while we
wait for the write-back to complete.
In summary, using the -I switch gives worst-case performance (i.e. all data
transfers involve reading from or writing to memory not in cache) and not
using the switch gives best-case performance (i.e. all data transfers involve
only reading from or writing to memory in cache). Note that other combinations,
such as reading from memory in cache and writing to memory not in cache, would
give intermediary results. We chose to implement the methods that measure the
two extremes.
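For example, the two bounds could be compared for a TCP test by running the
benchmark twice with different output files (a sketch; depending on the module,
options such as -I may need to be given on both ends):

   local_host> NPtcp -h remote_host -o cache.out
   local_host> NPtcp -h remote_host -I -o nocache.out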

- we need to replace the getrusage stuff from version 2.4 with a dummy
  workload