For more complete information on NetPIPE, visit the webpage at:

http://www.scl.ameslab.gov/Projects/NetPIPE/

NetPIPE was originally developed by Quinn Snell, Armin Mikler,
John Gustafson, and Guy Helmer.

It is currently being developed and maintained by Dave Turner with
help from several graduate students (Xuehua Chen, Adam Oline,
Brian Smith, Bogdan Vasiliu).
Release 3.6.2 mainly fixes some bugs. A number of portability issues
with 64-bit architectures were taken care of, especially in the Infiniband
module. A small typecasting error was fixed that caused segmentation faults
on Red Hat Enterprise and Fedora Core systems (and probably others...). The
bi-directional mode was also tested with the Infiniband module, and a subset
of the NetPIPE options is now supported.
Release 3.6.1 adds a bi-directional (-2) mode to allow data to be sent
in both directions simultaneously. This has been tested with the
TCP, MPI, MPI-2, and GM modules. You can also now test
synchronous MPI communications (MPI_Ssend) using -S.
A launch utility (nplaunch) allows you to launch NPtcp, NPgm,
NPib, and NPpvm from one side, using ssh to start the remote executable.
Version 3.6 adds the ability to test with and without cache effects,
and the ability to offset both the source and destination buffers.
A memcpy module has also been added.
Release 3.5 removes the CPU utilization measurements. Getrusage is
probably not very accurate, so a dummy workload will eventually be
used instead.
The streaming mode has also been fixed. When run at Gigabit speeds,
the TCP window size would collapse, limiting the performance of subsequent
data points. Now we reset the sockets between trials to prevent this.
We have also added a module to evaluate memory copy rates.
-n now sets a constant number of repeats for each trial.
-r resets the sockets between each trial (automatic for streaming).
Release 3.3 includes an Infiniband module for the Mellanox VAPI.
It also has an integrity check (-i), which is still being developed.
Version 3.2 includes additional modules to test
PVM, TCGMSG, SHMEM, and MPI-2, as well as the GM, GPSHMEM, ARMCI, and LAPI
software layers they run upon.

If you have problems or comments, please email netpipe@scl.ameslab.gov
____________________________________________________________________________

NetPIPE Network Protocol Independent Performance Evaluator, Release 2.3
Copyright 1997, 1998 Iowa State University Research Foundation, Inc.

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation. You should have received a copy of the
GNU General Public License along with this program; if not, write to the
Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
____________________________________________________________________________
NetPIPE requires an ANSI C compiler. You are on your own for
installing the various libraries that NetPIPE can be used to test.

Review the provided makefile and change any necessary settings, such
as the CC compiler or CFLAGS flags, required extra libraries, and the PVM
library and include file pathnames if you have these communication
libraries. Alternatively, you can specify these changes on the
make command line; for example, you could compile the NPtcp module
using the icc compiler instead of the default cc compiler.
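A sketch of such a command, assuming the makefile uses the standard CC
variable so it can be overridden on the command line:

   make CC=icc tcp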
Compile NetPIPE with the desired communication interface by using:

make tcp        (basic TCP test; see the TCP section below)
make mpi        (this will use the default MPI on the system)
make pvm        (you may need to set some paths in the makefile)
make tcgmsg     (you will need to set some paths in the makefile)
make mpi2       (this will test 1-sided MPI_Put() functions)
make shmem      (1-sided library for Cray and SGI systems)
make gm         (for Myrinet cards, you will need to set some paths)
make gpshmem    (SHMEM interface for other machines)
make armci      (still under development)
make lapi       (for the IBM SP)
make ib         (for Mellanox Infiniband adapters, uses VAPI layer)
make memcpy     (uses memcpy to copy data between buffers in 1 process)
make MP_memcpy  (uses an optimized copy in MP_memcpy.c to copy data between
                 buffers. This requires icc or gcc 3.x.)
NetPIPE will dump its output to the screen by default and also
to the file np.out. The following parameters can be used to change how
NetPIPE is run, and are listed in order of their general usefulness.
An example command combining several of these options follows the list.
-b: specify send and receive TCP buffer sizes e.g. "-b 32768"
    This can make a huge difference for Gigabit Ethernet cards.
    You may need to tune the OS to set a larger maximum TCP
    buffer size for optimal performance.

-O: specify send and optionally receive buffer offsets, e.g. "-O 1,3"

-l: lower bound (start value for block size) e.g. "-l 1"

-u: upper bound (stop value for block size) e.g. "-u 1048576"

-o: specify output filename e.g. "-o output.txt"

-z: for MPI, receive messages using ANYSOURCE

-g: MPI-2: use MPI_Get() instead of MPI_Put()

-f: MPI-2: do not use a fence call (may not work for all packages)

-I: Invalidate cache: take measures to eliminate the effects cache
    has on performance (see "Interpreting the Results" below)

-a: asynchronous receive (a.k.a. pre-posted receive)
    May not have any effect, depending on your implementation

-B: burst all preposts before measuring performance
    Normally only one receive is preposted at a time with -a

-p: set perturbation offset of buffer size, e.g. "-p 3"

-i: Integrity check: check the integrity of the data transfer instead
    of measuring performance

-s: stream option (default mode is "ping pong")
    If this option is used, it must be specified on both
    the sending and receiving processes

-S: use synchronous sends/receives for MPI

-2: bi-directional communications; transmit in both directions
    simultaneously
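For example, a hypothetical TCP run with a 256 KB socket buffer, cache
invalidation, and a custom output file might look like this (hostname and
values are illustrative only):

   local_host> NPtcp -h remote_host -b 262144 -I -o tcp_results.txt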
TCP

Compile NetPIPE using 'make tcp'

remote_host> NPtcp [options]
local_host> NPtcp -h remote_host [options]

or

local_host> nplaunch NPtcp -h remote_host [options]
MPICH

Compile NetPIPE using 'make mpi'
Use a p4pg file (a sketch appears below) or edit the
mpich/util/mach/mach.{ARCH} file to specify the machines to run on.
mpirun [-nolocal] -np 2 NPmpi [options]
'setenv P4_SOCKBUFSIZE 256000' can make a huge difference for
MPICH on Unix systems.
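A minimal p4pg (procgroup) file sketch for two machines, assuming the NPmpi
executable lives in /path/to/NetPIPE (the first line names the local host and
starts 0 extra local processes):

   local_host  0 /path/to/NetPIPE/NPmpi
   remote_host 1 /path/to/NetPIPE/NPmpi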
LAM/MPI (comes on the RedHat Linux distributions now)

Compile NetPIPE using 'make mpi'
Put the machine names into a lamhosts file (a sketch appears below)
'lamboot -v -b lamhosts' to start the lamd daemons
mpirun -np 2 [-O] NPmpi [options]
The -O parameter avoids data translation for homogeneous systems.
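A lamhosts file is just a list of machine names, one per line; a hypothetical
two-node example:

   host1
   host2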
MPI/Pro (commercial version)

Compile NetPIPE using 'make mpi'
Put the machine names into /etc/machines or a local machine file
mpirun -np 2 NPmpi [options]
MP_Lite (A lightweight version of MPI)

Install MP_Lite (http://www.scl.ameslab.gov/Projects/MP_Lite/)
Compile NetPIPE using 'make MP_Lite'
mprun -np 2 -h {host1} {host2} NPmplite [options]
PVM

Install PVM (comes on the RedHat distributions now)
Set the PVM paths in the makefile if necessary.
Compile NetPIPE using 'make pvm'
Use the 'pvm' utility to start the pvmd daemons
   type 'pvm' to start it (this will also start pvmd on the local_host)
   pvm> help             --> lists all commands
   pvm> add remote_host  --> starts a pvmd on the machine called 'remote_host'
   pvm> quit             --> when you have all the pvmd machines started

remote_host> NPpvm [options]
local_host> NPpvm -h remote_host [options]

or

local_host> nplaunch NPpvm -h remote_host [options]

Changing PVMDATA in netpipe.h and PvmRouteDirect in pvm.c can
affect the performance greatly.
TCGMSG (unlikely that anyone who doesn't know TCGMSG well will try this)

Install the TCGMSG package
Set the TCGMSG paths in the makefile.
Compile NetPIPE using 'make tcgmsg'
Create an NPtcgmsg.p file with hosts and paths (see hosts/NPtcgmsg.p)
(no options can be passed into this version)
MPI-2

Install the MPI package
Compile NetPIPE using 'make mpi2'
Follow the directions above for running the MPI package
The MPI_Put() function will be tested with fence calls by default.
Use -g to test MPI_Get() instead, or -f to do MPI_Put() without
fence calls (will not work with LAM).
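For instance, to test MPI_Get() under MPICH (assuming the mpi2 build produces
an executable named NPmpi2; check the makefile for the actual name):

   mpirun -np 2 NPmpi2 -g [options]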
SHMEM

Must be run on a Cray or SGI system that supports SHMEM calls.
Compile NetPIPE using 'make shmem'
(Xuehua, fill out the rest)
GPSHMEM (a General Purpose SHMEM library) (gpshmem.c in development)

Ask Ricky or Krzysztof for help :).
GM (test the raw performance of GM on Myrinet cards)

Install the GM package and configure the Myrinet cards
Compile NetPIPE using 'make gm'

remote_host> NPgm [options]
local_host> NPgm -h remote_host [options]

or

local_host> nplaunch NPgm -h remote_host [options]
LAPI (IBM SP)

Log into the IBM SP machine at NERSC
Compile NetPIPE using 'make lapi'

To run interactively at NERSC:
   Set the environment variable MP_MSG_API to lapi
   e.g. 'setenv MP_MSG_API lapi', 'export MP_MSG_API=lapi'
   Run NPlapi with '-procs 2' to tell the parallel environment you
   want 2 nodes. Use any other options that are applicable to NetPIPE.

To submit a batch job at NERSC:
   Copy the file batchLapi from the 'hosts' directory to the directory
   you will run from.
   Edit the copy of batchLapi (a sketch of the relevant lines follows):
      job_name: Identifying name of the job, can be anything
      output: File to send stdout to
      error: File to send stderr to (most of NetPIPE's output goes here)
      tasks_per_node: Number of tasks to be run on each node
      node: Number of nodes to run on
      (Use a combination of the above two options to determine
      how NetPIPE runs. Use 1 task per node and 2 nodes to run the
      benchmark between nodes. Use 2 tasks per node and 1 node
      to run the benchmark on a single node)
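   A rough sketch of what the corresponding batchLapi lines might look like
   (LoadLeveler directive syntax; the actual file shipped in the 'hosts'
   directory may differ):

      # @ job_name       = netpipe_lapi
      # @ output         = netpipe.out
      # @ error          = netpipe.err
      # @ tasks_per_node = 1
      # @ node           = 2
      # @ queue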
   Use whatever command-line options are appropriate for NetPIPE
   Submit the job with the command 'llsubmit batchLapi'
   Check status of all your jobs with 'llqs -u <user>'
   You should receive an email when the job finishes. The resulting output
   files will then be available.
ARMCI

Install the ARMCI package
Compile NetPIPE using 'make armci'
Follow the directions above for running the MPI package
If running on interfaces other than the default, create a file
called armci_hosts containing two lines, one for each hostname
(a hypothetical example follows).
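For instance, an armci_hosts file selecting a specific interface on each of
two machines might look like this (hostnames are examples only):

   host1-eth1
   host2-eth1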
Infiniband (Mellanox VAPI)

This test will only work on machines connected via TCP/IP as well
as Infiniband.
Install the Mellanox Infiniband adapters and software
Make sure the adapters are up and running (e.g. check that the
Mellanox-supplied bandwidth/latency program, perf_main, works, if
it is available)
Compile NetPIPE using 'make ib' (The environment variable MTHOME needs
to be set to the directory containing the include and lib directories
for the Mellanox software).

remote_host> NPib [options]
local_host> NPib -h remote_host [options]

or

local_host> nplaunch NPib -h remote_host [options]

(remote_host should be the IP address or hostname of the other host)
Use -m to select the MTU size for the Infiniband adapter.
   Valid values are 256, 512, 1024, 2048, and 4096. Default is 1024.
Use -t to select the communications type.
   send_recv:           basic send and receive
   send_recv_with_imm:  send and receive with immediate data
   rdma_write:          one-sided remote DMA write
   rdma_write_with_imm: one-sided remote DMA write with immediate data
   Default is send_recv.
Use -c to select the message completion type.
   local_poll: poll on the last byte of the receive buffer
   vapi_poll:  use the VAPI polling mechanism
   event:      use the VAPI event completion mechanism
   Default is local_poll.
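As an illustration only, an NPib run using a 2048-byte MTU, RDMA writes, and
VAPI polling might be invoked as (hostname and values are examples):

   local_host> NPib -h remote_host -m 2048 -t rdma_write -c vapi_poll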
Interpreting the Results
------------------------

NetPIPE generates an np.out file by default, which can be renamed using the
-o option. This file contains 3 columns: the number of bytes, the
throughput in Mbps, and the round-trip time divided by two.
The first 2 columns can therefore be used to produce a throughput vs.
message size graph.
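One simple way to produce such a graph, assuming gnuplot is available (any
plotting tool will do):

   gnuplot -e "set logscale x; plot 'np.out' using 1:2 with linespoints; pause -1"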
The screen output contains this same information, plus the test number
and the number of ping-pongs involved in the test.
For example:

      1       0.136403       0.00005593
      2       0.274586       0.00005557
      3       0.402104       0.00005692
      4       0.545668       0.00005593
      6       0.805053       0.00005686
      8       1.039586       0.00005871
     12       1.598912       0.00005726
     13       1.700719       0.00005832
     16       2.098007       0.00005818
     19       2.340364       0.00006194
The -I switch can be used to reduce the effects cache has on performance.
Without the switch, NetPIPE tests the performance of communicating
n-byte blocks by reading from an n-byte buffer on one node, sending data
over the communications link, and writing to an n-byte buffer on the other
node. For each block size, this trial will be repeated x times, where x
typically starts out very large for small block sizes, and decreases as the
block size grows. The same buffers on each node are used repeatedly, so
after the first transfer the entire buffer will be in cache on each node,
given that the block size is less than the available cache. Thus each transfer
after the first will be read from cache on one end and written into cache on
the other. Depending on the cache architecture, a write to main memory may
not occur on the receiving end during the transfer loop.
While the performance measurements obtained from this method are certainly
useful, it is also interesting to use the -I switch to measure performance
when data is read from and written to main memory. In order to facilitate
this, large pools of memory are allocated at startup, and each n-byte transfer
comes from a region of the pool not in cache. Before each series of n-byte
transfers, every byte of a large dummy buffer is individually accessed in
order to flush the data for the transfer out of cache. After this step, the
first n-byte transfer comes from the beginning of the large pool, the second
comes from n bytes after the beginning of the pool, and so on (note that the
stride between n-byte transfers will depend on the buffer alignment setting).
In this way we make sure each read is coming from main memory.
On the receiving end, data is written into a large pool in the same fashion
that it was read on the transmitting end. Data will first be written into
cache. What happens next depends on the cache architecture, but one case is
that no transfer to main memory occurs yet. For moderately large block
sizes, however, a large number of transfer iterations will cause reuse of
cache memory. As this occurs, data in the cache locations to be replaced must
be written back to main memory, so we incur a performance penalty while we
wait for the write-back to complete.
In summary, using the -I switch gives worst-case performance (i.e. all data
transfers involve reading from or writing to memory not in cache) and not
using the switch gives best-case performance (i.e. all data transfers involve
only reading from or writing to memory in cache). Note that other combinations,
such as reading from memory in cache and writing to memory not in cache, would
give intermediary results. We chose to implement the methods that measure the
two extremes.
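For example, the two bounds could be compared for a TCP test by running the
benchmark twice with different output files (a sketch; depending on the module,
options such as -I may need to be given on both ends):

   local_host> NPtcp -h remote_host -o cache.out
   local_host> NPtcp -h remote_host -I -o nocache.out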

- we need to replace the getrusage stuff from version 2.4 with a dummy
  workload