model of a vapor/liquid interface, or a solid with an irregular-shaped
geometry containing void regions. In this case, the LAMMPS default of
dividing the simulation box volume into a regular-spaced grid of 3d
bricks, with one equal-volume sub-domain per processor, may assign
very different numbers of particles per processor. This can lead to
poor performance when the simulation is run in parallel.
<P>Note that the <A HREF = "processors.html">processors</A> command allows some control
over how the box volume is split across processors. Specifically, for
a Px by Py by Pz grid of processors, it allows choice of Px, Py, and
Pz, subject to the constraint that Px * Py * Pz = P, the total number
of processors. This is sufficient to achieve good load-balance for
some problems on some processor counts. However, all the processor
sub-domains will still have the same shape and same volume.
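<P>As a quick illustration of the constraint (this snippet is not part of
LAMMPS, and P = 12 is an arbitrary choice), the possible logical grids for a
given processor count can be enumerated in a few lines of Python:

<PRE>P = 12  # arbitrary total processor count, for illustration only
# All (Px, Py, Pz) with Px * Py * Pz = P.  Every choice gives equal-volume
# sub-domains; only the shape of the sub-domains differs.
grids = [(px, py, P // (px * py))
         for px in range(1, P + 1) if P % px == 0
         for py in range(1, P // px + 1) if (P // px) % py == 0]
print(grids)  # (1, 1, 12), (1, 2, 6), ..., (2, 2, 3), ..., (12, 1, 1)
</PRE>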
<P>On a particular timestep, a load-balancing operation is only performed
if the current "imbalance factor" in particles owned by each processor
exceeds the specified <I>thresh</I> parameter. The imbalance factor is
defined as the maximum number of particles owned by any processor,
divided by the average number of particles per processor. Thus an
imbalance factor of 1.0 is perfect balance.

<P>As an example, for 10000 particles running on 10 processors, if the
most heavily loaded processor has 1200 particles, then the factor is
1.2, meaning there is a 20% imbalance. Note that re-balances can be
forced even if the current balance is perfect (1.0) by specifying a
<I>thresh</I> < 1.0.
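<P>The imbalance factor is simple enough to check by hand or with a few
lines of Python. In this sketch (not part of LAMMPS) the per-processor
particle counts are made up to reproduce the example above:

<PRE># imbalance factor = max particles on any processor / average particles per processor
counts = [1200, 950, 1000, 980, 1020, 970, 990, 960, 940, 990]  # hypothetical counts, 10000 total
imbalance = max(counts) / (sum(counts) / len(counts))
print(imbalance)  # 1.2, i.e. a 20% imbalance; a re-balance occurs if thresh < 1.2
</PRE>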
<P>IMPORTANT NOTE: This command attempts to minimize the imbalance
factor, as defined above. But depending on the method a perfect
balance (1.0) may not be achieved. For example, "grid" methods
(defined below) that create a logical 3d grid cannot achieve perfect
balance for many irregular distributions of particles. Likewise, if a
portion of the system is a perfect lattice, e.g. the initial system is
generated by the <A HREF = "create_atoms.html">create_atoms</A> command, then "grid"
methods may be unable to achieve exact balance. This is because
entire lattice planes will be owned or not owned by a single
processor.
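<P>To see why, consider a made-up example: 10 identical lattice planes
distributed over 4 processors by cutting planes. Since whole planes end up on
one processor or another, the best possible split is 3/3/2/2 and the imbalance
factor cannot reach 1.0:

<PRE>atoms_per_plane = 100                       # hypothetical lattice planes of 100 atoms each
planes_per_proc = [3, 3, 2, 2]              # best possible split of 10 planes over 4 processors
counts = [n * atoms_per_plane for n in planes_per_proc]
print(max(counts) / (sum(counts) / len(counts)))  # 1.2 -- perfect balance is unreachable
</PRE>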
<P>IMPORTANT NOTE: The imbalance factor is also an estimate of the
maximum speed-up you can hope to achieve by running a perfectly
balanced simulation versus an imbalanced one. In the example above,
the 10000 particle simulation could run up to 20% faster if it were
perfectly balanced, versus when imbalanced. However, computational
cost is not strictly proportional to particle count, and changing the
relative size and shape of processor sub-domains may lead to
additional computational and communication overheads, e.g. in the PPPM
solver used via the <A HREF = "kspace_style.html">kspace_style</A> command. Thus
you should benchmark the run times of a simulation before and after
balancing.
<P>The <I>shift</I> style is a "grid" method which produces a logical 3d grid
of processors. It operates by changing the cutting planes (or lines)
between processors in 3d (or 2d), to adjust the volume (area in 2d)
assigned to each processor, as in the following 2d diagram where
processor sub-domains are shown and atoms are colored by the processor
that owns them. The leftmost diagram is the default partitioning of
the simulation box across processors (one sub-box for each of 16
processors); the middle diagram is after a "grid" method has been
applied.
<CENTER><A HREF = "balance_uniform.jpg"><IMG SRC = "JPG/balance_uniform_small.jpg"></A><A HREF = "balance_nonuniform.jpg"><IMG SRC = "JPG/balance_nonuniform_small.jpg"></A><A HREF = "balance_rcb.jpg"><IMG SRC = "JPG/balance_rcb_small.jpg"></A>
<P>The <I>rcb</I> style is a "tiling" method which does not produce a logical
3d grid of processors. Rather it tiles the simulation domain with
rectangular sub-boxes of varying size and shape in an irregular
fashion so as to have equal numbers of particles in each sub-box, as
in the rightmost diagram above.
<P>The "grid" methods can be used with either of the
<A HREF = "comm_style.html">comm_style</A> command options, <I>brick</I> or <I>tiled</I>. The
<P>The <I>rcb</I> style invokes a "tiled" method for balancing, as described
above. It performs a recursive coordinate bisectioning (RCB) of the
simulation domain. The basic idea is as follows.
<P>The simulation domain is cut into 2 boxes by an axis-aligned cut in
the longest dimension, leaving one new box on either side of the cut.
All the processors are also partitioned into 2 groups, half assigned
to the box on the lower side of the cut, and half to the box on the
upper side. (If the processor count is odd, one side gets an extra
processor.) The cut is positioned so that the number of atoms in the
lower box is exactly the number that the processors assigned to that
box should own for load balance to be perfect. This also makes load
balance for the upper box perfect. The positioning is done
iteratively, by a bisectioning method. Note that counting atoms on
either side of the cut requires communication between all processors.
<P>That is the procedure for the first cut. Subsequent cuts are made
recursively, in exactly the same manner. The subset of processors
assigned to each box make a new cut in the longest dimension of that
box, splitting the box, the subset of processors, and the atoms in
the box in two. The recursion continues until every processor is
assigned a sub-box of the entire simulation domain, and owns the atoms
in that sub-box.
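<P>The recursion is straightforward to express in code. The following
Python sketch (not the LAMMPS implementation) is a toy serial version: it has
no inter-processor communication, and it picks each cut by sorting the points
rather than by iterative bisection, but it shows how the box, the processors,
and the particles are split in two at every level:

<PRE>import random

# Toy serial RCB.  box = ((xlo, xhi), (ylo, yhi)); points = list of (x, y) tuples.
def rcb(box, procs, points, owned):
    if len(procs) == 1:
        owned[procs[0]] = (box, points)
        return
    dim = 0 if box[0][1] - box[0][0] >= box[1][1] - box[1][0] else 1  # longest dimension
    nlower = len(procs) // 2                     # processors assigned to the lower side
    pts = sorted(points, key=lambda p: p[dim])
    ncut = len(pts) * nlower // len(procs)       # particles the lower side should own
    cut = pts[ncut][dim]                         # cut position
    lo = ((box[0][0], cut), box[1]) if dim == 0 else (box[0], (box[1][0], cut))
    hi = ((cut, box[0][1]), box[1]) if dim == 0 else (box[0], (cut, box[1][1]))
    rcb(lo, procs[:nlower], pts[:ncut], owned)
    rcb(hi, procs[nlower:], pts[ncut:], owned)

random.seed(2)
pts = [(random.random(), random.random()) for _ in range(16)]  # 16 hypothetical particles
owned = {}
rcb(((0.0, 1.0), (0.0, 1.0)), list(range(4)), pts, owned)      # 4 "processors"
for proc, (box, mine) in sorted(owned.items()):
    print(proc, len(mine), box)                                # each processor owns 4 particles
</PRE>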
<P>Here is an example of this output file for 4 processors for a 2d problem:
<PRE>ITEM: TIMESTEP
...
ITEM: NUMBER OF NODES
...
ITEM: NODES
...
6 1 -153.919 15.3919 0
7 1 7.45545 15.3919 0
8 1 14.7305 15.3919 0
...
10 1 184.703 15.3919 0
ITEM: NUMBER OF SQUARES
...
ITEM: SQUARES
...
</PRE>
<P>The "SQUARES" lists the node IDs of the 4 vertices in a rectangle for
283
each processor (1 to 4). The first SQUARE 1 (for processor 0) is a
284
rectangle of type 1 (equal to SQUARE ID) and contains vertices
285
1,2,7,6. The coordinates of all the vertices are listed in the NODES
286
section. Note that the 4 sub-domains share vertices, so there are
287
only 10 unique vertices in total.
289
<P>For a 3d problem, the syntax is similar with "SQUARES" replaced by
290
"CUBES", and 8 vertices listed for each processor, instead of 4.
292
<P>Each time rebalancing is performed a new timestamp is written with new
293
NODES values. The SQUARES of CUBES sections are not repeated, since
309
<P>The coordinates of all the vertices are listed in the NODES section, 5
310
per processor. Note that the 4 sub-domains share vertices, so there
311
will be duplicate nodes in the list.
313
<P>The "SQUARES" section lists the node IDs of the 4 vertices in a
314
rectangle for each processor (1 to 4).
316
<P>For a 3d problem, the syntax is similar with 8 vertices listed for
317
each processor, instead of 4, and "SQUARES" replaced by "CUBES".