model of a vapor/liquid interface, or a solid with an irregular-shaped
geometry containing void regions. In this case, the LAMMPS default of
dividing the simulation box volume into a regular-spaced grid of 3d
bricks, with one equal-volume sub-domain per processor, may assign
very different numbers of particles per processor. This can lead to
poor performance when the simulation is run in parallel.

Note that the "processors"_processors.html command allows some control
over how the box volume is split across processors. Specifically, for
a Px by Py by Pz grid of processors, it allows choice of Px, Py, and
Pz, subject to the constraint that Px * Py * Pz = P, the total number
of processors. This is sufficient to achieve good load-balance for
some problems on some processor counts. However, all the processor
sub-domains will still have the same shape and same volume.

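As an illustration of this constraint, the short Python sketch below
(not part of LAMMPS) enumerates every Px by Py by Pz grid whose product
equals a given processor count P; these are the grids the
"processors"_processors.html command can select among:

def grid_choices(P):
    "Return all (Px, Py, Pz) with Px * Py * Pz = P."
    grids = []
    for px in range(1, P + 1):
        if P % px:
            continue
        rest = P // px
        for py in range(1, rest + 1):
            if rest % py == 0:
                grids.append((px, py, rest // py))
    return grids
print(grid_choices(12))   # (1, 1, 12), (1, 2, 6), ..., (2, 2, 3), ..., (12, 1, 1) :pre
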
On a particular timestep, a load-balancing operation is only performed
if the current "imbalance factor" in particles owned by each processor
exceeds the specified {thresh} parameter. The imbalance factor is
defined as the maximum number of particles owned by any processor,
divided by the average number of particles per processor. Thus an
imbalance factor of 1.0 is perfect balance.

As an example, for 10000 particles running on 10 processors, if the
most heavily loaded processor has 1200 particles, then the factor is
1.2, meaning there is a 20% imbalance. Note that re-balances can be
forced even if the current balance is perfect (1.0) by specifying a
{thresh} < 1.0.

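To make the definition concrete, here is a small Python sketch (not
LAMMPS code) that computes the imbalance factor from a list of
per-processor particle counts, reproducing the 10-processor example
above:

def imbalance_factor(counts):
    "Max particles owned by any processor divided by the average per processor."
    return max(counts) / (sum(counts) / len(counts))
# 10000 particles on 10 processors; the most loaded processor owns 1200
counts = [1200, 950, 1000, 975, 1025, 980, 1020, 990, 1010, 850]
print(imbalance_factor(counts))   # 1.2, i.e. a 20% imbalance :pre
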
IMPORTANT NOTE: This command attempts to minimize the imbalance
factor, as defined above. But depending on the method, a perfect
balance (1.0) may not be achieved. For example, "grid" methods
(defined below) that create a logical 3d grid cannot achieve perfect
balance for many irregular distributions of particles. Likewise, if a
portion of the system is a perfect lattice, e.g. the initial system is
generated by the "create_atoms"_create_atoms.html command, then "grid"
methods may be unable to achieve exact balance. This is because
entire lattice planes will be owned or not owned by a single
processor.

IMPORTANT NOTE: The imbalance factor is also an estimate of the
maximum speed-up you can hope to achieve by running a perfectly
balanced simulation versus an imbalanced one. In the example above,
the 10000 particle simulation could run up to 20% faster if it were
perfectly balanced, versus when imbalanced. However, computational
cost is not strictly proportional to particle count, and changing the
relative size and shape of processor sub-domains may lead to
additional computational and communication overheads, e.g. in the PPPM
solver used via the "kspace_style"_kspace_style.html command. Thus
you should benchmark the run times of a simulation before and after
balancing.

The {shift} style is a "grid" method which produces a logical 3d grid
of processors. It operates by changing the cutting planes (or lines)
between processors in 3d (or 2d), to adjust the volume (area in 2d)
assigned to each processor, as in the following 2d diagram where
processor sub-domains are shown and atoms are colored by the processor
that owns them. The leftmost diagram is the default partitioning of
the simulation box across processors (one sub-box for each of 16
processors); the middle diagram is after a "grid" method has been
applied.

:c,image(JPG/balance_uniform_small.jpg,balance_uniform.jpg),image(JPG/balance_nonuniform_small.jpg,balance_nonuniform.jpg),image(JPG/balance_rcb_small.jpg,balance_rcb.jpg)

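As a rough illustration of what a "grid" method does along one
dimension, the following Python sketch chooses cut positions along x so
that each slab owns roughly the same number of particles. It is a
simplified, one-dimensional stand-in for what the {shift} style does
with its cutting planes, not the LAMMPS algorithm:

import random
def balanced_cuts(xcoords, Px):
    "Place Px-1 cuts along x so each of the Px slabs owns ~len(xcoords)/Px particles."
    xs = sorted(xcoords)
    n = len(xs)
    cuts = []
    for i in range(1, Px):
        k = i * n // Px                          # particles to the left of cut i
        cuts.append(0.5 * (xs[k - 1] + xs[k]))   # halfway between the two neighbors
    return cuts
x = [random.gauss(0.0, 1.0) for _ in range(10000)]   # particles clustered near x = 0
print(balanced_cuts(x, 4))   # 3 cuts close to the 25%, 50%, 75% quantiles :pre
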
The {rcb} style is a "tiling" method which does not produce a logical
3d grid of processors. Rather it tiles the simulation domain with
rectangular sub-boxes of varying size and shape in an irregular
fashion so as to have equal numbers of particles in each sub-box, as
in the rightmost diagram above.

The "grid" methods can be used with either of the
"comm_style"_comm_style.html command options, {brick} or {tiled}. The
{rcb} style can only be used with the {tiled} option.

The {rcb} style invokes a "tiled" method for balancing, as described
above. It performs a recursive coordinate bisectioning (RCB) of the
simulation domain. The basic idea is as follows.

The simulation domain is cut into 2 boxes by an axis-aligned cut in
the longest dimension, leaving one new box on either side of the cut.
All the processors are also partitioned into 2 groups, half assigned
to the box on the lower side of the cut, and half to the box on the
upper side. (If the processor count is odd, one side gets an extra
processor.) The cut is positioned so that the number of atoms in the
lower box is exactly the number that the processors assigned to that
box should own for load balance to be perfect. This also makes load
balance for the upper box perfect. The positioning is done
iteratively, by a bisectioning method. Note that counting atoms on
either side of the cut requires communication between all processors.

That is the procedure for the first cut. Subsequent cuts are made
recursively, in exactly the same manner. The subset of processors
assigned to each box makes a new cut in the longest dimension of that
box, splitting the box, the subset of processors, and the atoms in
the box in two. The recursion continues until every processor is
assigned a sub-box of the entire simulation domain, and owns the atoms
in it.

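The following Python sketch (an illustration only, not the LAMMPS
implementation) mirrors the procedure just described for a 2d set of
points: at each level it cuts the longest dimension of the current
box, positions the cut by bisection so the lower box owns its target
share of the atoms, splits the processors into two groups, and
recurses until each processor owns one sub-box:

import random
def rcb(points, box, nprocs, proc0=0, assign=None):
    "Assign points in box ((xlo,ylo),(xhi,yhi)) to nprocs processors by RCB."
    if assign is None:
        assign = {}
    if nprocs == 1:
        assign[proc0] = (box, points)
        return assign
    (xlo, ylo), (xhi, yhi) = box
    dim = 0 if (xhi - xlo) >= (yhi - ylo) else 1    # cut the longest dimension
    nlow_procs = nprocs // 2                        # processors for the lower box
    target = len(points) * nlow_procs / nprocs      # atoms the lower box should own
    lo, hi = (xlo, xhi) if dim == 0 else (ylo, yhi)
    for _ in range(50):                             # bisection search on the cut position
        cut = 0.5 * (lo + hi)
        nlow = sum(1 for p in points if p[dim] < cut)
        if nlow < target:
            lo = cut
        else:
            hi = cut
    cut = 0.5 * (lo + hi)
    low = [p for p in points if p[dim] < cut]
    high = [p for p in points if p[dim] >= cut]
    if dim == 0:
        low_box, high_box = ((xlo, ylo), (cut, yhi)), ((cut, ylo), (xhi, yhi))
    else:
        low_box, high_box = ((xlo, ylo), (xhi, cut)), ((xlo, cut), (xhi, yhi))
    rcb(low, low_box, nlow_procs, proc0, assign)
    rcb(high, high_box, nprocs - nlow_procs, proc0 + nlow_procs, assign)
    return assign
pts = [(random.random() ** 2, random.random()) for _ in range(4000)]   # non-uniform cloud
domains = rcb(pts, ((0.0, 0.0), (1.0, 1.0)), 4)
print([len(p) for _, p in domains.values()])   # four counts, each close to 1000 :pre
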
Here is an excerpt of the file that is written when a re-balancing is
performed, for a 2d problem on 4 processors ("..." marks lines omitted
from this excerpt):

ITEM: NUMBER OF SQUARES
...
ITEM: NUMBER OF NODES
...
6 1 -153.919 15.3919 0
7 1 7.45545 15.3919 0
8 1 14.7305 15.3919 0
...
10 1 184.703 15.3919 0 :pre

The "SQUARES" section lists the node IDs of the 4 vertices in a
rectangle for each processor (1 to 4). The first SQUARE (ID 1, for
processor 0) is a rectangle of type 1 (equal to its SQUARE ID) and
contains vertices 1,2,7,6. The coordinates of all the vertices are
listed in the NODES section. Note that the 4 sub-domains share
vertices, so there are only 10 unique vertices in total.

For a 3d problem, the syntax is similar with "SQUARES" replaced by
"CUBES", and 8 vertices listed for each processor, instead of 4.

Each time rebalancing is performed a new timestamp is written with new
NODES values. The SQUARES or CUBES sections are not repeated, since
they do not change.

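As a small illustration of how the SQUARES and NODES data fit
together, the Python sketch below computes the area of one processor's
rectangle from its 4 vertex IDs. The vertex ordering and the node
coordinates used here are made-up placeholders for illustration, not
values from an actual run:

# Hypothetical NODES data: node ID -> (x, y) coordinates for a 2d problem
nodes = {1: (0.0, 0.0), 2: (2.5, 0.0), 6: (0.0, 10.0), 7: (2.5, 10.0)}
# One SQUARES entry: processor 0's rectangle is bounded by vertices 1,2,7,6
square = [1, 2, 7, 6]
def subdomain_area(square, nodes):
    "Shoelace formula over the 4 vertices, taken in order around the rectangle."
    pts = [nodes[i] for i in square]
    area = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0
print(subdomain_area(square, nodes))   # 25.0 for the placeholder coordinates :pre
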