dividing the simulation box volume into a regular-spaced grid of 3d
bricks, with one equal-volume sub-domain per processor, may assign very
different numbers of particles per processor. This can lead to poor
performance when the simulation is run in parallel.

Note that the "processors"_processors.html command allows some control
over how the box volume is split across processors. Specifically, for
a Px by Py by Pz grid of processors, it allows choice of Px, Py, and
Pz, subject to the constraint that Px * Py * Pz = P, the total number
of processors. This is sufficient to achieve good load-balance for
some problems on some processor counts. However, all the processor
sub-domains will still have the same shape and same volume.
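
For example, on P = 16 processors a setting such as the following (the
grid choice is purely illustrative) selects Px = 2, Py = 4, Pz = 2, but
each of the 16 sub-domains still has the same shape and volume:

processors 2 4 2 :pre
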
The requested load-balancing operation is only performed if the
current "imbalance factor" in particles owned by each processor
exceeds the specified {thresh} parameter. The imbalance factor is
defined as the maximum number of particles owned by any processor,
divided by the average number of particles per processor. Thus an
imbalance factor of 1.0 is perfect balance.

As an example, for 10000 particles running on 10 processors, if the
most heavily loaded processor has 1200 particles, then the factor is
1.2, meaning there is a 20% imbalance. Note that a re-balance can be
forced even if the current balance is perfect (1.0) by specifying a
{thresh} < 1.0.
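
For example, the first command below (the style and its arguments are
illustrative only) re-balances only if the current imbalance factor
exceeds 1.2, while the second, with {thresh} < 1.0, always performs
the re-balance:

balance 1.2 x uniform y uniform
balance 0.9 x uniform y uniform :pre
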
IMPORTANT NOTE: Balancing is performed even if the imbalance factor
does not exceed the {thresh} parameter if a "grid" style is specified
when the current partitioning is "tiled". The meaning of "grid" vs
"tiled" is explained below. This is to allow forcing of the
partitioning to "grid" so that the "comm_style brick"_comm_style.html
command can then be used to replace a current "comm_style
tiled"_comm_style.html setting.
When the balance command completes, it prints statistics about the
result, including the change in the imbalance factor and the change in
the maximum number of particles on any processor. For "grid" methods
(defined below) that create a logical 3d grid of processors, the
positions of all cutting planes in each of the 3 dimensions (as
fractions of the box length) are also printed.

IMPORTANT NOTE: This command attempts to minimize the imbalance
factor, as defined above, but depending on the method a perfect
balance (1.0) may not be achievable. For example, if a portion of the
system is a perfect lattice, the "grid" methods (defined below) may be
unable to achieve exact balance, because entire lattice planes will be
owned or not owned by a single processor.

IMPORTANT NOTE: The imbalance factor is also an estimate of the
maximum speed-up you can hope to achieve by running a perfectly
balanced simulation versus an imbalanced one. In the example above,
the 10000 particle simulation could run up to 20% faster if it were
perfectly balanced, versus when imbalanced. However, computational
cost is not strictly proportional to particle count, and changing the
relative size and shape of processor sub-domains may lead to
additional computational and communication overheads, e.g. in the PPPM
solver used via the "kspace_style"_kspace_style.html command. Thus
you should benchmark the run times of a simulation before and after
balancing.

The method used to perform a load balance is specified by one of the
listed styles (or more in the case of {x},{y},{z}), which are
described in detail below. There are 2 kinds of styles.

The {x}, {y}, {z}, and {shift} styles are "grid" methods which produce
a logical 3d grid of processors. They operate by changing the cutting
planes (or lines) between processors in 3d (or 2d), to adjust the
volume (area in 2d) assigned to each processor, as in the following 2d
diagram where processor sub-domains are shown and atoms are colored by
the processor that owns them. The leftmost diagram is the default
partitioning of the simulation box across processors (one sub-box for
each of 16 processors); the middle diagram is after a "grid" method
has been applied.

:c,image(JPG/balance_uniform_small.jpg,balance_uniform.jpg),image(JPG/balance_nonuniform_small.jpg,balance_nonuniform.jpg),image(JPG/balance_rcb_small.jpg,balance_rcb.jpg)

The {rcb} style is a "tiling" method which does not produce a logical
3d grid of processors. Rather it tiles the simulation domain with
rectangular sub-boxes of varying size and shape in an irregular
fashion so as to have equal numbers of particles in each sub-box, as
in the rightmost diagram above.

The "grid" methods can be used with either of the
"comm_style"_comm_style.html command options, {brick} or {tiled}. The
"tiling" methods can only be used with "comm_style
tiled"_comm_style.html.
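
For example, using the {rcb} style therefore requires first switching
the communication style, e.g. via a sequence like this (the {thresh}
value is illustrative):

comm_style tiled
balance 1.1 rcb :pre
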
The {x}, {y}, and {z} styles invoke a "grid" method for balancing, as
described above. Note that any or all of these 3 styles can be
specified together, one after the other, but they cannot be used with
any other style. This style adjusts the position of cutting planes
between processor sub-domains in specific dimensions. Only the
specified dimensions are altered.
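
For example, a command like this one (the arguments are illustrative)
adjusts only the cutting planes in the x and z dimensions, leaving the
y planes where they are:

balance 1.0 x uniform z uniform :pre
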
The {uniform} argument spaces the planes evenly, as in the leftmost
diagram above. The {numeric} argument requires listing Ps-1 numbers
that specify the position of the cutting planes in that dimension (as
fractions of the box length).
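
For example, with Py = 4 processors in the y dimension, a command such
as the following (values illustrative) places the 3 y cutting planes
at the listed fractions of the box length and spaces the x planes
evenly:

balance 1.0 x uniform y 0.4 0.5 0.6 :pre
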
The {rcb} style invokes a "tiled" method for balancing, as described
above. It performs a recursive coordinate bisectioning (RCB) of the
simulation domain. The basic idea is as follows.

The simulation domain is cut into 2 boxes by an axis-aligned cut in
the longest dimension, leaving one new box on either side of the cut.
All the processors are also partitioned into 2 groups, half assigned
to the box on the lower side of the cut, and half to the box on the
upper side. (If the processor count is odd, one side gets an extra
processor.) The cut is positioned so that the number of atoms in the
lower box is exactly the number that the processors assigned to that
box should own for load balance to be perfect. This also makes load
balance for the upper box perfect. The positioning is done
iteratively, by a bisectioning method. Note that counting atoms on
either side of the cut requires communication between all processors.

That is the procedure for the first cut. Subsequent cuts are made
recursively, in exactly the same manner. The subset of processors
assigned to each box make a new cut in the longest dimension of that
box, splitting the box, the subset of processors, and the atoms in
the box in two. The recursion continues until every processor is
assigned a sub-box of the entire simulation domain, and owns the atoms
in its sub-box.
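
As an illustrative sketch (the file name is arbitrary), assuming the
{out} keyword of the balance command is used to write the resulting
sub-domain tiling to a file, a 2d RCB balancing on 4 processors might
be invoked as:

comm_style tiled
balance 1.0 rcb out tmp.balance :pre

A file written this way contains sections like the following excerpt:
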
ITEM: NUMBER OF NODES
...
6 1 -153.919 15.3919 0
7 1 7.45545 15.3919 0
8 1 14.7305 15.3919 0
...
10 1 184.703 15.3919 0
...
ITEM: NUMBER OF SQUARES
... :pre
The coordinates of all the vertices are listed in the NODES section, 5
per processor. Note that the 4 sub-domains share vertices, so there
will be duplicate nodes in the list.

The "SQUARES" section lists the node IDs of the 4 vertices in a
rectangle for each processor (1 to 4).

For a 3d problem, the syntax is similar with 8 vertices listed for
each processor, instead of 4, and "SQUARES" replaced by "CUBES".