<!--#include virtual="header.txt"-->

<h1>Multifactor Priority Plugin</h1>

<b>Note:</b> This document describes features added to SLURM version 2.0.

<UL>
<LI> <a href=#intro>Introduction</a>
<LI> <a href=#mfjppintro>Multi-factor Job Priority Plugin</a>
<LI> <a href=#general>Job Priority Factors In General</a>
<LI> <a href=#age>Age Factor</a>
<LI> <a href=#jobsize>Job Size Factor</a>
<LI> <a href=#partition>Partition Factor</a>
<LI> <a href=#qos>Quality of Service (QOS) Factor</a>
<LI> <a href=#fairshare>Fair-share Factor</a>
<LI> <a href=#sprio>The <i>sprio</i> utility</a>
<LI> <a href=#config>Configuration</a>
<LI> <a href=#configexample>Configuration Example</a>
</UL>

<!-------------------------------------------------------------------------->
<a name=intro>
<h2>Introduction</h2></a>

<P> By default, SLURM assigns job priority on a First In, First Out (FIFO) basis. FIFO scheduling should be configured when SLURM is controlled by an external scheduler.</P>

<P> The <i>PriorityType</i> parameter in the slurm.conf file selects the priority plugin. The default value for this variable is "priority/basic", which enables simple FIFO scheduling. (See <a href="#config">Configuration</a> below.)</P>

<P> SLURM version 2.0 includes the Multi-factor Job Priority plugin. This plugin provides a versatile facility for ordering the queue of jobs waiting to be scheduled.</P>

<!-------------------------------------------------------------------------->
<a name=mfjppintro>
<h2>Multi-factor 'Factors'</h2></a>

<P> There are five factors in the Multi-factor Job Priority plugin that influence job priority:</P>

<DL>
<DT> Age
<DD> the length of time a job has been waiting in the queue, eligible to be scheduled
<DT> Fair-share
<DD> the difference between the portion of the computing resource that has been promised and the amount of resources that has been consumed
<DT> Job size
<DD> the number of nodes a job is allocated
<DT> Partition
<DD> a factor associated with each node partition
<DT> QOS
<DD> a factor associated with each Quality Of Service (still under development)
</DL>

<P> Additionally, a weight can be assigned to each of the above factors. This provides the ability to enact a policy that blends any combination of the above factors in any proportion desired. For example, a site could configure fair-share to be the dominant factor (say 70%), set the job size and the age factors to each contribute 15%, and set the partition and QOS influences to zero.</P>

<!-------------------------------------------------------------------------->
<a name=general>
<h2>Job Priority Factors In General</h2></a>

<P> The job's priority at any given time will be a weighted sum of all the factors that have been enabled in the slurm.conf file. Job priority can be expressed as:</P>

<PRE>
Job_priority =
	(age_weight) * (age_factor) +
	(fair-share_weight) * (fair-share_factor) +
	(job_size_weight) * (job_size_factor) +
	(partition_weight) * (partition_factor) +
	(QOS_weight) * (QOS_factor)
</PRE>

<P> All of the factors in this formula are floating point numbers that range from 0.0 to 1.0. The weights are unsigned, 32-bit integers. The job's priority is an integer that ranges between 0 and 4294967295. The higher the number, the higher the job will be positioned in the queue, and the sooner the job will be scheduled. A job's priority, and hence its order in the queue, will vary over time. For example, the longer a job sits in the queue, the higher its priority will grow when the age_weight is non-zero.</P>
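
<P> As an illustration only, the weighted sum above can be sketched in Python. The function name, the dictionary keys, the sample factor values, and the truncation and capping to an unsigned 32-bit integer are assumptions made for this sketch; they are not taken from the SLURM source. The weights are the ones used in the configuration examples at the end of this document.</P>

```python
MAX_PRIORITY = 4294967295  # largest unsigned 32-bit integer


def job_priority(weights, factors):
    """Weighted sum of the priority factors (each factor is in 0.0..1.0).

    Assumption: the floating point sum is truncated to an integer and
    capped at MAX_PRIORITY.
    """
    total = sum(weights[name] * factors[name] for name in weights)
    return min(int(total), MAX_PRIORITY)


# Weights from the configuration examples in this document.
weights = {"age": 1000, "fairshare": 10000, "jobsize": 1000,
           "partition": 1000, "qos": 0}

# Hypothetical factor values for one pending job.
factors = {"age": 0.25, "fairshare": 0.5, "jobsize": 0.125,
           "partition": 1.0, "qos": 0.0}

priority = job_priority(weights, factors)
print(priority)
```

<P> With these sample values the fair-share term contributes 5000 of the total, which shows how a larger weight makes one factor dominate the ordering.</P>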

<P> <b>IMPORTANT:</b> The weight values should be high enough to get a good set of significant digits since all the factors are floating point numbers from 0.0 to 1.0. For example, one job could have a fair-share factor of .59534 and another job could have a fair-share factor of .50002. If the fair-share weight is only set to 10, both jobs would have the same fair-share priority. Therefore, set the weights high enough to avoid this scenario, starting around 1000 or so for those factors you want to make predominant.</P>
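
<P> The loss of resolution described above can be reproduced with a few lines of Python. This is a sketch that assumes the weighted term is truncated to an integer; the function name is hypothetical.</P>

```python
def weighted_term(weight, factor):
    # Assumption: the product is truncated to an integer priority value.
    return int(weight * factor)


job_a, job_b = 0.59534, 0.50002  # fair-share factors from the text

# With a weight of 10 the two jobs become indistinguishable.
low_a = weighted_term(10, job_a)
low_b = weighted_term(10, job_b)

# With a weight of 1000 the difference survives.
high_a = weighted_term(1000, job_a)
high_b = weighted_term(1000, job_b)

print(low_a, low_b, high_a, high_b)
```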
<!-------------------------------------------------------------------------->
<a name=age>
<h2>Age Factor</h2></a>

<P> The age factor represents the length of time a job has been waiting in the queue, eligible to run. In general, the longer a job waits in the queue, the larger its age factor grows. However, the age factor for a dependent job will not change while it waits for the job it depends on to complete. Also, the age factor will not change when scheduling is withheld for a job whose node or time limits exceed the cluster's current limits.</P>

<P> At some configurable length of time (<i>PriorityMaxAge</i>), the age factor will max out at 1.0.</P>
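
<P> A minimal sketch of this behavior follows. It assumes the age factor grows linearly with eligible wait time until it saturates at <i>PriorityMaxAge</i>; the text above only states that the factor grows and then maxes out, so the linear ramp and the function name are assumptions.</P>

```python
WEEK_SECONDS = 7 * 24 * 3600  # the documented PriorityMaxAge default, 7-0


def age_factor(eligible_wait_seconds, max_age_seconds=WEEK_SECONDS):
    """Assumed linear ramp from 0.0 to 1.0, saturating at max_age_seconds."""
    if max_age_seconds <= 0:
        raise ValueError("max_age_seconds must be positive")
    return min(eligible_wait_seconds / max_age_seconds, 1.0)


print(age_factor(0))                  # job that just became eligible
print(age_factor(WEEK_SECONDS // 2))  # half of PriorityMaxAge
print(age_factor(3 * WEEK_SECONDS))   # well past PriorityMaxAge
```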
<!-------------------------------------------------------------------------->
<a name=jobsize>
<h2>Job Size Factor</h2></a>

<P> The job size factor correlates to the number of nodes the job has requested. This factor can be configured to favor larger jobs or smaller jobs based on the state of the <i>PriorityFavorSmall</i> boolean in the slurm.conf file. When <i>PriorityFavorSmall</i> is NO, the larger the job, the greater its job size factor will be. A job that requests all the nodes on the machine will get a job size factor of 1.0. When <i>PriorityFavorSmall</i> is YES, a single-node job will receive a job size factor of 1.0.</P>
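
<P> The sketch below shows one linear mapping that is consistent with the two endpoints stated above (a full-machine job scores 1.0 when <i>PriorityFavorSmall</i> is NO, and a single-node job scores 1.0 when it is YES). The exact formula used by the plugin is not given in this document, so the arithmetic and the function name here are assumptions.</P>

```python
def job_size_factor(requested_nodes, total_nodes, favor_small=False):
    """Hypothetical linear job size factor in the range (0.0, 1.0]."""
    if not 1 <= requested_nodes <= total_nodes:
        raise ValueError("requested_nodes must be between 1 and total_nodes")
    if favor_small:
        # Smallest job scores 1.0; the score shrinks as the job grows.
        return (total_nodes - requested_nodes + 1) / total_nodes
    # Largest job scores 1.0; the score shrinks as the job shrinks.
    return requested_nodes / total_nodes


print(job_size_factor(64, 64))                   # whole machine, favor large
print(job_size_factor(1, 64, favor_small=True))  # single node, favor small
```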
<!-------------------------------------------------------------------------->
<a name=partition>
<h2>Partition Factor</h2></a>

<P> Each node partition can be assigned a factor from 0.0 to 1.0. The higher the number, the greater the job priority will be for jobs that are slated to run in this partition.</P>

<!-------------------------------------------------------------------------->
<a name=qos>
<h2>Quality of Service (QOS) Factor</h2></a>

<P> Each QOS can be assigned a factor from 0.0 to 1.0. The higher the number, the greater the job priority will be for jobs that request this QOS. (Still under development.)</P>

<!-------------------------------------------------------------------------->
<a name=fairshare>
<h2>Fair-share Factor</h2></a>

<b>Note:</b> Computing the fair-share factor requires the installation and operation of the <a href="accounting.html">SLURM Accounting Database</a> to provide the assigned shares and the consumed computing resources described below.

<P> The fair-share component to a job's priority influences the order in which a user's queued jobs are scheduled to run based on the portion of the computing resources they have been allocated and the resources their jobs have already consumed. The fair-share factor does not involve a fixed allotment, whereby a user's access to a machine is cut off once that allotment is reached.</P>

<P> Instead, the fair-share factor serves to prioritize queued jobs such that those jobs charging accounts that are under-serviced are scheduled first, while jobs charging accounts that are over-serviced are scheduled when the machine would otherwise go idle.</P>

<P> SLURM's fair-share factor is a floating point number between 0.0 and 1.0 that reflects the shares of a computing resource that a user has been allocated and the amount of computing resources the user's jobs have consumed. The higher the value, the higher the placement in the queue of jobs waiting to be scheduled.</P>

<P> The computing resource is currently defined to be computing cycles delivered by a machine in the units of processor*seconds. Future versions of the fair-share factor may additionally include a memory integral component.</P>

<h3> Normalized Shares</h3>

<P> The fair-share hierarchy represents the portions of the computing resource that have been allocated to multiple projects. These allocations are assigned to an account. There can be multiple levels of allocations made as allocations of a given account are further divided to sub-accounts:</P>

<img src=AllocationPies.gif width=400 ><BR>
Figure 1. Machine Allocation

<P> The chart above shows the resources of the machine allocated to four accounts, A, B, C and D. Furthermore, account A's shares are further allocated to sub-accounts, A1 through A4. Users are granted permission (through sacctmgr) to submit jobs against specific accounts. If there are 10 users given equal shares in Account A3, they will each be allocated 1% of the machine.</P>

<P> A user's normalized share is simply:</P>

<PRE>
S = (S<sub>user</sub> / S<sub>siblings</sub>) *
    (S<sub>account</sub> / S<sub>sibling-accounts</sub>) *
    (S<sub>parent</sub> / S<sub>parent-siblings</sub>) * ...
</PRE>

<P> Where:</P>

<DL>
<DT> S
<DD> is the user's normalized share, between zero and one
<DT> S<sub>user</sub>
<DD> is the number of shares of the account allocated to the user
<DT> S<sub>siblings</sub>
<DD> is the total number of shares allocated to all users permitted to charge the account (including S<sub>user</sub>)
<DT> S<sub>account</sub>
<DD> is the number of shares of the parent account allocated to the account
<DT> S<sub>sibling-accounts</sub>
<DD> is the total number of shares allocated to all sub-accounts of the parent account
<DT> S<sub>parent</sub>
<DD> is the number of shares of the grandparent account allocated to the parent
<DT> S<sub>parent-siblings</sub>
<DD> is the total number of shares allocated to all sub-accounts of the grandparent account
</DL>
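
<P> The product of ratios above can be sketched in Python as follows. The function name and the list-of-pairs representation are assumptions made for illustration. The numbers are those of the fair-share example later in this document: a user holding 1 of 2 shares in an account that holds 10 of 40 shares under a parent holding 40 of 100 shares.</P>

```python
def normalized_share(levels):
    """Multiply (shares held / total sibling shares) at every level.

    levels lists (held, total) pairs, starting with the user and moving
    up through the account, its parent, and so on.
    """
    share = 1.0
    for held, total in levels:
        share *= held / total
    return share


# User with 1 of 2 shares in account C; C has 10 of 40; A has 40 of 100.
user_share = normalized_share([(1, 2), (10, 40), (40, 100)])

# The same user holding 4 shares while the other user keeps 1.
bigger_share = normalized_share([(4, 5), (10, 40), (40, 100)])

print(round(user_share, 4), round(bigger_share, 4))
```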

<h3> Normalized Usage</h3>

<P> The total number of processor*seconds that a machine is able to deliver over a fixed time period (for example, a day) is a fixed quantity. The processor*seconds allocated to every job are tracked and saved to the SLURM database in real time. If one only considered usage over a fixed time period, then calculating a user's normalized usage would be a simple quotient:</P>

<PRE>
U<sub>N</sub> = U<sub>user</sub> / R<sub>available</sub>
</PRE>

<P> Where:</P>

<DL>
<DT> U<sub>N</sub>
<DD> is normalized usage, between zero and one
<DT> U<sub>user</sub>
<DD> is the processor*seconds consumed by all of a user's jobs in a given account over a fixed time period
<DT> R<sub>available</sub>
<DD> is the total number of processor*seconds a machine can deliver during that same time period
</DL>

<P> However, significant real-world usage quantities span multiple time periods. Rather than treating usage over a number of weeks or months with equal importance, SLURM's fair-share priority calculation places more importance on the most recent resource usage and less importance on usage from the distant past.</P>

<P> The SLURM usage metric is based on a half-life formula that favors the most recent usage statistics. Usage statistics from the past decrease in importance based on a single decay factor, D:</P>

<PRE>
U<sub>H</sub> = U<sub>current_period</sub> +
    (D * U<sub>last_period</sub>) + (D * D * U<sub>period-2</sub>) + ...
</PRE>

<P> Where:</P>

<DL>
<DT> U<sub>H</sub>
<DD> is the historical usage subject to the half-life decay
<DT> U<sub>current_period</sub>
<DD> is the usage charged over the current measurement period
<DT> U<sub>last_period</sub>
<DD> is the usage charged over the last measurement period
<DT> U<sub>period-2</sub>
<DD> is the usage charged over the second-to-last measurement period
<DT> D
<DD> is a decay factor between zero and one that delivers the half-life decay based on the <i>PriorityDecayHalfLife</i> setting in the slurm.conf file. Without accruing additional usage, a user's U<sub>H</sub> usage will decay to half its value after a time period of <i>PriorityDecayHalfLife</i>.
</DL>

<P> In practice, the <i>PriorityDecayHalfLife</i> could be a matter of seconds or days as appropriate for each site. The measurement period is nominally 5 minutes. The decay factor, D, is assigned the value that will achieve the half-life decay rate specified by the <i>PriorityDecayHalfLife</i> parameter.</P>
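
<P> One way to obtain such a decay factor is to require that applying D once per measurement period halves the usage after one half-life, which gives D = 0.5<sup>(period / half-life)</sup>. The Python sketch below uses that relation together with the decayed sum for U<sub>H</sub>; the function names and the per-period list representation are assumptions, not the plugin's actual code.</P>

```python
def decay_factor(half_life_seconds, period_seconds=300):
    """D such that D ** (half_life / period) == 0.5."""
    return 0.5 ** (period_seconds / half_life_seconds)


def historical_usage(usage_by_period, d):
    """U_H = U_current + D * U_last + D*D * U_period-2 + ...

    usage_by_period[0] is the current period, [1] the previous one, etc.
    """
    return sum(usage * d ** age for age, usage in enumerate(usage_by_period))


half_life = 7 * 24 * 3600          # 7-0, the documented default
d = decay_factor(half_life)        # nominal 5 minute measurement period
periods_per_half_life = half_life // 300

print(d, d ** periods_per_half_life)
print(historical_usage([4.0, 2.0, 8.0], 0.5))
```

<P> Raising D to the number of measurement periods in one half-life returns approximately 0.5, which is the defining property of the half-life decay.</P>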

<P> The historical resources a machine has available could be similarly aggregated with the same decay factor:</P>

<PRE>
R<sub>H</sub> = R<sub>current_period</sub> +
    (D * R<sub>last_period</sub>) + (D * D * R<sub>period-2</sub>) + ...
</PRE>

<P> However, a simpler formula is:</P>

<PRE>
R<sub>H</sub> = num_procs * half_life * 2
</PRE>

<P> Where:</P>

<DL>
<DT> R<sub>H</sub>
<DD> is the historical resources available subject to the same half-life decay as the usage formula
<DT> num_procs
<DD> is the total number of processors in the cluster
<DT> half_life
<DD> is the configured half-life (<i>PriorityDecayHalfLife</i>)
</DL>

<P> A user's normalized usage that spans multiple time periods then becomes:</P>

<PRE>
U = U<sub>H</sub> / R<sub>H</sub>
</PRE>

<h3>Simplified Fair-Share Formula</h3>

<P> The simplified formula for calculating the fair-share factor for usage that spans multiple time periods and is subject to a half-life decay is:</P>

<PRE>
F = (S - U + 1) / 2
</PRE>

<P> Where:</P>

<DL>
<DT> F
<DD> is the fair-share factor
<DT> S
<DD> is the normalized shares
<DT> U
<DD> is the normalized usage factoring in half-life decay
</DL>

<P> The fair-share factor will therefore range from zero to one, where one represents the highest priority for a job. A fair-share factor of 0.5 indicates that the user's jobs have used exactly the portion of the machine that they have been allocated. A fair-share factor above 0.5 indicates that the user's jobs have consumed less than their allocated share, while a fair-share factor below 0.5 indicates that the user's jobs have consumed more than their allocated share of the computing resources.</P>
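
<P> The interpretation around the 0.5 midpoint can be checked with a short Python sketch of F = (S - U + 1) / 2, the same form as the fair-share formula given later in this document but with plain normalized usage. The function name and the sample values are hypothetical.</P>

```python
def fair_share_factor(normalized_shares, normalized_usage):
    """F = (S - U + 1) / 2, with S and U both in the range 0.0..1.0."""
    return (normalized_shares - normalized_usage + 1) / 2


on_target = fair_share_factor(0.25, 0.25)      # used exactly the allocation
under_serviced = fair_share_factor(0.25, 0.0)  # used nothing yet
over_serviced = fair_share_factor(0.25, 0.75)  # used three times the share

print(on_target, under_serviced, over_serviced)
```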

<h3>The Fair-share Factor Under An Account Hierarchy</h3>

<P> The method described above presents a system whereby the priority of a user's job is calculated based on the portion of the machine allocated to the user and the historical usage of all the jobs run by that user under a specific account.</P>

<P> Another layer of "fairness" is necessary, however: one that factors in the usage of other users drawing from the same account. This allows a job's fair-share factor to be influenced by the computing resources delivered to jobs of other users drawing from the same account.</P>

<P> If there are two members of a given account, and if one of those users has run many jobs under that account, the job priority of a job submitted by the user who has not run any jobs will be negatively affected. This ensures that the combined usage charged to an account matches the portion of the machine that is allocated to that account.</P>

<P> In the example below, when user 3 submits their first job using account C, they will want their job's priority to reflect all the resources delivered to account B. They do not care that user 1 has been using up a significant portion of the cycles allocated to account B and user 2 has yet to run a job out of account B. If user 2 submits a job using account B and user 3 submits a job using account C, user 3 expects their job to be scheduled before the job from user 2.</P>

<img src=UsagePies.gif width=400 ><BR>
Figure 2. Usage Example

<h3>The SLURM Fair-Share Formula</h3>

<P> The SLURM fair-share formula has been designed to provide fair scheduling to users based on the allocation and usage of every account.</P>

<P> The actual formula used is a refinement of the formula presented above:</P>

<PRE>
F = (S - U<sub>E</sub> + 1) / 2
</PRE>

<P> The difference is that the usage term is effective usage, which is defined as:</P>

<PRE>
U<sub>E</sub> = U<sub>Achild</sub> +
    ((U<sub>Eparent</sub> - U<sub>Achild</sub>) * S<sub>child</sub>/S<sub>all_siblings</sub>)
</PRE>

<P> Where:</P>

<DL>
<DT> U<sub>E</sub>
<DD> is the effective usage of the child user or child account
<DT> U<sub>Achild</sub>
<DD> is the actual usage of the child user or child account
<DT> U<sub>Eparent</sub>
<DD> is the effective usage of the parent account
<DT> S<sub>child</sub>
<DD> is the shares allocated to the child user or child account
<DT> S<sub>all_siblings</sub>
<DD> is the shares allocated to all the children of the parent account
</DL>

<P> This formula applies only from the second tier of accounts below root downward. For the tier of accounts just under root, effective usage equals actual usage.</P>

<P> Because the formula for effective usage includes a term of the effective usage of the parent, the calculation for each account in the tree must start at the second tier of accounts and proceed downward: to the children accounts, then grandchildren, etc. The effective usage of the users will be the last to be calculated.</P>

<P> Plugging the effective usage into the fair-share formula above yields a fair-share factor that reflects the aggregated usage charged to each of the accounts in the fair-share hierarchy.</P>

<P> The following example demonstrates the effective usage calculations and resultant fair-share factors. (See Figure 3 below.)</P>

<P> The machine's computing resources are allocated to accounts A and D with 40 and 60 shares respectively. Account A is further divided into two children accounts, B with 30 shares and C with 10 shares. Account D is further divided into two children accounts, E with 25 shares and F with 35 shares.</P>

<P> Note: the shares at any given tier in the Account hierarchy do not need to total up to 100 shares. This example shows them totaling up to 100 to make the arithmetic easier to follow in your head.</P>

<P> User 1 is granted permission to submit jobs against the B account. Users 2 and 3 are granted one share each in the C account. User 4 is the sole member of the E account and User 5 is the sole member of the F account.</P>

<P> Note: accounts A and D do not have any user members in this example, though users could have been assigned.</P>

<P> The shares assigned to each account make it easy to determine normalized shares of the machine's complete resources. Account A has .4 normalized shares, B has .3 normalized shares, etc. Users who are sole members of an account have the same number of normalized shares as the account. (E.g., User 1 has .3 normalized shares.) Users who share accounts have a portion of the normalized shares based on their shares. For example, if user 2 had been allocated 4 shares instead of 1, user 2 would have had .08 normalized shares. With users 2 and 3 each holding 1 share, they each have a normalized share of 0.05.</P>

<P> Users 1, 2, and 4 have run jobs that have consumed the machine's computing resources. User 1's actual usage is 0.2 of the machine; user 2's is 0.25, and user 4's is 0.25.</P>

<P> The actual usage charged to each account is represented by the solid arrows. The actual usage charged to each account is summed as one goes up the tree. Account C's usage is the sum of the usage of Users 2 and 3; account A's actual usage is the sum of its children, accounts B and C.</P>

<img src=ExampleUsage.gif width=400 ><BR>
Figure 3. Fair-share Example

<UL>
<LI> User 1 normalized share: 0.3
<LI> User 2 normalized share: 0.05
<LI> User 3 normalized share: 0.05
<LI> User 4 normalized share: 0.25
<LI> User 5 normalized share: 0.35
</UL>

<P> As stated above, the effective usage is computed from the formula:</P>

<PRE>
U<sub>E</sub> = U<sub>Achild</sub> +
    ((U<sub>Eparent</sub> - U<sub>Achild</sub>) * S<sub>child</sub>/S<sub>all_siblings</sub>)
</PRE>

<P> The effective usage for all accounts at the first tier under the root allocation is always equal to the actual usage:</P>

Account A's effective usage is therefore equal to .45. Account D's effective usage is equal to .25.

<UL>
<LI> Account B effective usage: 0.2 + ((0.45 - 0.2) * 30 / 40) = 0.3875
<LI> Account C effective usage: 0.25 + ((0.45 - 0.25) * 10 / 40) = 0.3
<LI> Account E effective usage: 0.25 + ((0.25 - 0.25) * 25 / 60) = 0.25
<LI> Account F effective usage: 0.0 + ((0.25 - 0.0) * 35 / 60) = 0.1458
</UL>

<P> The effective usage of each user is calculated using the same formula:</P>

<UL>
<LI> User 1 effective usage: 0.2 + ((0.3875 - 0.2) * 1 / 1) = 0.3875
<LI> User 2 effective usage: 0.25 + ((0.3 - 0.25) * 1 / 2) = 0.275
<LI> User 3 effective usage: 0.0 + ((0.3 - 0.0) * 1 / 2) = 0.15
<LI> User 4 effective usage: 0.25 + ((0.25 - 0.25) * 1 / 1) = 0.25
<LI> User 5 effective usage: 0.0 + ((.1458 - 0.0) * 1 / 1) = 0.1458
</UL>

<P> Using the SLURM fair-share formula,</P>

<PRE>
F = (S - U<sub>E</sub> + 1) / 2
</PRE>

<P> the fair-share factor for each user is:</P>

<UL>
<LI> User 1 fair-share factor: (.3 - .3875 + 1) / 2 = 0.45625
<LI> User 2 fair-share factor: (.05 - .275 + 1) / 2 = 0.3875
<LI> User 3 fair-share factor: (.05 - .15 + 1) / 2 = 0.45
<LI> User 4 fair-share factor: (.25 - .25 + 1) / 2 = 0.5
<LI> User 5 fair-share factor: (.35 - .1458 + 1) / 2 = 0.6021
</UL>

<P> From this example, one can see that users 1, 2, and 3 are over-serviced while user 5 is under-serviced. Even though user 3 has yet to submit a job, their fair-share factor is negatively influenced by the jobs users 1 and 2 have run.</P>

<P> Based on the fair-share factor alone, if all 5 users were to submit a job charging their respective accounts, user 5's job would be granted the highest scheduling priority.</P>
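
<P> The arithmetic of this example can be replayed with a short Python sketch. It applies the effective usage and fair-share formulas from this section to the shares and usage values of Figure 3; the function names and data layout are assumptions made for illustration and do not reflect how the plugin stores its hierarchy.</P>

```python
def effective_usage(actual_child, effective_parent, shares_child, shares_all_siblings):
    """U_E = U_Achild + (U_Eparent - U_Achild) * S_child / S_all_siblings"""
    return actual_child + (effective_parent - actual_child) * shares_child / shares_all_siblings


def fair_share_factor(normalized_share, effective):
    """F = (S - U_E + 1) / 2"""
    return (normalized_share - effective + 1) / 2


# First tier under root: effective usage equals actual usage.
eff_a = 0.20 + 0.25  # account A = usage of B + usage of C
eff_d = 0.25 + 0.00  # account D = usage of E + usage of F

# Second tier accounts: (actual usage, parent effective usage, shares, sibling shares)
account_eff = {
    "B": effective_usage(0.20, eff_a, 30, 40),
    "C": effective_usage(0.25, eff_a, 10, 40),
    "E": effective_usage(0.25, eff_d, 25, 60),
    "F": effective_usage(0.00, eff_d, 35, 60),
}

# user: (normalized share, actual usage, account, user's shares, all shares in the account)
users = {
    1: (0.30, 0.20, "B", 1, 1),
    2: (0.05, 0.25, "C", 1, 2),
    3: (0.05, 0.00, "C", 1, 2),
    4: (0.25, 0.25, "E", 1, 1),
    5: (0.35, 0.00, "F", 1, 1),
}

factors = {}
for user, (share, actual, account, own, total) in users.items():
    eff = effective_usage(actual, account_eff[account], own, total)
    factors[user] = fair_share_factor(share, eff)

# List users from highest to lowest fair-share factor.
for user, factor in sorted(factors.items(), key=lambda item: item[1], reverse=True):
    print(f"user {user}: {factor:.4f}")
```

<P> Sorting by the resulting factor places user 5 first and user 2 last, matching the conclusion drawn above.</P>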
<!-------------------------------------------------------------------------->
<a name=sprio>
<h2>The <i>sprio</i> utility</h2></a>

<P> The <i>sprio</i> command provides a summary of the five factors that make up each job's scheduling priority. While <i>squeue</i> has format options (%p and %Q) that display a job's composite priority, <i>sprio</i> can be used to display a breakdown of the priority components for each job. In addition, the <i>sprio -w</i> option displays the weights (PriorityWeightAge, PriorityWeightFairshare, etc.) for each factor as it is currently configured.</P>

<!-------------------------------------------------------------------------->
<a name=config>
<h2>Configuration</h2></a>

<P> The following slurm.conf (SLURM_CONFIG_FILE) parameters are used to configure the Multi-factor Job Priority Plugin. See the slurm.conf(5) man page for more details.</P>

<DL>
<DT> PriorityType
<DD> Set this value to "priority/multifactor" to enable the Multi-factor Job Priority Plugin. The default value for this variable is "priority/basic", which enables simple FIFO scheduling.
<DT> PriorityDecayHalfLife
<DD> This determines the contribution of historical usage to the
composite usage value. The higher the number, the longer past usage
affects fair-share. If set to 0, no decay will be applied. This is helpful if
you want to enforce hard time limits per association. If set to 0,
PriorityUsageResetPeriod must be set to some interval.
The unit is a time string (i.e. min, hr:min:00, days-hr:min:00, or
days-hr). The default value is 7-0 (7 days).
<DT> PriorityUsageResetPeriod
<DD> At this interval the usage of associations will be reset to 0.
This is used if you want to enforce hard limits of time usage per
association. If PriorityDecayHalfLife is set to 0, no decay will
happen and this is the only way to reset the usage accumulated by
running jobs. By default this is turned off. Using the
PriorityDecayHalfLife option instead is advised, to avoid a situation
in which nothing can run on your cluster; but if your scheme allows
only certain amounts of time on your system, this is the way to enforce it.
The unit is a time string (i.e. min, hr:min:00, days-hr:min:00, or
days-hr). The default value is not set (turned off).
<DT> PriorityFavorSmall
<DD> A boolean that sets the polarity of the job size factor. The
default setting is NO, which results in larger jobs having a
larger job size factor. Setting this parameter to YES means that
the smaller the job, the greater the job size factor will be.
<DT> PriorityMaxAge
<DD> Specifies the queue wait time at which the age factor maxes out.
The unit is a time string (i.e. min, hr:min:00, days-hr:min:00, or
days-hr). The default value is 7-0 (7 days).
<DT> PriorityWeightAge
<DD> An unsigned integer that scales the contribution of the age factor.
<DT> PriorityWeightFairshare
<DD> An unsigned integer that scales the contribution of the fair-share factor.
<DT> PriorityWeightJobSize
<DD> An unsigned integer that scales the contribution of the job size factor.
<DT> PriorityWeightPartition
<DD> An unsigned integer that scales the contribution of the partition factor.
<DT> PriorityWeightQOS
<DD> An unsigned integer that scales the contribution of the quality of service factor.
</DL>
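
<P> To make the time string forms listed above concrete, the following Python sketch converts each of the four documented forms (min, hr:min:00, days-hr:min:00, days-hr) to minutes. It is a hypothetical helper written for this page: it handles only those four forms, ignores the trailing seconds field, and is not the parser SLURM itself uses.</P>

```python
def time_string_to_minutes(text):
    """Convert min, hr:min:00, days-hr:min:00 or days-hr to minutes."""
    days = 0
    if "-" in text:
        # days-hr or days-hr:min:00
        day_part, text = text.split("-", 1)
        days = int(day_part)
        parts = text.split(":")
        hours = int(parts[0])
        minutes = int(parts[1]) if len(parts) > 1 else 0
    else:
        parts = text.split(":")
        if len(parts) == 1:
            # bare number of minutes
            hours, minutes = 0, int(parts[0])
        else:
            # hr:min:00
            hours, minutes = int(parts[0]), int(parts[1])
    return (days * 24 + hours) * 60 + minutes


for value in ("90", "1:30:00", "1-12:30:00", "7-0", "14-0"):
    print(value, "->", time_string_to_minutes(value), "minutes")
```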

<P> Note: As stated above, the five priority factors range from 0.0 to 1.0. As such, the PriorityWeight terms may need to be set to a high enough value (say, 1000) to resolve very tiny differences in priority factors. This is especially true with the fair-share factor, where two jobs may differ in priority by as little as .001 (or even less).</P>

<!-------------------------------------------------------------------------->
<a name=configexample>
<h2>Configuration Example</h2></a>

<P> The following are sample slurm.conf file settings for the Multi-factor Job Priority Plugin.</P>

<P> The first example is for running the plugin applying decay over time to reduce usage. Hard limits can be used in this configuration, but will have less effect since usage decays over time.</P>

<PRE>
# Activate the Multi-factor Job Priority Plugin with decay
PriorityType=priority/multifactor

PriorityDecayHalfLife=14-0

# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO

# The job's age factor reaches 1.0 after waiting in the

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor
</PRE>

<P> This example is for running the plugin with no decay on usage, thus making a reset of usage necessary.</P>

<PRE>
# Activate the Multi-factor Job Priority Plugin without decay
PriorityType=priority/multifactor

PriorityDecayHalfLife=0

# reset usage after 28 days
PriorityUsageResetPeriod=28-0

# The larger the job, the greater its job size priority.
PriorityFavorSmall=NO

# The job's age factor reaches 1.0 after waiting in the

# This next group determines the weighting of each of the
# components of the Multi-factor Job Priority Plugin.
# The default value for each of the following is 1.
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=0 # don't use the qos factor
</PRE>

<!-------------------------------------------------------------------------->
<p style="text-align:center;">Last modified 12 June 2009</p>
<!--#include virtual="footer.txt"-->