<!--#include virtual="header.txt"-->
SLURM version 1.2 and earlier supported dedication of resources
to jobs based on a simple "first come, first served" policy with backfill.
Beginning in SLURM version 1.3, priority partitions and priority-based
<I>preemption</I> are supported. Preemption is the act of suspending one or more
"low-priority" jobs to let a "high-priority" job run uninterrupted until it
completes. Preemption provides the ability to prioritize the workload on a
cluster.
The SLURM version 1.3.1 <I>sched/gang</I> plugin supports preemption.
When enabled, the plugin monitors each of the partitions in SLURM. If a new job
in a high-priority partition has been allocated resources that have already
been allocated to one or more existing jobs from lower priority partitions, the
plugin respects the partition priority and suspends the low-priority job(s).
The low-priority job(s) remain suspended until the job from the high-priority
partition completes, at which point the low-priority job(s) resume.
<H2>Configuration</H2>

There are several important configuration parameters relating to preemption:
<B>SelectType</B>: The SLURM <I>sched/gang</I> plugin supports nodes
allocated by the <I>select/linear</I> plugin and socket/core/CPU resources
allocated by the <I>select/cons_res</I> plugin.
<B>SelectTypeParameter</B>: Since resources will be overallocated
with jobs (suspended jobs remain in memory), the resource selection
plugin should be configured to track the amount of memory used by each job to
ensure that memory page swapping does not occur. When <I>select/linear</I> is
chosen, we recommend setting <I>SelectTypeParameter=CR_Memory</I>. When
<I>select/cons_res</I> is chosen, we recommend including Memory as a resource
(e.g., <I>SelectTypeParameter=CR_Core_Memory</I>).
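For example, here is a minimal <I>slurm.conf</I> sketch of these two settings
for consumable resources with memory tracking (illustrative, not a complete
configuration):
<PRE>
# Track sockets/cores/CPUs and memory as consumable resources
SelectType=select/cons_res
SelectTypeParameter=CR_Core_Memory
</PRE>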
<B>DefMemPerCPU</B>: Since job requests may not explicitly specify
a memory requirement, we also recommend configuring
<I>DefMemPerCPU</I> (default memory per allocated CPU) or
<I>DefMemPerNode</I> (default memory per allocated node).
It may also be desirable to configure
<I>MaxMemPerCPU</I> (maximum memory per allocated CPU) or
<I>MaxMemPerNode</I> (maximum memory per allocated node) in <I>slurm.conf</I>.
Users can use the <I>--mem</I> or <I>--mem-per-cpu</I> option
at job submission time to specify their memory requirements.
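As a sketch, the memory defaults and limits might be configured as follows
(the values are illustrative assumptions, in megabytes):
<PRE>
# Default and maximum memory per allocated CPU
DefMemPerCPU=512
MaxMemPerCPU=2048
</PRE>
A user could then request more than the default at submission time, for
example with <I>sbatch --mem-per-cpu=1024 -N1 ./runit.pl 300</I> (using the
<I>runit.pl</I> script from the example below).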
<B>JobAcctGatherType and JobAcctGatherFrequency</B>: The "maximum data segment
size" and "maximum virtual memory size" system limits will be configured for
each job to ensure that the job does not exceed its requested amount of memory.
If you wish to enable additional enforcement of memory limits, configure job
accounting with the <I>JobAcctGatherType</I> and <I>JobAcctGatherFrequency</I>
parameters. When accounting is enabled and a job exceeds its configured memory
limits, it will be canceled in order to prevent it from adversely affecting
other jobs sharing the same resources.
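A hedged sketch of the corresponding <I>slurm.conf</I> lines, assuming the
Linux accounting plugin and an illustrative 30-second sampling interval:
<PRE>
# Gather accounting data so memory limits can be enforced
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
</PRE>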
<B>SchedulerType</B>: Configure the <I>sched/gang</I> plugin by setting
<I>SchedulerType=sched/gang</I> in <I>slurm.conf</I>.
<B>Priority</B>: Configure each partition's <I>Priority</I> setting relative to
other partitions to control the preemptive behavior. If two jobs from two
different partitions are allocated the same resources, the job in the
partition with the greater <I>Priority</I> value will preempt the job in the
partition with the lesser <I>Priority</I> value. If the <I>Priority</I> values
of the two partitions are equal then no preemption will occur. The default
<I>Priority</I> value is 1.
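For instance, the following pair of partition definitions (the same ones used
in the example below) allows jobs in <I>hipri</I> to preempt jobs in
<I>active</I>:
<PRE>
PartitionName=active Priority=1 Default=YES Shared=NO Nodes=n[12-16]
PartitionName=hipri Priority=2 Shared=NO Nodes=n[12-16]
</PRE>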
<B>SchedulerTimeSlice</B>: The default timeslice interval is 30 seconds.
To change this duration, set <I>SchedulerTimeSlice</I> to the desired interval
(in seconds) in <I>slurm.conf</I>. For example, to set the timeslice interval
to one minute, set <I>SchedulerTimeSlice=60</I>. Short values can increase
the overhead of gang scheduling. This parameter is only relevant if timeslicing
within a partition will be configured. Preemption and timeslicing can occur at
the same time.
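Putting these two scheduler settings together in <I>slurm.conf</I>:
<PRE>
# Enable the gang scheduler with a one-minute timeslice
SchedulerType=sched/gang
SchedulerTimeSlice=60
</PRE>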
To enable preemption after making the configuration changes described above,
restart SLURM if it is already running. Any change to the plugin settings in
SLURM requires a full restart of the daemons. If you just change the partition
<I>Priority</I> or <I>Shared</I> setting, this can be updated with
<I>scontrol reconfig</I>.
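For example, a sketch that assumes the daemons are started directly (adapt
this to however SLURM is managed at your site):
<PRE>
# After changing plugin settings, fully restart the daemons:
scontrol shutdown     # stops slurmctld and the slurmd daemons
slurmctld             # restart the controller
slurmd                # restart on each compute node

# After changing only partition settings such as Priority or Shared:
scontrol reconfig
</PRE>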
<H2>Preemption Design and Operation</H2>
When enabled, the <I>sched/gang</I> plugin keeps track of the resources
allocated to all jobs. For each partition an "active bitmap" is maintained that
tracks all concurrently running jobs in the SLURM cluster. Each partition also
maintains a job list for that partition, and a list of "shadow" jobs. The
"shadow" jobs are job allocations from higher priority partitions that "cast
shadows" on the active bitmaps of the lower priority partitions. Jobs in lower
priority partitions that are caught in these "shadows" will be suspended.
Each time a new job is allocated to resources in a partition and begins running,
the <I>sched/gang</I> plugin adds a "shadow" of this job to all lower priority
partitions. The active bitmaps of these lower priority partitions are then
rebuilt, with the shadow jobs added first. Any existing jobs that were replaced
by one or more "shadow" jobs are suspended (preempted). Conversely, when a
high-priority running job completes, its "shadow" goes away and the active
bitmaps of the lower priority partitions are rebuilt to see if any suspended
jobs can be resumed.
The gang scheduler plugin is designed to be <I>reactive</I> to the resource
allocation decisions made by the "select" plugins. The "select" plugins have
been enhanced to recognize when "sched/gang" has been configured, and to factor
in the priority of each partition when selecting resources for a job. When
choosing resources for each job, the selector avoids resources that are in use
by other jobs (unless sharing has been configured, in which case it does some
load-balancing). However, when "sched/gang" is enabled, the select plugins may
choose resources that are already in use by jobs from partitions with a lower
priority setting, even when sharing is disabled in those partitions.
This leaves the gang scheduler in charge of controlling which jobs should run on
the overallocated resources. The <I>sched/gang</I> plugin suspends jobs via the
same internal functions that support <I>scontrol suspend</I> and <I>scontrol
resume</I>. A good way to observe the act of preemption is by running <I>watch
squeue</I> in a terminal window.
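The same suspend/resume mechanism can be exercised by hand, which is a useful
sanity check (job ID 485 refers to the sample job in the run below):
<PRE>
scontrol suspend 485    # manually suspend a running job
scontrol resume 485     # resume it again
watch squeue            # watch job states toggle between R and S
</PRE>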
<H2>A Simple Example</H2>

The following example is configured with <I>select/linear</I> and
<I>sched/gang</I>. This example takes place on a cluster of 5 nodes:
<PRE>
[user@n16 ~]$ <B>sinfo</B>
PARTITION AVAIL  TIMELIMIT NODES STATE NODELIST
active*      up   infinite     5  idle n[12-16]
hipri        up   infinite     5  idle n[12-16]
</PRE>
Here are the Partition settings:
<PRE>
[user@n16 ~]$ <B>grep PartitionName /shared/slurm/slurm.conf</B>
PartitionName=active Priority=1 Default=YES Shared=NO Nodes=n[12-16]
PartitionName=hipri Priority=2 Shared=NO Nodes=n[12-16]
</PRE>
The <I>runit.pl</I> script launches a simple load-generating app that runs
for the given number of seconds. Submit 5 single-node <I>runit.pl</I> jobs to
the default <I>active</I> partition:
<PRE>
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 485
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 486
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 487
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 488
[user@n16 ~]$ <B>sbatch -N1 ./runit.pl 300</B>
sbatch: Submitted batch job 489
[user@n16 ~]$ <B>squeue -Si</B>
JOBID PARTITION     NAME USER ST  TIME NODES NODELIST
  485    active runit.pl user  R  0:06     1 n12
  486    active runit.pl user  R  0:06     1 n13
  487    active runit.pl user  R  0:05     1 n14
  488    active runit.pl user  R  0:05     1 n15
  489    active runit.pl user  R  0:04     1 n16
</PRE>
Now submit a short-running 3-node job to the <I>hipri</I> partition:
<PRE>
[user@n16 ~]$ <B>sbatch -N3 -p hipri ./runit.pl 30</B>
sbatch: Submitted batch job 490
[user@n16 ~]$ <B>squeue -Si</B>
JOBID PARTITION     NAME USER ST  TIME NODES NODELIST
  485    active runit.pl user  S  0:27     1 n12
  486    active runit.pl user  S  0:27     1 n13
  487    active runit.pl user  S  0:26     1 n14
  488    active runit.pl user  R  0:29     1 n15
  489    active runit.pl user  R  0:28     1 n16
  490     hipri runit.pl user  R  0:03     3 n[12-14]
</PRE>
Job 490 in the <I>hipri</I> partition preempted jobs 485, 486, and 487 from
the <I>active</I> partition. Jobs 488 and 489 in the <I>active</I> partition
continued to run.
This state persisted until job 490 completed, at which point the preempted jobs
resumed:
<PRE>
[user@n16 ~]$ <B>squeue</B>
JOBID PARTITION     NAME USER ST  TIME NODES NODELIST
  485    active runit.pl user  R  0:30     1 n12
  486    active runit.pl user  R  0:30     1 n13
  487    active runit.pl user  R  0:29     1 n14
  488    active runit.pl user  R  0:59     1 n15
  489    active runit.pl user  R  0:58     1 n16
</PRE>
<H2><A NAME="future_work">Future Ideas</A></H2>
<B>More intelligence in the select plugins</B>: This implementation of
preemption relies on intelligent job placement by the <I>select</I> plugins. In
SLURM 1.3.1 the <I>select/linear</I> plugin had a decent preemptive placement
algorithm, but the consumable resource <I>select/cons_res</I> plugin had no
preemptive placement support. In SLURM 1.4 preemptive placement support was
added to the <I>select/cons_res</I> plugin, but there is still room for
improvement.
Take the following example:
<PRE>
[user@n8 ~]$ <B>sinfo</B>
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
active*      up   infinite     5   idle n[1-5]
hipri        up   infinite     5   idle n[1-5]
[user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B>
sbatch: Submitted batch job 17
[user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B>
sbatch: Submitted batch job 18
[user@n8 ~]$ <B>sbatch -N1 -n2 ./sleepme 60</B>
sbatch: Submitted batch job 19
[user@n8 ~]$ <B>squeue</B>
JOBID PARTITION    NAME    USER ST  TIME NODES NODELIST(REASON)
   17    active sleepme cholmes  R  0:03     1 n1
   18    active sleepme cholmes  R  0:03     1 n2
   19    active sleepme cholmes  R  0:02     1 n3
[user@n8 ~]$ <B>sbatch -N3 -n6 -p hipri ./sleepme 20</B>
sbatch: Submitted batch job 20
[user@n8 ~]$ <B>squeue -Si</B>
JOBID PARTITION    NAME    USER ST  TIME NODES NODELIST(REASON)
   17    active sleepme cholmes  S  0:16     1 n1
   18    active sleepme cholmes  S  0:16     1 n2
   19    active sleepme cholmes  S  0:15     1 n3
   20     hipri sleepme cholmes  R  0:03     3 n[1-3]
[user@n8 ~]$ <B>sinfo</B>
PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST
active*      up   infinite     3  alloc n[1-3]
active*      up   infinite     2   idle n[4-5]
hipri        up   infinite     3  alloc n[1-3]
hipri        up   infinite     2   idle n[4-5]
</PRE>
It would be better if the "hipri" job were placed on nodes n[3-5], which
would allow jobs 17 and 18 to continue running. However, a new "intelligent"
algorithm would have to include factors such as job size and required nodes in
order to support ideal placements such as this, which can quickly complicate
the design. Any and all help is welcome here!
<B>Preemptive backfill</B>: The current backfill scheduler plugin
("sched/backfill") is a nice way to make efficient use of otherwise idle
resources. But SLURM only supports one scheduler plugin at a time. Fortunately,
given the design of the new "sched/gang" plugin, there is no direct overlap
between the backfill functionality and the gang-scheduling functionality. Thus,
it's possible that these two plugins could be merged into a new
scheduler plugin that supports preemption <U>and</U> backfill. <B>NOTE:</B>
this is only an idea based on a code review, so there would likely need to be
some additional development, and plenty of testing!
<B>Requeue a preempted job</B>: In some situations it may be desirable to
requeue a low-priority job rather than suspend it. Suspending a job leaves the
job in memory. Requeuing a job involves terminating the job and resubmitting it
again. The "sched/gang" plugin would need to be modified to recognize when a job
is able to be requeued and when it can requeue a job (for preemption only, not
for timeslicing!), and to perform the requeue request.
<p style="text-align:center;">Last modified 5 December 2008</p>

<!--#include virtual="footer.txt"-->