<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN">
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=iso-8859-1">
<META NAME="GENERATOR" CONTENT="StarOffice/5.2 (Solaris Sparc)">
<META NAME="AUTHOR" CONTENT=" ">
<META NAME="CREATED" CONTENT="20010608;17024100">
<META NAME="CHANGEDBY" CONTENT=" ">
<META NAME="CHANGED" CONTENT="20010716;18353200">
<H1><FONT SIZE=6 STYLE="font-size: 28pt">Scheduler Daemon - schedd</FONT></H1>
<P>This paper introduces the structure of the Grid Engine scheduling
daemon. For additional source code documentation on scheduling
questions, see also the description of the scheduling library
<A HREF="../../libs/sched/sched.html">libsched</A>.</P>
<H1>Communication scheme schedd-qmaster</H1>
<P>The architecture of the scheduling daemon (schedd) is fundamentally
shaped by the <I>event</I> and <I>order</I> communication scheme
between qmaster and schedd. These are the major steps in this
communication scheme:</P>
<LI><P STYLE="margin-bottom: 0in">Registration of schedd with
<P STYLE="margin-bottom: 0in">Schedd initiates communication with
qmaster by sending a special GDI (<A HREF="../../libs/gdi/gdi.html">Grid
Engine Database Interface</A>) request to the qmaster. This
registers schedd as an <I>event client</I>, causing the qmaster to
send the complete state of everything relevant for scheduling
to schedd. Later on, only state changes need to be passed to schedd.</P>
<LI><P STYLE="margin-bottom: 0in">Schedd waits for status updates of
<P STYLE="margin-bottom: 0in">A received status update in the form of
an event list (see <A HREF="../../libs/cull/cull.html">here</A>
for information on the type of lists being used) causes schedd to
update its own data and to send an acknowledgement to qmaster for the
events. It also triggers the next step.</P>
<LI><P STYLE="margin-bottom: 0in">Do a scheduling run
<P STYLE="margin-bottom: 0in">A scheduling run begins by making a
copy of the schedd internal data. This allows the data to be
modified within a scheduling run without changing the state
resulting from the event update. Then, beginning with the most
important job (see the following sections), the scheduler tries to
dispatch all pending jobs to queues and generates a so-called <I>order</I>
for each of these decisions. Besides these <I>dispatch-orders</I>,
the scheduler also prepares some other orders, e.g. to
implement the so-called suspend_thresholds (see the queue_conf(5)
<LI><P STYLE="margin-bottom: 0in">Send orders to qmaster</P>
<P STYLE="margin-bottom: 0in">The copy of the scheduler private data
is freed and the list of orders is sent to qmaster. Once
qmaster has acknowledged the orders, schedd starts over at step 2, where
schedd will receive events again, bringing schedd's data in sync with
the newest state at qmaster.</P>
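<P STYLE="margin-bottom: 0in">The four steps above can be pictured
with a small Python model. This is only an illustrative sketch and
not Grid Engine code: the <I>Qmaster</I> class, the event tuple
layout and the first-fit placement are assumptions made for the
example, and real events and orders are CULL lists exchanged via
GDI.</P>

```python
from copy import deepcopy


class Qmaster:
    """Toy stand-in for qmaster (hypothetical, for illustration only)."""

    def __init__(self, jobs, queues):
        self.jobs = dict(jobs)        # job id -> slots requested
        self.queues = dict(queues)    # queue name -> free slots
        self.received_orders = []

    def register_event_client(self):
        # Step 1: on registration, send the complete scheduling-relevant state.
        return [("full_state", deepcopy(self.jobs), deepcopy(self.queues))]

    def send_orders(self, orders):
        # Step 4: accept the orders and acknowledge them.
        self.received_orders.extend(orders)
        return True


def apply_events(state, events):
    """Step 2: bring schedd's own data in sync with the received events."""
    for kind, jobs, queues in events:
        if kind == "full_state":
            state["jobs"], state["queues"] = jobs, queues
    return state


def scheduling_run(state):
    """Step 3: work on a copy so the event-derived state stays untouched."""
    scratch = deepcopy(state)
    orders = []
    for job_id, slots in sorted(scratch["jobs"].items()):
        for queue, free in scratch["queues"].items():
            if free >= slots:
                scratch["queues"][queue] -= slots
                orders.append(("dispatch", job_id, queue))
                break
    return orders  # the scratch copy is discarded when the function returns


qmaster = Qmaster(jobs={1: 2, 2: 1, 3: 4}, queues={"q1": 2, "q2": 1})
state = apply_events({}, qmaster.register_event_client())
orders = scheduling_run(state)
acked = qmaster.send_orders(orders)
```

<P STYLE="margin-bottom: 0in">Job 3 does not fit anywhere, so no
order is generated for it, and the event-derived state is unchanged
after the run because only the copy was modified.</P>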
<H1>The default scheduler</H1>
<H2>Prioritization of jobs</H2>
<P>The order in which runnable jobs are dispatched depends on the
policy selected by the administrator. Grid Engine can be
configured to support two different policy modes, one being the
so-called SGE mode and the other called SG3E or SGEEE mode (the
additional EE stands for Enterprise Edition).</P>
<P>In SGE mode the administrator can enable or disable the <I>user_sort</I>
parameter in the scheduler configuration (see the sched_conf(5)
manual page). With user_sort disabled, the scheduler examines the
jobs in first-come, first-served (FCFS) order. With user_sort enabled,
a job of the user with the lowest number of already running jobs is
selected first, and FCFS applies only if two users have the same number
of already running jobs. In both cases the <I>-p</I> priority (see the qsub(1)
manual page) overrides FCFS and user_sort/FCFS. An efficient job
prioritization scheme is implemented in the <I>access tree</I> <SPAN STYLE="font-style: normal">module. The <I>access tree</I> (see sge_access_tree.c) can be seen as a
complex iterator returning the next job each time a new job is needed
in the dispatch cycle.</SPAN></P>
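<P>The SGE mode ordering rules can be expressed as a single sort
key. The sketch below is a simplified Python model, not the access
tree implementation: the job dictionary fields are hypothetical, the
running-job counts are a static snapshot that is not updated during
the run, and it assumes that a larger <I>-p</I> value means a higher
priority.</P>

```python
def job_order(pending, running_per_user, user_sort):
    """Return pending job ids in the order the scheduler would examine them.

    pending: list of dicts with 'id', 'user', 'priority' (-p), 'submit_time'.
    running_per_user: user name -> number of already running jobs.
    """
    def key(job):
        # With user_sort disabled every user counts as equal, leaving FCFS.
        running = running_per_user.get(job["user"], 0) if user_sort else 0
        # Assumption: larger -p values sort first, then fewer running
        # jobs of the owner, then the earlier submission time (FCFS).
        return (-job["priority"], running, job["submit_time"])

    return [job["id"] for job in sorted(pending, key=key)]


pending = [
    {"id": 1, "user": "alice", "priority": 0, "submit_time": 10},
    {"id": 2, "user": "bob", "priority": 0, "submit_time": 20},
    {"id": 3, "user": "alice", "priority": 5, "submit_time": 30},
]
running = {"alice": 3, "bob": 0}

fcfs_order = job_order(pending, running, user_sort=False)
user_sort_order = job_order(pending, running, user_sort=True)
```

<P>In this example job 3 is examined first in both cases because of
its <I>-p</I> priority. With user_sort enabled, bob's job 2 moves
ahead of alice's job 1 because bob has fewer running jobs.</P>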
<P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
In SG3E mode all jobs receive tickets according to four high-level
policies. The following is just a very brief overview. See the
<A HREF="/project/gridengine-download/SGE5_3alpha.pdf">product documentation</A>,
in particular the sections about scheduling in the Installation and
Administration Guide, for more information.</P>
<P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
<LI><P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
<P STYLE="margin-left: 0.79in; margin-bottom: 0in"><SPAN STYLE="font-weight: medium"><SPAN STYLE="font-style: normal">Individual
users and projects have relative entitlements to the overall
resources of a Grid Engine cluster, so user A might be entitled to
receive 30% of all resources while user B might be entitled to 60%
and user C to 10%. Such entitlements are relative and are meant to
be granted over time. They are relative in the sense that if user C
never uses its 10% share, then those 10% are distributed among user A
and user B in a proportional 30:60 ratio. Granting over time
means that Grid Engine tries to achieve the entitlements over a
configurable sliding window of time, e.g. over 2 months.</SPAN></SPAN> Such
entitlements are so-called long-term entitlements. In
order to grant them, Grid Engine has to look at the resource usage
which users (or projects) have accumulated in the past, and
it also has to compensate for over- or underutilization of
resources as compared to the long-term entitlements. Such
compensations are done by assigning short-term entitlements to the
users and projects which can differ from the long-term resource
<P STYLE="margin-left: 0.79in; margin-bottom: 0in; font-style: normal; font-weight: medium">
<P STYLE="margin-left: 0.79in; margin-bottom: 0in; font-style: normal; font-weight: medium">
Long-term entitlements are defined in a hierarchical fashion in a
so-called <I>share tree</I>. The share tree can represent the
organizational structure of a site which uses Grid Engine in SG3E mode. The
leaves of the share tree are always individual projects and/or
individual users. The hierarchy level above would be groups of users
or projects, then another level above further grouping until the root
of the tree which represents the entire organization having access to
<P STYLE="margin-left: 0.79in; margin-bottom: 0in; font-style: normal; font-weight: medium">
<LI><P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
<P STYLE="margin-left: 0.79in; margin-bottom: 0in; font-style: normal; font-weight: medium">
Again, entitlements are defined, but based on functional attributes
like being a certain user or being associated with a project or a
certain user group. The entitlements are static in the sense that
Grid Engine does not look at usage accumulated in the past. So
there is no need to compensate for under- or overutilization and thus
no short-term entitlements exist. Functional entitlements represent
a kind of relative priority.</P>
<P STYLE="margin-left: 0.79in; margin-bottom: 0in; font-style: normal; font-weight: medium">
<LI><P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
<P STYLE="margin-left: 0.79in; margin-bottom: 0in; font-style: normal; font-weight: medium">
Jobs can be submitted with a deadline. The
entitlement of deadline jobs grows automatically as they approach
<P STYLE="margin-left: 0.79in; margin-bottom: 0in; font-style: normal; font-weight: medium">
<LI><P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
<P STYLE="margin-left: 0.79in; margin-bottom: 0in; font-style: normal; font-weight: medium">
The automatic policies above can be overridden manually via this
policy. It can be used, for instance, to double the entitlement which a
job (or an entire project or a single user) currently receives. It
is also often used to assign a constant level of entitlement to a
user, a user group or a project.</P>
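<P STYLE="margin-bottom: 0in">The relative nature of the
entitlements in the 30%/60%/10% example above can be checked with a
few lines of Python. This is a minimal sketch under simplifying
assumptions: the function name is made up for the example, and it
covers only the proportional redistribution of an unused share, not
the sliding usage window or the short-term compensation.</P>

```python
def effective_shares(entitlements, active):
    """Redistribute the shares of inactive users proportionally.

    entitlements: user -> long-term share (any consistent unit).
    active: set of users that currently use their share.
    Returns user -> fraction of all resources, for active users only.
    """
    total = sum(share for user, share in entitlements.items() if user in active)
    if total == 0:
        return {}
    return {
        user: share / total
        for user, share in entitlements.items()
        if user in active
    }


# User C never uses its 10%, so A and B split everything in a 30:60 ratio.
shares = effective_shares({"A": 30, "B": 60, "C": 10}, active={"A", "B"})
```

<P STYLE="margin-bottom: 0in">With user C inactive, user A ends up
with one third and user B with two thirds of the resources, which is
the 30:60 ratio described above.</P>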
<P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
<P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
The entitlements assigned to a particular job by each of the four
policies above are translated into so-called tickets. Each policy can
deliver a number of tickets to a job, and all those tickets are
added together, thereby combining the four policies. The job with the
highest number of tickets is examined by the Grid Engine
scheduler first. The <I>-p</I> priority is used as a means for a user
to increase the tickets of one job and to decrease the tickets of
<P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
<P STYLE="margin-bottom: 0in"><SPAN STYLE="font-weight: medium"><SPAN STYLE="font-style: normal">Note
that the order in which the scheduler attempts to dispatch jobs (be
it in SGE or SG3E mode) is often not identical to the order in which
jobs are started. The job with the highest priority, for instance,
may not fit on any of the resources which are currently available.
Hence the second job in line may get started first. The code handling
jobs in priority order can be found in <I>dispatch_jobs()</I> in scheduler.c.</SPAN></SPAN></P>
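<P STYLE="margin-bottom: 0in">The following Python sketch shows how
summed tickets determine the examination order and why that order
can differ from the start order. It is not taken from
<I>dispatch_jobs()</I>: the job layout, the policy labels used as
dictionary keys and the single pool of free slots are assumptions
made for the example.</P>

```python
def dispatch_by_tickets(jobs, free_slots):
    """Examine jobs by descending ticket sum and start those that fit.

    jobs: job id -> {'tickets': {policy label: amount}, 'slots': needed slots}.
    Returns (examination order, list of started job ids).
    """
    order = sorted(
        jobs,
        key=lambda job_id: sum(jobs[job_id]["tickets"].values()),
        reverse=True,
    )
    started = []
    for job_id in order:
        need = jobs[job_id]["slots"]
        if need <= free_slots:      # a job that does not fit is skipped
            free_slots -= need
            started.append(job_id)
    return order, started


jobs = {
    "big": {
        "tickets": {"share": 500, "functional": 300, "deadline": 0, "override": 0},
        "slots": 8,
    },
    "small": {
        "tickets": {"share": 200, "functional": 100, "deadline": 0, "override": 100},
        "slots": 2,
    },
}
order, started = dispatch_by_tickets(jobs, free_slots=4)
```

<P STYLE="margin-bottom: 0in">Here "big" holds 800 tickets and is
examined first, but it needs 8 slots while only 4 are free, so
"small" with 400 tickets is the job that actually starts.</P>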
<P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
<H2>Selection of queues</H2>
<P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
Usually there is more than one queue suitable for a certain job. In
SGE mode, the <I>queue_sort_method</I> in sched_conf(5) decides which
queue is occupied first. If queue_sort_method is defined to be <I>load</I>,
then the queue which resides at the host with the lowest load is
selected first. The queue sequence number (seq_no in queue_conf(5))
is then considered only as a second criterion in case two hosts have
the same load index. In contrast, if <I>seqno</I> is used as
<I>queue_sort_method</I>, the queue's sequence number is the first
criterion and the load index is the second one, considered only if
two queues have the same sequence number. In both cases the
<I>load_formula</I> in sched_conf(5) specifies how the load index is
computed. In SG3E mode, a procedure similar to the <I>seqno</I> queue sort
method is in effect, which also considers running jobs when
determining the load index. In both modes, SGE and SG3E, a user can
override this selection scheme by using so-called <I>-soft</I> resource
requests (see the qsub(1) manual page). The code handling queue selection
order can be found in <I>sge_replicate_queues_suitable4job()</I> in libs/sched/sge_select_queue.c.</P>
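<P STYLE="margin-bottom: 0in">The two SGE mode sort methods differ
only in which criterion comes first. The Python sketch below models
this with a two-part sort key. It is an illustration and not the
code in sge_select_queue.c: the queue dictionaries are hypothetical,
and the load index is taken as a precomputed number instead of being
derived from <I>load_formula</I>.</P>

```python
def sort_queues(queues, queue_sort_method):
    """Return queue names in the order they would be considered.

    queues: list of dicts with 'name', 'seq_no' and 'load', where 'load'
    is the (already computed) load index of the queue's host.
    """
    if queue_sort_method == "load":
        def key(queue):
            return (queue["load"], queue["seq_no"])
    elif queue_sort_method == "seqno":
        def key(queue):
            return (queue["seq_no"], queue["load"])
    else:
        raise ValueError("unknown queue_sort_method: %r" % queue_sort_method)
    return [queue["name"] for queue in sorted(queues, key=key)]


queues = [
    {"name": "a.q", "seq_no": 2, "load": 0.10},
    {"name": "b.q", "seq_no": 1, "load": 0.75},
    {"name": "c.q", "seq_no": 1, "load": 0.10},
]
by_load = sort_queues(queues, "load")
by_seqno = sort_queues(queues, "seqno")
```

<P STYLE="margin-bottom: 0in">With <I>load</I>, a.q and c.q tie on
the load index and the sequence number puts c.q first. With
<I>seqno</I>, b.q and c.q tie on the sequence number and the lower
load puts c.q first.</P>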
<P STYLE="margin-bottom: 0in; font-weight: medium"><BR>
<H2>Dispatch strategies</H2>
<P STYLE="margin-bottom: 0in; font-style: normal; font-weight: medium">
The two sections "Prioritization of jobs" and "Selection
of queues" describe the rule set which is implemented, but they
do not describe how this is done. The function
<B>sge_replicate_queues_suitable4job() </B>implements two different
approaches to find the best suited queue(s) with the lowest effort.
For straight sequential jobs the first matching queue is fine. For
parallel jobs and for jobs with soft requests, however, the scheduler
has to do more sophisticated work before a decision can be made, which is
usually more expensive.</P>
<H1>Alternative Schedulers</H1>
<H2>Classifying Alternative Schedulers</H2>
<P STYLE="margin-bottom: 0in; font-style: normal">Alternative
schedulers can be classified into three different categories, and it is
worth being aware of the category of your scheduler before you
<P STYLE="margin-bottom: 0in"><BR>
<LI><P STYLE="margin-bottom: 0in">schedulers which do not support
the complete set of Grid Engine features</P>
<P STYLE="margin-bottom: 0in">It is conceivable to have a scheduling
algorithm which dispatches certain jobs in a more intelligent
fashion but which (at first) does not support interactive jobs, for
instance. Before a scheduler of this category can be used, the
administrator must be able to assess whether the lack of certain
features is acceptable.</P>
<LI><P STYLE="margin-bottom: 0in">schedulers that support the
complete Grid Engine feature set
<P STYLE="margin-bottom: 0in">All types of jobs, all thresholds and
everything else are supported by a scheduler of this category, but in a more
intelligent fashion. Such a scheduler could be activated
by the administrator of a site without <I>any</I> need to change
anything besides 'algorithm' in <SPAN STYLE="font-style: normal">sched_conf(5).</SPAN></P>
<LI><P STYLE="margin-bottom: 0in; font-style: normal">schedulers
that extend the Grid Engine feature set</P>
<P STYLE="margin-bottom: 0in">While this might be a kind of change
that appears straightforward and free of complications at first
sight, you should be aware that enhancing the Grid Engine scheduler
with new concepts is in many cases not possible without enhancing
the data structures with new parameters. This implies the need for
administering these new parameters and usually makes modifications to
the user interfaces necessary. To store the settings of such new
parameters to disk, it becomes necessary to change the file format
in which qmaster and other components store their state. And
changing the file format makes it necessary to think about how to
migrate from an existing Grid Engine installation.</P>
<H2>Scheduling API</H2>
<P STYLE="margin-bottom: 0in">We know that many people are eager to
write their own schedulers and hope to find easy-to-use interfaces
for this purpose. As of today, there is no interface which at the
<P STYLE="margin-bottom: 0in"><BR>
<P STYLE="margin-left: 0.79in; margin-bottom: 0in">1. offers access
to all potentially interesting information of the system,</P>
<P STYLE="margin-left: 0.79in; margin-bottom: 0in">2. has the
potential for a well-performing scheduler dispatching many thousands of
jobs in a short time frame, and which</P>
<P STYLE="margin-left: 0.79in; margin-bottom: 0in">3. stays stable
from release to release.</P>
<P STYLE="margin-bottom: 0in"><BR>
<P STYLE="margin-bottom: 0in">But it should be possible to find
compromises and to design an interface which is usable for the
majority of problems. The challenge here is to keep the interface
<H2>Scheduling Framework</H2>
<P STYLE="margin-bottom: 0in"><SPAN STYLE="font-style: normal">The
present scheduler framework supports the implementation of alternative
schedulers even though a standardized and stable scheduler interface
is not yet available. The scheduler code is organized in a way that
allows for implementing </SPAN>alternative schedulers in the same
manner as the default scheduler was implemented. <SPAN STYLE="font-style: normal">This
gives you the ability to reuse the existing framework.</SPAN></P>
<P STYLE="margin-bottom: 0in"><BR>
<P STYLE="margin-bottom: 0in"><SPAN STYLE="font-style: normal">The
first thing you have to do is add your own scheduler
function pointer to the <I>sched_funcs</I> table in sge_schedd.c.</SPAN> Two
samples have already been added there to point out the places to touch
when starting with an alternative scheduler. Schedulers implemented
in the way suggested by the samples can be activated at
schedd runtime by changing the scheduler configuration sched_conf(5)
parameter 'algorithm'.
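<P STYLE="margin-bottom: 0in">The idea of a table that maps the
'algorithm' setting to scheduler entry points can be modelled in a
few lines of Python. This is a conceptual sketch only. The real
<I>sched_funcs</I> table is a C structure in sge_schedd.c whose
layout is not reproduced here, the function bodies and signatures
below are invented for the example, and falling back to 'default'
for an unknown name is an assumption.</P>

```python
def scheduler(data):
    """Toy decision layer of the 'default' scheduler: one order per job."""
    return [("dispatch", job) for job in data]


def event_handler_default_scheduler(events):
    """Toy data layer of the 'default' scheduler: apply events, then decide."""
    return scheduler(list(events))


def my_scheduler(data):
    # 'ext_mysched'-style sample: same data model, delegates to scheduler().
    return scheduler(data)


def event_handler_my_scheduler(events):
    # 'ext_mysched2'-style sample: own entry point, delegates to the default.
    return event_handler_default_scheduler(events)


# Hypothetical Python counterpart of the table: 'algorithm' name -> entry point.
SCHED_FUNCS = {
    "default": event_handler_default_scheduler,
    "ext_mysched": lambda events: my_scheduler(list(events)),
    "ext_mysched2": event_handler_my_scheduler,
}


def run_configured(algorithm, events):
    """Look up the configured algorithm (assumed fallback: 'default')."""
    handler = SCHED_FUNCS.get(algorithm, SCHED_FUNCS["default"])
    return handler(events)


results = {name: run_configured(name, ["j1", "j2"]) for name in SCHED_FUNCS}
```

<P STYLE="margin-bottom: 0in">Because both samples simply delegate
to the default implementation, all three table entries produce the
same orders. Replacing the body of one of the sample functions is
the point at which an alternative scheduler would diverge.</P>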
<P STYLE="margin-bottom: 0in"><BR>
<P STYLE="margin-bottom: 0in">The picture below classifies the
default algorithm and the two code samples:</P>
<P STYLE="margin-bottom: 0in"><IMG SRC="schedd_layers.gif" NAME="Graphic1" ALIGN=LEFT WIDTH=745 HEIGHT=511 BORDER=0><BR CLEAR=LEFT><BR>
<LI><P STYLE="margin-bottom: 0in">Layer 0 - Common infrastructure</P>
<P STYLE="margin-bottom: 0in">The code belonging to this layer
starts with <I>main()</I><SPAN STYLE="font-style: normal"> in
sge_schedd.c. <SPAN STYLE="font-style: normal">Other important
modules are sge_c_event.c and sge_orders.c.</SPAN> This layer covers
everything which has to do with the fundamental infrastructure of
schedd, e.g. the event/order protocol with qmaster, daemonizing,
<SPAN STYLE="font-style: normal">message logging and switching
between </SPAN>different scheduling algorithms. </SPAN>
<LI><P STYLE="margin-bottom: 0in; font-style: normal">Layer 1 - Data
<P STYLE="margin-bottom: 0in; font-style: normal">The code of the
'default' scheduler in this layer starts below
<I>event_handler_default_scheduler()</I> in sge_process_events.c.
The responsibility of this layer is to keep the data necessary for a
scheduling run up to date by applying events. There are different
possible approaches to keeping this data, and our experience is that
the way this data is structured can have a crucial impact on the
performance of a scheduling algorithm, as it prepares for the
decision-making steps. Examples substantiating this experience are
<I>job categories</I> and the <I>access tree</I> in sge_category.c
and sge_access_tree.c. The conclusion is that it must be possible
to use a data model different from that of the 'default' scheduler
without breaking the interface.</P>
<P STYLE="margin-bottom: 0in; font-style: normal">There is an
example showing how to integrate alternative schedulers which are
built upon a different data model. The entry point for the
'ext_mysched2' scheduler is <I>event_handler_my_scheduler()</I>, and
functionality-wise it is identical to the default scheduler as it
simply calls <I>event_handler_default_scheduler()</I>. Run aimk with
option -DSCHEDULER_SAMPLES to activate the example. You can
implement your own layer 1 functionality by replacing the call to
<I>event_handler_default_scheduler()</I><SPAN STYLE="font-style: normal">
with your own implementation.</SPAN></P>
<LI><P STYLE="margin-bottom: 0in; font-style: normal">Layer 2 -
<P STYLE="margin-bottom: 0in; font-style: normal">The code of the
'default' scheduler in this layer can be found below <I>scheduler()</I>
in scheduler.c. It makes decisions like: Which job should be
dispatched to which resource? or How to react to exceeded suspend
thresholds? Important functions are
<I>sge_replicate_queues_suitable4job()</I> and <I>dispatch_jobs()</I>.
Certain code portions in sge_category.c and sge_access_tree.c
must also be considered part of this layer.</P>
<P STYLE="margin-bottom: 0in; font-style: normal">There is an
example showing how to integrate an alternative scheduler which
is based on the same data model as the one used by the 'default'
scheduler. The entry point to the 'ext_mysched' scheduler is
<I>my_scheduler()</I>, and again it simply calls its counterpart in
the 'default' scheduler, <I>scheduler()</I>. Run aimk with option
-DSCHEDULER_SAMPLES to activate the example. We suggest starting
with an 'ext_mysched'-like scheduler if you want to add your own
scheduler to Grid Engine.
<LI><P STYLE="margin-bottom: 0in; font-style: normal">Layer 3 -
Low-level scheduler service functions</P>
<P STYLE="margin-bottom: 0in"><SPAN STYLE="font-style: normal">There
are many functions in use in the decision layer of the 'default'
scheduler that hide all the details about jobs and queues. Examples
are <I>sge_why_not_job2queue_static(), sge_load_alarm()</I> or
<I>sort_host_list()</I><SPAN STYLE="font-style: normal">. For a
detailed discussion of these functions, have a look at <A HREF="../../libs/sched/sched.html">libsched.</A></SPAN></SPAN></P>
<P STYLE="margin-bottom: 0in"><BR>
<P STYLE="margin-bottom: 0in; font-style: normal">We expect that some
of the functionality that was implemented for the 'default' scheduler
can be reused in many other schedulers, and it should be
possible to accumulate a collection of well-performing functions
shared between different scheduler implementations. But the 'default'
scheduler also uses many functions like <I>available_slots_at_queue()
</I>that are very specific to the approach used in the 'default'
scheduler. In such cases it is better to reuse only parts of the
implementation or to use them only as a pattern for an own
<P STYLE="margin-bottom: 0in"><BR>
<P ALIGN=CENTER STYLE="text-decoration: none">Copyright 2001 Sun
Microsystems, Inc. All rights reserved.</P>