<HTML>
<BODY BGCOLOR=white>
<PRE>
<!-- Manpage converted by man2html 3.0.1 -->
NAME
Grid Engine Checkpointing - the Grid Engine checkpointing
mechanism and checkpointing support
DESCRIPTION
Grid Engine supports two levels of checkpointing: the user
level and an operating-system-provided transparent level.
User-level checkpointing refers to applications that do
their own checkpointing by writing restart files at certain
times or algorithmic steps and by properly processing these
restart files when restarted.
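The following is a minimal sketch of user-level checkpointing, not
part of Grid Engine itself. The function names, the JSON restart
file format and the stop_after parameter (which only simulates an
abort) are assumptions made for illustration.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the saved state, or a fresh one if no restart file exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"step": 0, "total": 0}


def save_state(path, state):
    """Write the restart file via a temporary file and an atomic rename,
    so that an abort never leaves a half-written restart file behind."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, n_steps, stop_after=None):
    """Sum the integers 0..n_steps-1, checkpointing after every step.

    stop_after is hypothetical: it ends the run early after that many
    steps to imitate the job being aborted.
    """
    state = load_state(path)
    for done, step in enumerate(range(state["step"], n_steps)):
        if stop_after is not None and done == stop_after:
            return state
        state["total"] += step
        state["step"] = step + 1
        save_state(path, state)
    return state
```

Calling run() a second time with the same restart file resumes at
the saved step instead of starting over, which is the behavior a
migrated job relies on.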
Transparent checkpointing has to be provided by the operating
system and is usually integrated in the operating system
kernel. An example of a kernel-integrated checkpointing
facility is the Hibernator package from Softway for SGI IRIX
platforms.
Checkpointing jobs need to be identified to the Grid Engine
system by using the -<I>ckpt</I> option of the <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> command. The
argument to this flag refers to a so-called checkpointing
environment, which defines the attributes of the checkpointing
method to be used (see <B><A HREF="../htmlman5/checkpoint.html">checkpoint(5)</A></B> for details).
Checkpointing environments are set up with the <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B> options
-<I>ackpt</I>, -<I>dckpt</I>, -<I>mckpt</I> and -<I>sckpt</I>. The <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> option -<I>c</I> can
be used to override the <I>when</I> attribute of the referenced
checkpointing environment.
If a queue is of the type CHECKPOINTING, jobs need to have
the checkpointing attribute flagged (see the -ckpt option to
<B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B>) to be permitted to run in such a queue. In contrast
to regular batch jobs, checkpointing jobs
are aborted under conditions in which batch or interactive
jobs are suspended or even remain unaffected. These conditions
are:
<B>o</B> Explicit suspension of the queue or job via <B><A HREF="../htmlman1/qmod.html">qmod(1)</A></B> by
the cluster administration or a queue owner if the <I>x</I>
occasion specifier (see <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> -<I>c</I> and <B><A HREF="../htmlman5/checkpoint.html">checkpoint(5)</A></B>) was
assigned to the job.
<B>o</B> A load average value exceeding the migration threshold as
configured for the corresponding queues (see
<B><A HREF="../htmlman5/queue_conf.html">queue_conf(5)</A></B>).
<B>o</B> Shutdown of the Grid Engine execution daemon <B><A HREF="../htmlman8/sge_execd.html">sge_execd(8)</A></B>
that is responsible for the checkpointing job.
After being aborted, the jobs migrate to other queues unless
they were submitted to one specific queue by an explicit
user request. The migration of jobs leads to dynamic load
balancing. Note: Aborting a checkpointed job frees
all resources (memory, swap space) that the job occupies
at that time. This is in contrast to the situation for
suspended regular jobs, which still occupy swap space.
RESTRICTIONS
When a job migrates to a queue on another machine, at present
no files are transferred automatically to that machine. This
means that all files used throughout the entire
job, including restart files, executables and scratch files,
must be visible on that machine or transferred explicitly (e.g. at the
beginning of the job script).
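As an illustration of checking file visibility at the start of a
job, the sketch below reports which required files cannot be read
from the current host. The helper names are hypothetical and not
part of Grid Engine; the list of files is whatever the job needs.

```python
import os


def missing_files(paths):
    """Return the paths that cannot be read from the current host."""
    return [p for p in paths if not os.access(p, os.R_OK)]


def require_files(paths):
    """Fail early, before any work is done, if a needed file is not visible."""
    missing = missing_files(paths)
    if missing:
        raise FileNotFoundError(
            "not visible on this host: " + ", ".join(missing)
        )
```

A job would call require_files() with its restart files, executables
and input files before doing any work, so that a migration to a host
without those files fails immediately rather than partway through.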
There are also some practical limitations regarding use of
disk space for transparently checkpointing jobs. Checkpoints
of a transparently checkpointed application are usually
stored in a checkpoint file or directory by the operating
system. The file or directory contains all the text, data,
and stack space for the process, along with some additional
control information. This means that jobs which use a very large
virtual address space will generate very large checkpoint
files. Also, the workstations on which the jobs will actually
execute may have little free disk space. Thus it is not
always possible to transfer a transparent checkpointing job
to a machine, even though that machine is idle. Since large
virtual memory jobs must wait for a machine that is both
idle and has a sufficient amount of free disk space, such
jobs may suffer long turnaround times.
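The disk space condition above can be sketched as a simple
comparison. This is only an illustration: the function name and the
checkpoint_bytes estimate are assumptions, and Grid Engine does not
provide such a function.

```python
import shutil


def has_room_for_checkpoint(directory, checkpoint_bytes, free_bytes=None):
    """Return True if `directory` has at least `checkpoint_bytes` free.

    free_bytes may be passed explicitly (for example in tests);
    otherwise it is read from the file system holding `directory`.
    """
    if free_bytes is None:
        free_bytes = shutil.disk_usage(directory).free
    return free_bytes >= checkpoint_bytes
```

Since a checkpoint holds the text, data and stack space of the
process, checkpoint_bytes would be roughly the size of the job's
virtual address space.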
SEE ALSO
<B><A HREF="../htmlman1/sge_intro.html">sge_intro(1)</A></B>, <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B>, <B><A HREF="../htmlman1/qmod.html">qmod(1)</A></B>, <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B>, <B><A HREF="../htmlman5/checkpoint.html">checkpoint(5)</A></B>,
<I>Grid</I> <I>Engine</I> <I>Installation</I> <I>and</I> <I>Administration</I> <I>Guide</I>, <I>Grid</I>
<I>Engine</I> <I>User</I>'<I>s</I> <I>Guide</I>
COPYRIGHT
See <B><A HREF="../htmlman1/sge_intro.html">sge_intro(1)</A></B> for a full statement of rights and
permissions.
</PRE>
<HR>
<ADDRESS>
Man(1) output converted with
<a href="http://www.oac.uci.edu/indiv/ehood/man2html.html">man2html</a>
</ADDRESS>
</BODY>
</HTML>