<!-- Manpage converted by man2html 3.0.1 -->
checkpoint - Grid Engine checkpointing environment configuration

Checkpointing is a facility to save the complete status of
an executing program or job and to restore and restart from
this so-called checkpoint at a later point in time if the
original program or job was halted, e.g. through a system

Grid Engine provides various levels of checkpointing support
(see <B><A HREF="../htmlman1/sge_ckpt.html">sge_ckpt(1)</A></B>). The checkpointing environment described
here is a means to configure the different types of
checkpointing in use for your Grid Engine cluster or parts
thereof. For that purpose you can define the operations
which have to be executed when initiating a checkpoint
generation, a migration of a checkpoint to another host or a
restart of a checkpointed application, as well as the list of
queues which are eligible for a checkpointing method.

Supporting different operating systems may easily force Grid
Engine to introduce operating system dependencies into the
checkpointing configuration file, and updates of the
supported operating system versions may lead to frequently
changing implementation details. Please refer to the
&lt;sge_root&gt;/ckpt directory for more information.

Please use the -<I>ackpt</I>, -<I>dckpt</I>, -<I>mckpt</I> or -<I>sckpt</I> options to
the <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B> command to manipulate checkpointing
environments from the command line or use the corresponding <B><A HREF="../htmlman1/qmon.html">qmon(1)</A></B>
dialogue for X-Windows based interactive configuration.

The format of a <I>checkpoint</I> file is defined as follows:

The name of the checkpointing environment. To be used in the
<B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> -ckpt switch or for the <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B> options mentioned

The type of checkpointing to be used. Currently, the following

The Hibernator kernel level checkpointing is inter-

<I>cpr</I> The SGI kernel level checkpointing is used.

<I>cray</I>-<I>ckpt</I>
The Cray kernel level checkpointing is assumed.

Grid Engine assumes that the jobs submitted with reference
to this checkpointing interface use a checkpointing
library such as provided by the public domain pack-

Grid Engine assumes that the jobs submitted with reference
to this checkpointing interface perform their own
private checkpointing method.

<I>application</I>-<I>level</I>
Uses all of the interface commands configured in the
checkpointing object, as in the case of the kernel level
checkpointing interfaces (<I>cpr</I>, <I>cray</I>-<I>ckpt</I>, etc.),
except for the restart_command (see below). The
restart_command is not used, even if it is configured;
the job script is invoked on restart instead.
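
The restart rule for the <I>application</I>-<I>level</I> interface can be illustrated with a minimal Python sketch. The function name `restart_invocation` and its arguments are hypothetical and not part of Grid Engine; only the interface names that appear above are handled, and any other value is rejected instead of guessed at.

```python
# Interface names taken from the text above; the grouping is an assumption
# made for this illustration only.
KERNEL_LEVEL = {"cpr", "cray-ckpt"}


def restart_invocation(interface, restart_command, job_script):
    """Return the command line that would be run on restart (illustrative)."""
    if interface == "application-level":
        # restart_command is ignored even if configured; the job script
        # is invoked again instead.
        return job_script
    if interface in KERNEL_LEVEL:
        return restart_command
    raise ValueError(f"interface not covered by this sketch: {interface!r}")
```

For example, `restart_invocation("application-level", "/hypothetical/restart.sh", "/hypothetical/job.sh")` yields the job script path, while the same call with `"cpr"` yields the restart command.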

A command-line type command string to be executed by Grid
Engine in order to initiate a checkpoint.

A command-line type command string to be executed by Grid
Engine during a migration of a checkpointing job from one

A command-line type command string to be executed by Grid
Engine when restarting a previously checkpointed application

A command-line type command string to be executed by Grid
Engine in order to clean up after a checkpointed application

A file system location in which checkpoints of potentially
considerable size should be stored.

A Unix signal to be sent to a job by Grid Engine to initiate
a checkpoint generation. The value for this field can either
be a symbolic name from the list produced by the -<I>l</I> option
of the <B><A HREF="../htmlman1/kill.html">kill(1)</A></B> command or an integer number which must be a
valid signal on the systems used for checkpointing.
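
As an illustration of the two accepted forms, the following Python sketch resolves either a symbolic name or an integer to a signal number. The helper name `resolve_signal` is hypothetical, and the check only covers the signals known to Python's `signal` module on the machine where it runs, which need not match the systems used for checkpointing.

```python
import signal


def resolve_signal(spec):
    """Map a name such as 'USR1' or a number such as '10' to a signal number."""
    spec = spec.strip()
    if spec.isdigit():
        # signal.Signals(n) raises ValueError if n is not a signal known
        # to Python on this platform.
        return int(signal.Signals(int(spec)))
    name = spec.upper()
    # kill -l usually prints names without the "SIG" prefix.
    if not name.startswith("SIG"):
        name = "SIG" + name
    try:
        return int(signal.Signals[name])
    except KeyError:
        raise ValueError(f"unknown signal name: {spec!r}") from None
```

A value that is neither a known name nor a valid number raises `ValueError`, which is roughly the check an administrator would want to perform before storing the field.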

The points in time when checkpoints are expected to be
generated. Valid values for this parameter are composed of the
letters <I>s</I>, <I>m</I>, <I>x</I> and <I>r</I> and any combination thereof without
any separating character in between. The same letters are
allowed for the -<I>c</I> option of the <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> command, which
overrides the definitions in the checkpointing environment
in use. The meaning of the letters is defined as follows:

<I>s</I> A job is checkpointed, aborted and if possible migrated
if the corresponding <B><A HREF="../htmlman8/sge_execd.html">sge_execd(8)</A></B> is shut down on the

<I>m</I> Checkpoints are generated periodically at the
<I>min</I>_<I>cpu</I>_<I>interval</I> interval defined by the queue (see
<B><A HREF="../htmlman5/queue_conf.html">queue_conf(5)</A></B>) in which a job executes.

<I>x</I> A job is checkpointed, aborted and if possible migrated
as soon as the job gets suspended (manually as well as

<I>r</I> A job will be rescheduled (not checkpointed) when the
host on which the job currently runs goes into unknown
state and the time interval <I>reschedule</I>_<I>unknown</I> (see
<B><A HREF="../htmlman5/sge_conf.html">sge_conf(5)</A></B>) defined in the global/local cluster
configuration is exceeded.
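
The letter combinations described above can be checked with a short Python sketch. The function `parse_when` and the summary strings are hypothetical and only paraphrase the list above; how Grid Engine itself treats repeated letters or an empty value is not stated here, so the sketch simply drops repeats and returns an empty list for an empty value.

```python
# Short paraphrases of the letter meanings listed above.
WHEN_MEANINGS = {
    "s": "checkpoint when sge_execd is shut down",
    "m": "checkpoint periodically at min_cpu_interval",
    "x": "checkpoint when the job is suspended",
    "r": "reschedule when the host goes into unknown state",
}


def parse_when(value):
    """Return the distinct letters of a value such as 'sx', in input order."""
    letters = []
    for letter in value.strip():
        if letter not in WHEN_MEANINGS:
            # Separators such as ',' or ' ' are rejected as well.
            raise ValueError(f"invalid letter {letter!r} in {value!r}")
        if letter not in letters:
            letters.append(letter)
    return letters


if __name__ == "__main__":
    for letter in parse_when("sx"):
        print(letter, "-", WHEN_MEANINGS[letter])
```

A value such as `s,m` is rejected because no separating character is allowed between the letters.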

Note that the functionality of any checkpointing, migration
or restart procedures provided by default with the Grid
Engine distribution, as well as the way they are invoked
in the <I>ckpt</I>_<I>command</I>, <I>migr</I>_<I>command</I> or <I>restart</I>_<I>command</I>
parameters of any default checkpointing environments, should
not be changed. Otherwise the functionality remains the full
responsibility of the administrator configuring the
checkpointing environment. Grid Engine will just invoke these
procedures and evaluate their exit status. If the procedures
do not perform their tasks properly or are not invoked in a
proper fashion, the checkpointing mechanism may behave
unexpectedly; Grid Engine has no means to detect this.
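
Since only the exit status is evaluated, it can be useful to test a procedure by hand before configuring it. The Python sketch below runs a command line and returns its exit status; the helper `run_procedure` and the timeout value are assumptions for illustration, and which exit codes Grid Engine treats as success or failure is not described in this page.

```python
import shlex
import subprocess


def run_procedure(command_line, timeout=60.0):
    """Run a command line and return its exit status (illustrative only)."""
    # Split the string into arguments the way a POSIX shell would,
    # without actually involving a shell.
    args = shlex.split(command_line)
    completed = subprocess.run(args, timeout=timeout)
    return completed.returncode
```

A procedure that exits with a non-zero status while appearing to work, or with zero while failing to write its checkpoint, is exactly the situation that cannot be detected from the outside.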

<B><A HREF="../htmlman1/sge_intro.html">sge_intro(1)</A></B>, <B><A HREF="../htmlman1/sge_ckpt.html">sge_ckpt(1)</A></B>, <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B>, <B><A HREF="../htmlman1/qmod.html">qmod(1)</A></B>, <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B>,
<B><A HREF="../htmlman8/sge_execd.html">sge_execd(8)</A></B>.

See <B><A HREF="../htmlman1/sge_intro.html">sge_intro(1)</A></B> for a full statement of rights and permissions

Man(1) output converted with
<a href="http://www.oac.uci.edu/indiv/ehood/man2html.html">man2html</a>