<!-- Manpage converted by man2html 3.0.1 -->
Grid Engine Checkpointing - the Grid Engine checkpointing
mechanism and checkpointing support

Grid Engine supports two levels of checkpointing: the user
level and a transparent level provided by the operating system.
User level checkpointing refers to applications that do
their own checkpointing by writing restart files at certain
times or algorithmic steps and by properly processing these
restart files when restarted.
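The user level scheme described above can be sketched as a small job script. This is only an illustration under stated assumptions: the restart file name, the `run_steps` helper and its `STOP_AFTER` argument (which simulates an abort) are hypothetical and not part of Grid Engine.

```shell
#!/bin/sh
# Hypothetical restart file; a real job would keep it on storage that is
# visible from every execution host the job may run on.
RESTART_FILE="${RESTART_FILE:-./restart.state}"

# run_steps TOTAL [STOP_AFTER]
# Runs steps up to TOTAL, resuming after the last step recorded in the
# restart file. STOP_AFTER (optional) simulates an abort after that step.
run_steps() {
    total=$1
    stop_after=${2:-0}
    step=0
    # Resume from the last completed step if a restart file exists.
    if [ -f "$RESTART_FILE" ]; then
        step=$(cat "$RESTART_FILE")
    fi
    while [ "$step" -lt "$total" ]; do
        step=$((step + 1))
        # ... the real work for this step would go here ...
        # Write to a temporary file and rename it, so that an abort does
        # not leave a half-written restart file behind.
        echo "$step" > "$RESTART_FILE.tmp"
        mv "$RESTART_FILE.tmp" "$RESTART_FILE"
        if [ "$step" -eq "$stop_after" ]; then
            return 1    # simulated abort
        fi
    done
    # All steps are done; the restart file is no longer needed.
    rm -f "$RESTART_FILE"
    return 0
}
```

Calling `run_steps 5 3` stops after step 3 and leaves the restart file in place; a later `run_steps 5` continues with step 4 instead of starting over.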

Transparent checkpointing has to be provided by the operating
system and is usually integrated into the operating system
kernel. An example of a kernel-integrated checkpointing
facility is the Hibernator package from Softway for SGI IRIX.

Checkpointing jobs need to be identified to the Grid Engine
system by using the -<I>ckpt</I> option of the <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> command. The
argument to this flag refers to a so-called checkpointing
environment, which defines the attributes of the checkpointing
method to be used (see <B><A HREF="../htmlman5/checkpoint.html">checkpoint(5)</A></B> for details).
Checkpointing environments are set up by the <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B> options
-<I>ackpt</I>, -<I>dckpt</I>, -<I>mckpt</I> and -<I>sckpt</I>. The <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> option -<I>c</I> can
be used to override the <I>when</I> attribute for the referenced
checkpointing environment.
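As a rough illustration of the submission options above, the following helper assembles a qsub command line for a checkpointing job and prints it without running it. The helper name `ckpt_submit_cmd`, the environment name `my_ckpt` and the script name `job.sh` are assumptions made for this sketch; only the -ckpt and -c options come from the text.

```shell
#!/bin/sh
# ckpt_submit_cmd ENV_NAME SCRIPT [WHEN]
# Prints (does not execute) a qsub command line for a checkpointing job.
# ENV_NAME is the name of an existing checkpointing environment.
# WHEN, if given, is passed to -c and overrides the "when" attribute of
# that environment for this job only.
ckpt_submit_cmd() {
    env_name=$1
    script=$2
    when=${3:-}
    if [ -n "$when" ]; then
        printf 'qsub -ckpt %s -c %s %s\n' "$env_name" "$when" "$script"
    else
        printf 'qsub -ckpt %s %s\n' "$env_name" "$script"
    fi
}
```

For example, `ckpt_submit_cmd my_ckpt job.sh x` prints `qsub -ckpt my_ckpt -c x job.sh`. The valid occasion specifiers are listed in qsub(1) and checkpoint(5).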

If a queue is of the type CHECKPOINTING, jobs need to have
the checkpointing attribute flagged (see the -ckpt option to
<B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B>) to be permitted to run in such a queue. As opposed
to the behavior for regular batch jobs, checkpointing jobs
are aborted under conditions under which batch or interactive
jobs are suspended or even stay unaffected. These conditions
are:

<B>o</B> Explicit suspension of the queue or job via <B><A HREF="../htmlman1/qmod.html">qmod(1)</A></B> by
the cluster administration or a queue owner if the <I>x</I>
occasion specifier (see <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> -<I>c</I> and <B><A HREF="../htmlman5/checkpoint.html">checkpoint(5)</A></B>) was

<B>o</B> A load average value exceeding the migration threshold as
configured for the corresponding queues (see
<B><A HREF="../htmlman5/queue_conf.html">queue_conf(5)</A></B>).

<B>o</B> Shutdown of the Grid Engine execution daemon <B><A HREF="../htmlman8/sge_execd.html">sge_execd(8)</A></B>
that is responsible for the checkpointing job.

After abortion, the jobs will migrate to other queues unless
they were submitted to one specific queue by an explicit
user request. The migration of jobs leads to dynamic load
balancing. Note: The abortion of checkpointed jobs will
free all resources (memory, swap space) which the job occupies
at that time. This is in contrast to the situation for
suspended regular jobs, which still occupy swap space.

When a job migrates to a queue on another machine, no files
are currently transferred automatically to that machine. This
means that all files which are used throughout the entire
job, including restart files, executables and scratch files,
must be visible on that machine or transferred explicitly
(e.g. at the beginning of the job script).
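One way to handle the explicit transfer at the beginning of a job script is sketched below. It assumes a directory that is visible from every execution host; the helper name `stage_file` and all paths are hypothetical.

```shell
#!/bin/sh
# stage_file SHARED_DIR NAME WORK_DIR
# Makes NAME available in WORK_DIR on the current host. The file is
# copied from SHARED_DIR unless WORK_DIR already holds a copy.
# Returns non-zero if the file is visible in neither place.
stage_file() {
    shared_dir=$1
    name=$2
    work_dir=$3
    mkdir -p "$work_dir" || return 1
    # Nothing to do if the file is already present on this host.
    if [ -e "$work_dir/$name" ]; then
        return 0
    fi
    if [ ! -e "$shared_dir/$name" ]; then
        echo "stage_file: $shared_dir/$name is not visible on this host" >&2
        return 1
    fi
    # -p keeps the modification time and permissions of the original.
    cp -p "$shared_dir/$name" "$work_dir/$name"
}
```

A job script could call `stage_file` once for each restart file, executable and input file before starting the actual computation, and exit early if any call fails.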

There are also some practical limitations regarding the use of
disk space for transparently checkpointing jobs. Checkpoints
of a transparently checkpointed application are usually
stored in a checkpoint file or directory by the operating
system. The file or directory contains all the text, data,
and stack space for the process, along with some additional
control information. This means jobs which use a very large
virtual address space will generate very large checkpoint
files. Also, the workstations on which the jobs will actually
execute may have little free disk space. Thus it is not
always possible to transfer a transparent checkpointing job
to a machine, even though that machine is idle. Since large
virtual memory jobs must wait for a machine that is both
idle and has a sufficient amount of free disk space, such
jobs may suffer long turnaround times.
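The disk space constraint above can be checked by hand before choosing a checkpoint directory. The sketch below compares the free space reported by the POSIX `df -P -k` output with an estimated checkpoint size; the helper names and the idea of supplying the estimate in kilobytes are assumptions for illustration, not something Grid Engine does itself.

```shell
#!/bin/sh
# free_kb DIR
# Prints the free space, in kilobytes, of the file system holding DIR.
# With -P, df prints one header line and one data line whose fourth
# field is the available space.
free_kb() {
    df -P -k "$1" | awk 'NR == 2 { print $4 }'
}

# has_room_for_checkpoint DIR REQUIRED_KB
# Succeeds if the file system holding DIR has at least REQUIRED_KB free.
# REQUIRED_KB is an estimate supplied by the caller, for example the
# virtual memory size of the job.
has_room_for_checkpoint() {
    avail=$(free_kb "$1")
    # An empty result means df could not report on DIR.
    [ -n "$avail" ] && [ "$avail" -ge "$2" ]
}
```

For example, `has_room_for_checkpoint /var/tmp 2097152` succeeds only if about 2 GB are free on the file system holding `/var/tmp`.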

<B><A HREF="../htmlman1/sge_intro.html">sge_intro(1)</A></B>, <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B>, <B><A HREF="../htmlman1/qmod.html">qmod(1)</A></B>, <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B>, <B><A HREF="../htmlman5/checkpoint.html">checkpoint(5)</A></B>,
<I>Grid</I> <I>Engine</I> <I>Installation</I> <I>and</I> <I>Administration</I> <I>Guide</I>, <I>Grid</I>
<I>Engine</I> <I>User</I>'<I>s</I> <I>Guide</I>

See <B><A HREF="../htmlman1/sge_intro.html">sge_intro(1)</A></B> for a full statement of rights and permissions.