<HTML>
<BODY BGCOLOR=white>
<PRE>
<!-- Manpage converted by man2html 3.0.1 -->
NAME
     Grid Engine Checkpointing - the  Grid  Engine  checkpointing
     mechanism and checkpointing support

DESCRIPTION
     Grid Engine supports two levels of checkpointing: the  user
     level  and  a  transparent  level provided by the operating
     system. User level checkpointing refers to applications that
     do  their  own  checkpointing  by writing restart files  at
     certain times or algorithmic steps and by  properly  proces-
     sing these restart files when restarted.
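
     The following is a minimal sketch of user level checkpoint-
     ing as described above, not part of Grid Engine itself. The
     restart file name, its JSON layout, the function names  and
     the  abort_at  parameter (used only to simulate an aborted
     run) are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_state(path):
    """Return the state from the restart file, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"step": 0, "total": 0}


def save_state(path, state):
    """Write the restart file via a temporary file and an atomic rename,
    so that an abort in the middle of a write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as fh:
        json.dump(state, fh)
    os.replace(tmp, path)


def run(path, steps, every=10, abort_at=None):
    """Sum the integers 0..steps-1, checkpointing after every `every` steps.

    abort_at is a hypothetical hook that simulates the job being aborted
    before the given step; work done since the last checkpoint is lost.
    """
    state = load_state(path)              # resume from the restart file
    for step in range(state["step"], steps):
        if step == abort_at:
            return None                   # simulated abort
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_state(path, state)       # checkpoint at an algorithmic step
    save_state(path, state)
    return state["total"]
```

     A run that is aborted and then started again with the  same
     restart  file  continues  from the last checkpoint and only
     repeats the steps performed after it.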

     Transparent checkpointing has to be provided by the  operat-
     ing system and is usually integrated in the operating system
     kernel. An example  of  a  kernel-integrated  checkpointing
     facility is the Hibernator package from Softway for SGI IRIX
     platforms.

     Checkpointing jobs need to be identified to the Grid  Engine
     system by using the -<I>ckpt</I> option of the <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> command. The
     argument to this flag refers  to  a  so-called checkpointing
     environment, which defines the attributes of the checkpoint-
     ing method to  be  used  (see  <B><A HREF="../htmlman5/checkpoint.html">checkpoint(5)</A></B>  for  details).
     Checkpointing environments are set up with the <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B> options
     -<I>ackpt</I>, -<I>dckpt</I>, -<I>mckpt</I> and -<I>sckpt</I>. The <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> option -<I>c</I> can
     be  used  to  override the <I>when</I> attribute of the referenced
     checkpointing environment.
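
     As an illustration of the options described above, the fol-
     lowing  sketch  only  assembles  a qsub command line and does
     not execute it. The environment name my_ckpt_env, the  script
     name  job.sh and the helper function are hypothetical; the x
     occasion specifier is the one mentioned under the  suspension
     condition in this page.

```python
import shlex


def build_qsub_command(ckpt_env, script, when=None):
    """Assemble (but do not run) a qsub argument list for a checkpointing job.

    ckpt_env: name of a checkpointing environment (hypothetical here)
    when:     optional occasion specifier passed to -c, overriding the
              'when' attribute of the checkpointing environment
    """
    argv = ["qsub", "-ckpt", ckpt_env]
    if when is not None:
        argv += ["-c", when]
    argv.append(script)
    return argv


cmd = build_qsub_command("my_ckpt_env", "job.sh", when="x")
print(shlex.join(cmd))  # prints: qsub -ckpt my_ckpt_env -c x job.sh
```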

     If a queue is of the type CHECKPOINTING, jobs need  to  have
     the checkpointing attribute flagged (see the -ckpt option to
     <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B>) to be permitted to run in such  a  queue.  Unlike
     regular  batch  jobs,  checkpointing  jobs are aborted under
     conditions for which batch or interactive jobs are suspended
     or even remain unaffected. These conditions are:

     <B>o</B>  Explicit suspension of the queue or job  via  <B><A HREF="../htmlman1/qmod.html">qmod(1)</A></B>  by
        the  cluster  administration  or  a  queue owner if the <I>x</I>
        occasion specifier (see <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B> -<I>c</I> and <B><A HREF="../htmlman5/checkpoint.html">checkpoint(5)</A></B>) was
        assigned to the job.

     <B>o</B>  A load average value exceeding the migration threshold as
        configured    for    the    corresponding   queues   (see
        <B><A HREF="../htmlman5/queue_conf.html">queue_conf(5)</A></B>).

     <B>o</B>  Shutdown of the Grid Engine execution daemon <B><A HREF="../htmlman8/sge_execd.html">sge_execd(8)</A></B>
        that is responsible for the checkpointing job.

     After being aborted, the jobs will migrate to  other  queues
     unless  they  were  submitted  to  one  specific queue by an
     explicit user request. The  migration  of  jobs  results  in
     dynamic  load  balancing. Note: Aborting a checkpointed job
     frees all resources (memory, swap space) which the job occu-
     pies  at  that  time.  This  is in contrast to suspended re-
     gular jobs, which still occupy swap space.

RESTRICTIONS
     When a job migrates to a queue on another machine, no  files
     are  currently  transferred  automatically  to that machine.
     This means that all files which are used throughout the  en-
     tire  job,  including  restart files, executables and scratch
     files, must be visible on that machine or be transferred ex-
     plicitly (e.g. at the beginning of the job script).

     There are also some practical limitations regarding  use  of
     disk space for transparently checkpointing jobs. Checkpoints
     of a  transparently  checkpointed  application  are  usually
     stored  in  a  checkpoint file or directory by the operating
     system. The file or directory contains all the  text,  data,
     and  stack space for the process, along with some additional
     control information. This means that jobs which use  a  very
     large  virtual  address  space  will  generate  very  large
     checkpoint files. Also, the workstations on which  the  jobs
     will actually execute may have little free disk space. Thus
     it is not always possible to transfer a  transparent  check-
     pointing  job  to  a  machine,  even though that machine is
     idle. Since large virtual memory jobs must wait for  a  ma-
     chine  that  is  both  idle and has a sufficient amount of
     free disk space, such jobs may suffer long turnaround times.

SEE ALSO
     <B><A HREF="../htmlman1/sge_intro.html">sge_intro(1)</A></B>,  <B><A HREF="../htmlman1/qconf.html">qconf(1)</A></B>,  <B><A HREF="../htmlman1/qmod.html">qmod(1)</A></B>,  <B><A HREF="../htmlman1/qsub.html">qsub(1)</A></B>,  <B><A HREF="../htmlman5/checkpoint.html">checkpoint(5)</A></B>,
     <I>Grid</I>  <I>Engine</I>  <I>Installation</I>  <I>and</I>  <I>Administration</I>  <I>Guide</I>, <I>Grid</I>
     <I>Engine</I> <I>User</I>'<I>s</I> <I>Guide</I>

COPYRIGHT
     See <B><A HREF="../htmlman1/sge_intro.html">sge_intro(1)</A></B> for a full statement of rights and  permis-
     sions.

</PRE>
<HR>
<ADDRESS>
Man(1) output converted with
<a href="http://www.oac.uci.edu/indiv/ehood/man2html.html">man2html</a>
</ADDRESS>
</BODY>
</HTML>