~ubuntu-branches/ubuntu/utopic/gridengine/utopic

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
'\" t
.\"___INFO__MARK_BEGIN__
.\"
.\" Copyright: 2004 by Sun Microsystems, Inc.
.\"
.\"___INFO__MARK_END__
.\"
.\" $RCSfile$     Last Update: $Date$     Revision: $Revision$
.\"
.\"
.\" Some handy macro definitions [from Tom Christensen's man(1) manual page].
.\"
.de SB		\" small and bold
.if !"\\$1"" \\s-2\\fB\&\\$1\\s0\\fR\\$2 \\$3 \\$4 \\$5
..
.\"
.de T		\" switch to typewriter font
.ft CW		\" probably want CW if you don't have TA font
..
.\"
.de TY		\" put $1 in typewriter font
.if t .T
.if n ``\c
\\$1\c
.if t .ft P
.if n \&''\c
\\$2
..
.\"
.de M		\" man page reference
\\fI\\$1\\$2\\fR\\|(\\$3)\\$4
..
.TH xxQS_NAMExx_CKPT 1 "$Date$" "xxRELxx" "xxQS_NAMExx User Commands"
.\"
.SH NAME
xxQS_NAMExx Checkpointing \- the xxQS_NAMExx checkpointing mechanism and checkpointing
support
.\"
.SH DESCRIPTION
xxQS_NAMExx
supports two levels of checkpointing: the user level and a operating
system provided transparent
level. User level checkpointing refers to applications, which do their
own checkpointing by writing restart files at certain times or
algorithmic steps and by properly processing these restart files when
restarted.
.PP
Transparent checkpointing has to be provided by the operating system and is 
usually integrated in the operating system kernel. An example for a kernel 
integrated checkpointing facility is the Hibernator package from Softway
for SGI IRIX platforms.
.PP
Checkpointing jobs need to be identified to the xxQS_NAMExx system by using the 
\fI\-ckpt\fP option of the
.M qsub 1
command. The argument to this flag refers to a so 
called checkpointing environment, which defines the attributes of the 
checkpointing method to be used (see
.M checkpoint 5
for details). 
Checkpointing environments are setup by the
.M qconf 1
options \fI\-ackpt\fP, \fI\-dckpt\fP, \fI\-mckpt\fP and \fI\-sckpt\fP. The
.M qsub 1
option \fI\-c\fP can be used to overwrite the \fIwhen\fP
attribute for the referenced checkpointing environment.
.PP
If a queue is of the type CHECKPOINTING, jobs need to have the
checkpointing attribute flagged (see the \fB\-ckpt\fP option to
.M qsub 1 )
to be permitted to run in such a queue. As opposed to the behavior for
regular batch jobs, checkpointing jobs are aborted under conditions,
for which batch or interactive jobs are suspended or even stay
unaffected. These conditions are:
.\"
.IP "\(bu" 3n
Explicit suspension of the queue or job via
.M qmod 1
by the cluster administration or a queue owner
if the \fIx\fP occasion specifier (see
.M qsub 1
\fI\-c\fP and 
.M checkpoint 5 )
was assigned to the job.
.\"
.IP "\(bu" 3n
A load average value exceeding the migration threshold as configured for
the corresponding queues (see
.M queue_conf 5 ).
.\"
.IP "\(bu" 3n
Shutdown of the xxQS_NAMExx execution daemon
.M xxqs_name_sxx_execd 8
being responsible for the checkpointing job.
.PP
.\"
After abortion, the jobs will migrate to other queues unless they were
submitted to one specific queue by an explicit user request.
The migration of jobs leads to a dynamic load balancing.
\fBNote:\fP The abortion of checkpointed jobs will free all resources
(memory, swap space) which the job occupies at that time. This is
opposed to the situation for suspended regular jobs, which still cover
swap space.
.PP
.\"
.\"
.SH RESTRICTIONS
When a job migrates to a queue on another machine at present no files
are transferred automatically to that machine. This means that all files
which are used throughout the entire job including restart files,
executables and scratch files must be visible or transferred explicitly
(e.g. at the beginning of the job script).
.PP
.\"
There are also some practical limitations regarding use of disk space
for transparently checkpointing jobs. Checkpoints of a transparently
checkpointed application are usually stored in a checkpoint file or
directory by the operating system. The file or directory contains all
the text, data, and stack space for the process, along with some
additional control information. This means jobs which use a very large
virtual address space will generate very large checkpoint files. Also
the workstations on which the jobs will actually execute may have
little free disk space. Thus it is not always possible to transfer a
transparent checkpointing job to a machine, even though that machine is
idle. Since large virtual memory jobs must wait for a machine that is
both idle, and has a sufficient amount of free disk space, such jobs
may suffer long turnaround times.
.\"
.SH "SEE ALSO"
.M xxqs_name_sxx_intro 1 ,
.M qconf 1 ,
.M qmod 1 ,
.M qsub 1 ,
.M checkpoint 5 ,
.I xxQS_NAMExx Installation and Administration Guide,
.I xxQS_NAMExx User's Guide
.\"
.SH "COPYRIGHT"
See
.M xxqs_name_sxx_intro 1
for a full statement of rights and permissions.