<!--#include virtual="header.txt"-->

<h1><a name="top">Frequently Asked Questions</a></h1>
<h2>For Users</h2>
<ol>
<li><a href="#comp">Why is my job/node in COMPLETING state?</a></li>
<li><a href="#rlimit">Why are my resource limits not propagated?</a></li>
<li><a href="#pending">Why is my job not running?</a></li>
<li><a href="#sharing">Why does the srun --overcommit option not permit multiple jobs 
to run on nodes?</a></li>
<li><a href="#purge">Why is my job killed prematurely?</a></li>
<li><a href="#opts">Why are my srun options ignored?</a></li>
<li><a href="#backfill">Why is the SLURM backfill scheduler not starting my 
job?</a></li>
<li><a href="#steps">How can I run multiple jobs from within a single script?</a></li>
<li><a href="#orphan">Why do I have job steps when my job has already COMPLETED?</a></li>
<li><a href="#multi_batch">How can I run a job within an existing job allocation?</a></li>
<li><a href="#user_env">How does SLURM establish the environment for my job?</a></li>
<li><a href="#prompt">How can I get shell prompts in interactive mode?</a></li>
<li><a href="#batch_out">How can I get the task ID in the output or error file
name for a batch job?</a></li>
<li><a href="#parallel_make">Can the <i>make</i> command utilize the resources 
allocated to a SLURM job?</a></li>
<li><a href="#terminal">Can tasks be launched with a remote terminal?</a></li>
<li><a href="#force">What does &quot;srun: Force Terminated job&quot; indicate?</a></li>
<li><a href="#early_exit">What does this mean: &quot;srun: First task exited 
30s ago&quot; followed by &quot;srun Job Failed&quot;?</a></li>
<li><a href="#memlock">Why is my MPI job  failing due to the locked memory 
(memlock) limit being too low?</a></li>
<li><a href="#inactive">Why is my batch job that launches no job steps being 
killed?</a></li>
<li><a href="#arbitrary">How do I run specific tasks on certain nodes
in my allocation?</a></li> 
<li><a href="#hold">How can I temporarily prevent a job from running 
(e.g. place it into a <i>hold</i> state)?</a></li>
</ol>

<h2>For Administrators</h2>
<ol>
<li><a href="#suspend">How is job suspend/resume useful?</a></li>
<li><a href="#fast_schedule">How can I configure SLURM to use the resources actually 
found on a node rather than what is defined in <i>slurm.conf</i>?</a></li>
<li><a href="#return_to_service">Why is a node shown in state DOWN when the node 
has registered for service?</a></li>
<li><a href="#down_node">What happens when a node crashes?</a></li>
<li><a href="#multi_job">How can I control the execution of multiple 
jobs per node?</a></li>
<li><a href="#inc_plugin">When the SLURM daemon starts, it prints
&quot;cannot resolve X plugin operations&quot; and exits. What does this mean?</a></li>
<li><a href="#sigpipe">Why are user tasks intermittently dying at launch with SIGPIPE
error messages?</a></li>
<li><a href="#maint_time">How can I dry up the workload for a maintenance 
period?</a></li>
<li><a href="#pam">How can PAM be used to control a user's limits on or 
access to compute nodes?</a></li>
<li><a href="#time">Why are jobs allocated nodes and then unable to initiate 
programs on some nodes?</a></li>
<li><a href="#ping"> Why does <i>slurmctld</i> log that some nodes
are not responding even if they are not in any partition?</a></li>
<li><a href="#controller"> How should I relocated the primary or backup
controller?</a></li>
<li><a href="#multi_slurm">Can multiple SLURM systems be run in 
parallel for testing purposes?</a></li>
<li><a href="#multi_slurmd">Can slurm emulate a larger cluster?</a></li>
<li><a href="#extra_procs">Can SLURM emulate nodes with more 
resources than physically exist on the node?</a></li>
<li><a href="#credential_replayed">What does a 
&quot;credential replayed&quot; error in the <i>SlurmdLogFile</i> 
indicate?</a></li>
<li><a href="#large_time">What does 
&quot;Warning: Note very large processing time&quot; 
in the <i>SlurmctldLogFile</i> indicate?</a></li>
<li><a href="#lightweight_core">How can I add support for lightweight
core files?</a></li>
<li><a href="#limit_propagation">Is resource limit propagation
useful on a homogeneous cluster?</a></li>
<li<a href="#clock">Do I need to maintain synchronized clocks 
on the cluster?</a></li>
<li><a href="#cred_invalid">Why are &quot;Invalid job credential&quot; errors 
generated?</a></li>
<li><a href="#cred_replay">Why are 
&quot;Task launch failed on node ... Job credential replayed&quot; 
errors generated?</a></li>
<li><a href="#globus">Can SLURM be used with Globus?</li>
<li><a href="#time_format">Can SLURM time output format include the year?</li>
<li><a href="#file_limit">What causes the error 
&quot;Unable to accept new connection: Too many open files&quot;?</li>
<li><a href="#slurmd_log">Why does the setting of <i>SlurmdDebug</i> fail 
to log job step information at the appropriate level?</li>
<li><a href="#rpm">Why isn't the auth_none.so (or other file) in a 
SLURM RPM?</li>
<li><a href="#slurmdbd">Why should I use the slurmdbd instead of the
regular database plugins?</li>
<li><a href="#debug">How can I build SLURM with debugging symbols?</li>
<li><a href="#state_preserve">How can I easily preserve drained node 
information between major SLURM updates?</li>
<li><a href="#health_check">Why doesn't the <i>HealthCheckProgram</i>
execute on DOWN nodes?</li>
<li><a href="#batch_lost">What is the meaning of the error 
&quot;Batch JobId=# missing from master node, killing it&quot;?</a></li>
<li><a href="#accept_again">What does the messsage
&quot;srun: error: Unable to accept connection: Resources temporarily unavailable&quot; 
indicate?</a></li>
<li><a href="#task_prolog">How could I automatically print a job's 
SLURM job ID to its standard output?</li>
<li><a href="#moab_start">I run SLURM with the Moab or Maui scheduler.
How can I start a job under SLURM without the scheduler?</li>
<li><a href="#orphan_procs">Why are user processes and <i>srun</i>
running even though the job is supposed to be completed?</li>
<li><a href="#slurmd_oom">How can I prevent the <i>slurmd</i> and
<i>slurmstepd</i> daemons from being killed when a node's memory 
is exhausted?</li>
<li><a href="#ubuntu">I see my host of my calling node as 127.0.1.1
    instead of the correct ip address.  Why is that?</a></li>
</ol>


<h2>For Users</h2>
<p><a name="comp"><b>1. Why is my job/node in COMPLETING state?</b></a><br>
When a job is terminating, both the job and its nodes enter the COMPLETING state. 
As the SLURM daemon on each node determines that all processes associated with 
the job have terminated, that node changes state to IDLE or some other appropriate 
state for use by other jobs. 
When every node allocated to a job has determined that all processes associated 
with it have terminated, the job changes state to COMPLETED or some other 
appropriate state (e.g. FAILED). 
Normally, this happens within a second. 
However, if the job has processes that cannot be terminated with a SIGKILL
signal, the job and one or more nodes can remain in the COMPLETING state 
for an extended period of time. 
This may indicate processes hung waiting for a core file 
to complete I/O, or an operating system failure. 
If this state persists, the system administrator should check for processes 
associated with the job that cannot be terminated, then use the 
<span class="commandline">scontrol</span> command to change the node's 
state to DOWN (e.g. &quot;scontrol update NodeName=<i>name</i> State=DOWN Reason=hung_completing&quot;), 
reboot the node, then reset the node's state to IDLE 
(e.g. &quot;scontrol update NodeName=<i>name</i> State=RESUME&quot;).
Note that setting the node DOWN will terminate all running or suspended 
jobs associated with that node. 
An alternative is to set the node's state to DRAIN until all jobs 
associated with it terminate before setting it DOWN and re-booting.</p>
<p>Note that SLURM has two configuration parameters that may be used to 
automate some of this process.
<i>UnkillableStepProgram</i> specifies a program to execute when 
non-killable processes are identified.
<i>UnkillableStepTimeout</i> specifies how long to wait for processes
to terminate. 
See the "man slurm.conf" for more information about these parameters.</p>

<p><a name="rlimit"><b>2. Why are my resource limits not propagated?</b></a><br>
When the <span class="commandline">srun</span> command executes, it captures the 
resource limits in effect at submit time. These limits are propagated to the allocated 
nodes before initiating the user's job. The SLURM daemon running on that node then 
tries to establish identical resource limits for the job being initiated. 
There are several possible reasons for not being able to establish those 
resource limits.
<ul> 
<li>The hard resource limits applied to SLURM's slurmd daemon are lower 
than the user's soft resource limits on the submit host. Typically 
the slurmd daemon is initiated by the init daemon with the operating 
system default limits. This may be addressed either through use of the 
ulimit command in the /etc/sysconfig/slurm file (see the example at the 
end of this answer) or enabling
<a href="#pam">PAM in SLURM</a>.</li>
<li>The user's hard resource limits on the allocated node are lower than 
the same user's soft resource limits on the node from which the 
job was submitted. It is recommended that the system administrator 
establish uniform hard resource limits for users on all nodes 
within a cluster to prevent this from occurring.</li>
</ul></p>
<p>NOTE: This may produce the error message &quot;Can't propagate RLIMIT_...&quot;.
The error message is printed only if the user explicitly specifies that
the resource limit should be propagated or the srun command is running
with verbose logging of actions from the slurmd daemon (e.g. "srun -d6 ...").</p>
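<p>For the first case above, one sketch of raising the <i>slurmd</i> limits is 
shown below, assuming your init script sources <i>/etc/sysconfig/slurm</i> 
(the specific values are only examples):</p>
<pre>
# /etc/sysconfig/slurm
# Raise the limits inherited by slurmd and, in turn, by spawned jobs.
ulimit -l unlimited    # locked memory
ulimit -n 8192         # open files
</pre>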

<p><a name="pending"><b>3. Why is my job not running?</b></a><br>
The answer to this question depends upon the scheduler used by SLURM. Executing 
the command</p>
<blockquote> 
<p> <span class="commandline">scontrol show config | grep SchedulerType</span></p>
</blockquote>
<p> will supply this information. If the scheduler type is <b>builtin</b>, then 
jobs will be executed in the order of submission for a given partition. Even if 
resources are available to initiate your job immediately, it will be deferred 
until no previously submitted job is pending. If the scheduler type is <b>backfill</b>, 
then jobs will generally be executed in the order of submission for a given partition 
with one exception: later submitted jobs will be initiated early if doing so does 
not delay the expected execution time of an earlier submitted job. In order for 
backfill scheduling to be effective, users' jobs should specify reasonable time
limits. If jobs do not specify time limits, then all jobs will receive the same 
time limit (that associated with the partition), and the ability to backfill schedule 
jobs will be limited. The backfill scheduler does not alter job specifications 
of required or excluded nodes, so jobs which specify nodes will substantially 
reduce the effectiveness of backfill scheduling. See the <a href="#backfill">
backfill</a> section for more details. If the scheduler type is <b>wiki</b>, 
this represents 
<a href="http://www.clusterresources.com/pages/products/maui-cluster-scheduler.php">
The Maui Scheduler</a> or 
<a href="http://www.clusterresources.com/pages/products/moab-cluster-suite.php">
Moab Cluster Suite</a>. 
Please refer to its documentation for help. For any scheduler, you can check priorities 
of jobs using the command <span class="commandline">scontrol show job</span>.</p>

<p><a name="sharing"><b>4. Why does the srun --overcommit option not permit multiple jobs 
to run on nodes?</b></a><br>
The <b>--overcommit</b> option is a means of indicating that a job or job step is willing 
to execute more than one task per processor in the job's allocation. For example, 
consider a cluster of two processor nodes. The srun execute line may be something 
of this sort</p>
<blockquote>
<p><span class="commandline">srun --ntasks=4 --nodes=1 a.out</span></p>
</blockquote>
<p>This will result in not one, but two nodes being allocated so that each of the four 
tasks is given its own processor. Note that the srun <b>--nodes</b> option specifies 
a minimum node count and optionally a maximum node count. A command line of</p>
<blockquote>
<p><span class="commandline">srun --ntasks=4 --nodes=1-1 a.out</span></p>
</blockquote>
<p>would result in the request being rejected. If the <b>--overcommit</b> option 
is added to either command line, then only one node will be allocated for all 
four tasks to use.</p>
<p>More than one job can execute simultaneously on the same nodes through the use 
of srun's <b>--shared</b> option in conjunction with the <b>Shared</b> parameter 
in SLURM's partition configuration. See the man pages for srun and slurm.conf for 
more information.</p>

<p><a name="purge"><b>5. Why is my job killed prematurely?</b></a><br>
SLURM has a job purging mechanism to remove inactive jobs (resource allocations)
before they reach their time limit, which could be infinite.
This inactivity time limit is configurable by the system administrator. 
You can check its value with the command</p>
<blockquote>
<p><span class="commandline">scontrol show config | grep InactiveLimit</span></p>
</blockquote>
<p>The value of InactiveLimit is in seconds. 
A zero value indicates that job purging is disabled. 
A job is considered inactive if it has no active job steps or if the srun 
command creating the job is not responding.
In the case of a batch job, the srun command terminates after the job script 
is submitted. 
Therefore batch job pre- and post-processing is limited to the InactiveLimit.
Contact your system administrator if you believe the InactiveLimit value 
should be changed. 

<p><a name="opts"><b>6. Why are my srun options ignored?</b></a><br>
Everything after the command <span class="commandline">srun</span> is 
examined to determine if it is a valid option for srun. The first 
token that is not a valid option for srun is considered the command 
to execute and everything after that is treated as an option to 
the command. For example:</p>
<blockquote>
<p><span class="commandline">srun -N2 hostname -pdebug</span></p>
</blockquote>
<p>srun processes "-N2" as an option to itself. "hostname" is the 
command to execute and "-pdebug" is treated as an option to the 
hostname command. This would change the hostname of the computer 
on which SLURM executes the command, which is very bad. <b>Do not run 
this command as user root!</b></p>

<p><a name="backfill"><b>7. Why is the SLURM backfill scheduler not starting my job?
</b></a><br>
There are significant limitations in the current backfill scheduler plugin. 
It was designed to perform backfill node scheduling for a homogeneous cluster.
It does not manage scheduling on individual processors (or other consumable 
resources). It does not update the required or excluded node list of 
individual jobs. It does not support jobs with constraints/features unless 
the exclusive OR operator is used in the constraint expression. 
You can use the scontrol show command to check if these conditions apply.</p> 
<ul>
<li>Partition: State=UP</li>
<li>Partition: RootOnly=NO</li>
<li>Partition: Shared=NO</li>
<li>Job: ReqNodeList=NULL</li>
<li>Job: ExcNodeList=NULL</li>
<li>Job: Contiguous=0</li>
<li>Job: Features=NULL</li>
<li>Job: MinProcs, MinMemory, and MinTmpDisk satisfied by all nodes in 
the partition</li>
<li>Job: MinProcs or MinNodes not to exceed partition's MaxNodes</li>
</ul>
<p>If the partition's specifications differ from those listed above, 
no jobs in that partition will be scheduled by the backfill scheduler. 
Those jobs will only be scheduled on a First-In-First-Out (FIFO) basis.</p>
<p>Jobs failing to satisfy the requirements above (i.e. with specific 
node requirements) will not be considered candidates for backfill 
scheduling and other jobs may be scheduled ahead of these jobs. 
These jobs are subject to starvation, but will not block other 
jobs from running when sufficient resources are available for them.</p>

<p><a name="steps"><b>8. How can I run multiple jobs from within a 
single script?</b></a><br>
A SLURM job is just a resource allocation. You can execute many 
job steps within that allocation, either in parallel or sequentially. 
Some jobs actually launch thousands of job steps this way. The job 
steps will be allocated nodes that are not already allocated to 
other job steps. This essentially provides a second level of resource 
management within the job for the job steps.</p>

<p><a name="orphan"><b>9. Why do I have job steps when my job has 
already COMPLETED?</b></a><br>
NOTE: This only applies to systems configured with 
<i>SwitchType=switch/elan</i> or <i>SwitchType=switch/federation</i>.
All other systems will purge all job steps on job completion.</p>
<p>SLURM maintains switch (network interconnect) information within 
the job step for Quadrics Elan and IBM Federation switches. 
This information must be maintained until we are absolutely certain 
that the processes associated with the switch have been terminated 
to avoid the possibility of re-using switch resources for other 
jobs (even on different nodes).
SLURM considers jobs COMPLETED when all nodes allocated to the 
job are either DOWN or confirm termination of all its processes.
This enables SLURM to purge job information in a timely fashion 
even when there are many failing nodes.
Unfortunately the job step information may persist longer.</p>

<p><a name="multi_batch"><b>10. How can I run a job within an existing
job allocation?</b></a><br>
There is a srun option <i>--jobid</i> that can be used to specify 
a job's ID. 
For a batch job or within an existing resource allocation, the 
environment variable <i>SLURM_JOB_ID</i> has already been defined, 
so all job steps will run within that job allocation unless 
otherwise specified.
The one exception to this is when submitting batch jobs. 
When a batch job is submitted from within an existing batch job, 
it is treated as a new job allocation request and will get a 
new job ID unless explicitly set with the <i>--jobid</i> option. 
If you specify that a batch job should use an existing allocation, 
that job allocation will be released upon the termination of 
that batch job.</p>
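<p>For example, to run a job step from another shell inside an existing 
allocation (the job ID shown is only an example):</p>
<pre>
$ srun --jobid=65541 -N1 hostname
</pre>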

<p><a name="user_env"><b>11. How does SLURM establish the environment 
for my job?</b></a><br>
SLURM processes are not run under a shell, but directly exec'ed 
by the <i>slurmd</i> daemon (assuming <i>srun</i> is used to launch 
the processes).
The environment variables in effect at the time the <i>srun</i> command 
is executed are propagated to the spawned processes. 
The <i>~/.profile</i> and <i>~/.bashrc</i> scripts are not executed 
as part of the process launch.</p>

<p><a name="prompt"><b>12. How can I get shell prompts in interactive 
mode?</b></a><br>
<i>srun -u bash -i</i><br>
Srun's <i>-u</i> option turns off buffering of stdout.
Bash's <i>-i</i> option tells it to run in interactive mode (with prompts).

<p><a name="batch_out"><b>13. How can I get the task ID in the output 
or error file name for a batch job?</b></a><br>
<p>If you want separate output by task, you will need to build a script 
containing this specification. For example:</p>
<pre>
$ cat test
#!/bin/sh
echo begin_test
srun -o out_%j_%t hostname

$ sbatch -n7 -o out_%j test
sbatch: Submitted batch job 65541

$ ls -l out*
-rw-rw-r--  1 jette jette 11 Jun 15 09:15 out_65541
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_0
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_1
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_2
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_3
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_4
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_5
-rw-rw-r--  1 jette jette  6 Jun 15 09:15 out_65541_6

$ cat out_65541
begin_test

$ cat out_65541_2
tdev2
</pre>

<p><a name="parallel_make"><b>14. Can the <i>make</i> command
utilize the resources allocated to a SLURM job?</b></a><br>
Yes. There is a patch available for GNU make version 3.81 
available as part of the SLURM distribution in the file 
<i>contribs/make.slurm.patch</i>. 
This patch will use SLURM to launch tasks across a job's current resource
allocation. Depending upon the size of modules to be compiled, this may
or may not improve performance. If most modules are thousands of lines
long, the use of additional resources should more than compensate for the
overhead of SLURM's task launch. Use with make's <i>-j</i> option within an
existing SLURM allocation. Outside of a SLURM allocation, make's behavior
will be unchanged.</p>

<p><a name="terminal"><b>15. Can tasks be launched with a remote 
terminal?</b></a><br>
In SLURM version 1.3 or higher, use srun's <i>--pty</i> option.
Until then, you can accomplish this by starting an appropriate program 
or script. In the simplest case (X11 over TCP with the DISPLAY 
environment already set), executing <i>srun xterm</i> may suffice. 
In the more general case, the following scripts should work. 
<b>NOTE: The pathnames of the additional scripts are set in the 
variables BS and IS of the first script. You must change these to match 
your installation.</b>
Execute the script with the sbatch options desired.
For example, <i>interactive -N2 -pdebug</i>.

<pre>
#!/bin/bash
# -*- coding: utf-8 -*-
# Author: P&auml;r Andersson (National Supercomputer Centre, Sweden)
# Version: 0.3 2007-07-30
# 
# This will submit a batch script that starts screen on a node. 
# Then ssh is used to connect to the node and attach the screen. 
# The result is very similar to an interactive shell in PBS 
# (qsub -I)

# Batch Script that starts SCREEN
BS=/INSTALL_DIRECTORY/_interactive
# Interactive screen script
IS=/INSTALL_DIRECTORY/_interactive_screen

# Submit the job and get the job id
JOB=`sbatch --output=/dev/null --error=/dev/null $@ $BS 2>&1 \
    | egrep -o -e "\b[0-9]+$"`

# Make sure the job is always canceled
trap "{ /usr/bin/scancel -q $JOB; exit; }" SIGINT SIGTERM EXIT

echo "Waiting for JOBID $JOB to start"
while true;do
    sleep 5s

    # Check job status
    STATUS=`squeue -j $JOB -t PD,R -h -o %t`
    
    if [ "$STATUS" = "R" ];then
	# Job is running, break the while loop
	break
    elif [ "$STATUS" != "PD" ];then
	echo "Job is not Running or Pending. Aborting"
	scancel $JOB
	exit 1
    fi
    
    echo -n "."
    
done

# Determine the first node in the job:
NODE=`srun --jobid=$JOB -N1 hostname`

# SSH to the node and attach the screen
sleep 1s
ssh -X -t $NODE $IS slurm$JOB
# The trap will now cancel the job before exiting.
</pre>

<p>NOTE: The above script executes the script below, 
named <i>_interactive</i>.</p>
<pre>
#!/bin/sh
# -*- coding: utf-8 -*-
# Author: P&auml;r Andersson  (National Supercomputer Centre, Sweden)
# Version: 0.2 2007-07-30
# 
# Simple batch script that starts SCREEN.

exec screen -Dm -S slurm$SLURM_JOB_ID
</pre>

<p>The following script named <i>_interactive_screen</i> is also used.</p>
<pre>
#!/bin/sh
# -*- coding: utf-8 -*-
# Author: P&auml;r Andersson  (National Supercomputer Centre, Sweden)
# Version: 0.3 2007-07-30
#

SCREENSESSION=$1

# If DISPLAY is set then set that in the screen, then create a new
# window with that environment and kill the old one.
if [ "$DISPLAY" != "" ];then
    screen -S $SCREENSESSION -X unsetenv DISPLAY
    screen -p0 -S $SCREENSESSION -X setenv DISPLAY $DISPLAY
    screen -p0 -S $SCREENSESSION -X screen
    screen -p0 -S $SCREENSESSION -X kill
fi

exec screen -S $SCREENSESSION -rd
</pre>

<p><a name="force"><b>16. What does &quot;srun: Force Terminated job&quot; 
indicate?</b></a><br>
The srun command normally terminates when the standard output and 
error I/O from the spawned tasks end. This does not necessarily 
happen at the same time that a job step is terminated. For example, 
a file system problem could render a spawned task non-killable
at the same time that I/O to srun is pending. Alternately a network 
problem could prevent the I/O from being transmitted to srun.
In any event, the srun command is notified when a job step is 
terminated, either upon reaching its time limit or being explicitly 
killed. If the srun has not already terminated, the message 
&quot;srun: Force Terminated job&quot; is printed. 
If the job step's I/O does not terminate in a timely fashion
thereafter, pending I/O is abandoned and the srun command 
exits.</p>

<p><a name="early_exit"><b>17. What does this mean: 
&quot;srun: First task exited 30s ago&quot;
followed by &quot;srun Job Failed&quot;?</b></a><br>
The srun command monitors when tasks exit. By default, 30 seconds 
after the first task exits, the job is killed. 
This typically indicates some type of job failure and continuing
to execute a parallel job when one of the tasks has exited is
not normally productive. This behavior can be changed using srun's
<i>--wait=&lt;time&gt;</i> option to either change the timeout
period or disable the timeout altogether. See srun's man page
for details.</p>

<p><a name="memlock"><b>18. Why is my MPI job  failing due to the 
locked memory (memlock) limit being too low?</b></a><br>
By default, SLURM propagates all of your resource limits at the 
time of job submission to the spawned tasks. 
This can be disabled by specifically excluding the propagation of
specific limits in the <i>slurm.conf</i> file. For example
<i>PropagateResourceLimitsExcept=MEMLOCK</i> might be used to 
prevent the propagation of a user's locked memory limit from a 
login node to a dedicated node used for his parallel job.
If the user's resource limit is not propagated, the limit in 
effect for the <i>slurmd</i> daemon will be used for the spawned job.
A simple way to control this is to ensure that user <i>root</i> has a 
sufficiently large resource limit and that <i>slurmd</i> takes 
full advantage of this limit. For example, you can set user root's
locked memory limit to be unlimited on the compute nodes (see
<i>"man limits.conf"</i>) and add something like
<i>"ulimit -l unlimited"</i> to the <i>/etc/init.d/slurm</i>
script used to initiate <i>slurmd</i>. 
Related information about <a href="#pam">PAM</a> is also available.</p>
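<p>A minimal sketch of that approach (the values shown are only examples):</p>
<pre>
# slurm.conf: do not propagate the submit host's locked memory limit
PropagateResourceLimitsExcept=MEMLOCK

# /etc/security/limits.conf on the compute nodes
*   soft   memlock   unlimited
*   hard   memlock   unlimited

# /etc/init.d/slurm, before slurmd is started
ulimit -l unlimited
</pre>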

<p><a name="inactive"><b>19. Why is my batch job that launches no 
job steps being killed?</b></a><br>
SLURM has a configuration parameter <i>InactiveLimit</i> intended 
to kill jobs that do not spawn any job steps for a configurable
period of time. Your system administrator may modify the <i>InactiveLimit</i>
to satisfy your needs. Alternately, you can just spawn a job step
at the beginning of your script to execute in the background. It
will be purged when your script exits or your job otherwise terminates.
A line of this sort near the beginning of your script should suffice:<br>
<i>srun -N1 -n1 sleep 999999 &</i></p>

<p><a name="arbitrary"><b>20. How do I run specific tasks on certain nodes
in my allocation?</b></a><br>
One of the distribution methods for srun '<b>-m</b>
or <b>--distribution</b>' is 'arbitrary'.  This means you can tell SLURM to  
lay out your tasks in any fashion you want.  For instance, if I had an
allocation of 2 nodes and wanted to run 4 tasks on the first node and
1 task on the second, and my nodes allocated from SLURM_NODELIST
were tux[0-1], my srun line would look like this:<p>
<i>srun -n5 -m arbitrary -w tux[0,0,0,0,1] hostname</i><p>
If I wanted something similar, but wanted the third task to be on tux 1,
I could run this:<p>
<i>srun -n5 -m arbitrary -w tux[0,0,1,0,0] hostname</i><p> 
Here is a simple perl script named arbitrary.pl that can be run to easily lay
out tasks on nodes as they appear in SLURM_NODELIST:<p>
<pre>
#!/usr/bin/perl
# Build an srun -w node list: repeat each node of SLURM_NODELIST
# according to the per-node task counts given as the first argument.
my @tasks = split(',', $ARGV[0]);
my @nodes = `scontrol show hostnames $ENV{SLURM_NODELIST}`;
my $node_cnt = $#nodes + 1;
my $task_cnt = $#tasks + 1;

if ($node_cnt < $task_cnt) {
	print STDERR "ERROR: You only have $node_cnt nodes, but requested layout on $task_cnt nodes.\n";
	$task_cnt = $node_cnt;
}

my $cnt = 0;
my $layout;
foreach my $task (@tasks) {
	my $node = $nodes[$cnt];
	last if !$node;
	chomp($node);
	for(my $i=0; $i < $task; $i++) {
		$layout .= "," if $layout;
		$layout .= "$node";
	}
	$cnt++;
}
print $layout;
</pre>

We can now use this script in our srun line in this fashion:<p>
<i>srun -m arbitrary -n5 -w `arbitrary.pl 4,1` -l hostname</i><p>
This will lay out 4 tasks on the first node in the allocation and 1
task on the second node.</p>

<p><a name="hold"><b>21. How can I temporarily prevent a job from running 
(e.g. place it into a <i>hold</i> state)?</b></a><br>
The easiest way to do this is to change a job's earliest begin time
(optionally set at job submit time using the <i>--begin</i> option).
The example below places a job into hold state (preventing its initiation
for 30 days) and later permitting it to start now.</p>
<pre>
$ scontrol update JobId=1234 StartTime=now+30days
... later ...
$ scontrol update JobId=1234 StartTime=now
</pre>

<p class="footer"><a href="#top">top</a></p>


<h2>For Administrators</h2>

<p><a name="suspend"><b>1. How is job suspend/resume useful?</b></a><br>
Job suspend/resume is most useful to get particularly large jobs initiated 
in a timely fashion with minimal overhead. Say you want to get a full-system
job initiated. Normally you would need to either cancel all running jobs 
or wait for them to terminate. Canceling jobs results in the loss of 
their work to that point from either their beginning or last checkpoint.
Waiting for the jobs to terminate can take hours, depending upon your
system configuration. A more attractive alternative is to suspend the 
running jobs, run the full-system job, then resume the suspended jobs. 
This can easily be accomplished by configuring a special queue for 
full-system jobs and using a script to control the process. 
The script would stop the other partitions, suspend running jobs in those 
partitions, and start the full-system partition. 
The process can be reversed when desired.  
One can effectively gang schedule (time-slice) multiple jobs 
using this mechanism, although the algorithms to do so can get quite 
complex.
Suspending and resuming a job makes use of the SIGSTOP and SIGCONT 
signals respectively, so swap and disk space should be sufficient to 
accommodate all jobs allocated to a node, either running or suspended.
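<p>A sketch of such a control script is shown below; the partition name is 
only an example and a production script would need more error checking:</p>
<pre>
#!/bin/sh
# Keep new jobs from starting in the normal partition
scontrol update PartitionName=batch State=DOWN
# Suspend the jobs currently running there
for id in `squeue -h -p batch -t R -o %i`; do
    scontrol suspend $id
done
# ... run the full-system job in its own partition and wait for it ...
# Resume the suspended jobs and reopen the partition
for id in `squeue -h -p batch -t S -o %i`; do
    scontrol resume $id
done
scontrol update PartitionName=batch State=UP
</pre>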

<p><a name="fast_schedule"><b>2. How can I configure SLURM to use 
the resources actually found on a node rather than what is defined 
in <i>slurm.conf</i>?</b></a><br>
SLURM can either base its scheduling decisions upon the node
configuration defined in <i>slurm.conf</i> or what each node 
actually returns as available resources. 
This is controlled using the configuration parameter <i>FastSchedule</i>.
Set its value to zero in order to use the resources actually 
found on each node, but with a higher overhead for scheduling.
A value of one is the default and results in the node configuration 
defined in <i>slurm.conf</i> being used. See &quot;man slurm.conf&quot;
for more details.</p>
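<p>For example, in <i>slurm.conf</i>:</p>
<pre>
# Schedule based on the resources each node actually reports
FastSchedule=0
</pre>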

<p><a name="return_to_service"><b>3. Why is a node shown in state 
DOWN when the node has registered for service?</b></a><br>
The configuration parameter <i>ReturnToService</i> in <i>slurm.conf</i>
controls how DOWN nodes are handled. 
Set its value to one in order for DOWN nodes to automatically be 
returned to service once the <i>slurmd</i> daemon registers 
with a valid node configuration.
A value of zero is the default and results in a node staying DOWN 
until an administrator explicitly returns it to service using 
the command &quot;scontrol update NodeName=whatever State=RESUME&quot;.
See &quot;man slurm.conf&quot; and &quot;man scontrol&quot; for more 
details.</p>
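<p>For example, in <i>slurm.conf</i>:</p>
<pre>
# Return DOWN nodes to service automatically when slurmd registers
# with a valid configuration
ReturnToService=1
</pre>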

<p><a name="down_node"><b>4. What happens when a node crashes?</b></a><br>
A node is set DOWN when the slurmd daemon on it stops responding 
for <i>SlurmdTimeout</i> as defined in <i>slurm.conf</i>.
The node can also be set DOWN when certain errors occur or the 
node's configuration is inconsistent with that defined in <i>slurm.conf</i>.
Any active job on that node will be killed unless it was submitted 
with the srun option <i>--no-kill</i>.
Any active job step on that node will be killed. 
See the slurm.conf and srun man pages for more information.</p>
 
<p><a name="multi_job"><b>5. How can I control the execution of multiple 
jobs per node?</b></a><br>
There are two mechanisms to control this.
If you want to allocate individual processors on a node to jobs, 
configure <i>SelectType=select/cons_res</i>. 
See <a href="cons_res.html">Consumable Resources in SLURM</a>
for details about this configuration.  
If you want to allocate whole nodes to jobs, configure
<i>SelectType=select/linear</i>.
Each partition also has a configuration parameter <i>Shared</i>
that enables more than one job to execute on each node. 
See <i>man slurm.conf</i> for more information about these 
configuration parameters.</p>
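<p>A sketch of the relevant <i>slurm.conf</i> lines; the node and partition 
names, and the <i>SelectTypeParameters</i> value, are only examples:</p>
<pre>
# Allocate individual cores to jobs
SelectType=select/cons_res
SelectTypeParameters=CR_Core
# Allow more than one job per node in this partition
PartitionName=debug Nodes=tux[0-31] Shared=YES Default=YES State=UP
</pre>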

<p><a name="inc_plugin"><b>6. When the SLURM daemon starts, it 
prints &quot;cannot resolve X plugin operations&quot; and exits. 
What does this mean?</b></a><br>
This means that symbols expected in the plugin were 
not found by the daemon. This typically happens when the 
plugin was built or installed improperly or the configuration 
file is telling the daemon to use an old plugin (say from the 
previous version of SLURM). Restart the daemon in verbose mode 
for more information (e.g. &quot;slurmctld -Dvvvvv&quot;). 

<p><a name="sigpipe"><b>7. Why are user tasks intermittently dying
at launch with SIGPIPE error messages?</b></a><br>
If you are using LDAP or some other remote name service for
username and groups lookup, chances are that the underlying
libc library functions are triggering the SIGPIPE.  You can likely
work around this problem by setting <i>CacheGroups=1</i> in your slurm.conf
file.  However, be aware that you will need to run &quot;scontrol
reconfigure&quot; any time your groups database is updated.

<p><a name="maint_time"><b>8. How can I dry up the workload for a 
maintenance period?</b></a><br>
Create a resource reservation as described by SLURM's 
<a href="reservations.html">Resource Reservation Guide</a>.

<p><a name="pam"><b>9. How can PAM be used to control a user's limits on 
or access to compute nodes?</b></a><br>
First, enable SLURM's use of PAM by setting <i>UsePAM=1</i> in 
<i>slurm.conf</i>.<br>
Second, establish a PAM configuration file for slurm in <i>/etc/pam.d/slurm</i>.
A basic configuration you might use is:</p>
<pre>
auth     required  pam_localuser.so
account  required  pam_unix.so
session  required  pam_limits.so
</pre>
<p>Third, set the desired limits in <i>/etc/security/limits.conf</i>.
For example, to set the locked memory limit to unlimited for all users:</p>
<pre>
*   hard   memlock   unlimited
*   soft   memlock   unlimited
</pre>
<p>Finally, you need to disable SLURM's forwarding of the limits from the 
session from which the <i>srun</i> initiating the job ran. By default 
all resource limits are propagated from that session. For example, adding 
the following line to <i>slurm.conf</i> will prevent the locked memory 
limit from being propagated:<i>PropagateResourceLimitsExcept=MEMLOCK</i>.</p>

<p>We also have a PAM module for SLURM that prevents users from 
logging into nodes that they have not been allocated (except for user 
root, which can always login). pam_slurm is available for download from
<a href="https://sourceforge.net/projects/slurm/">
https://sourceforge.net/projects/slurm/</a> or use the
<a href="http://www.debian.org/">Debian</a> package
named <i>libpam-slurm</i>.
The use of pam_slurm does not require <i>UsePAM</i> to be set. The 
two uses of PAM are independent.</p>

<p><a name="time"><b>10. Why are jobs allocated nodes and then unable 
to initiate programs on some nodes?</b></a><br>
This typically indicates that the time on some nodes is not consistent 
with the node on which the <i>slurmctld</i> daemon executes. In order to 
initiate a job step (or batch job), the <i>slurmctld</i> daemon generates 
a credential containing a time stamp. If the <i>slurmd</i> daemon 
receives a credential containing a time stamp later than the current 
time or more than a few minutes in the past, it will be rejected. 
If you check in the <i>SlurmdLog</i> on the nodes of interest, you 
will likely see messages of this sort: "<i>Invalid job credential from 
&lt;some IP address&gt;: Job credential expired</i>." Make the times 
consistent across all of the nodes and all should be well.

<p><a name="ping"><b>11. Why does <i>slurmctld</i> log that some nodes 
are not responding even if they are not in any partition?</b></a><br>
The <i>slurmctld</i> daemon periodically pings the <i>slurmd</i> 
daemon on every configured node, even if not associated with any 
partition. You can control the frequency of this ping with the 
<i>SlurmdTimeout</i> configuration parameter in <i>slurm.conf</i>.

<p><a name="controller"><b>12. How should I relocated the primary or 
backup controller?</b></a><br>
If the cluster's computers used for the primary or backup controller
will be out of service for an extended period of time, it may be desirable
to relocate them. In order to do so, follow this procedure:</p>
<ol>
<li>Stop all SLURM daemons</li>
<li>Modify the <i>ControlMachine</i>, <i>ControlAddr</i>, 
<i>BackupController</i>, and/or <i>BackupAddr</i> in the <i>slurm.conf</i> file</li>
<li>Distribute the updated <i>slurm.conf</i> file to all nodes</li>
<li>Restart all SLURM daemons</li>
</ol>
<p>There should be no loss of any running or pending jobs. Ensure that
any nodes added to the cluster have a current <i>slurm.conf</i> file
installed.
<b>CAUTION:</b> If two nodes are simultaneously configured as the primary 
controller (two nodes on which <i>ControlMachine</i> specifies the local host 
and the <i>slurmctld</i> daemon is executing on each), system behavior will be
destructive. If a compute node has an incorrect <i>ControlMachine</i> or
<i>BackupController</i> parameter, that node may be rendered unusable, but no
other harm will result.
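<p>The relevant <i>slurm.conf</i> lines might look like the following; the 
host names and addresses are only examples:</p>
<pre>
ControlMachine=ctl1
ControlAddr=10.0.0.1
BackupController=ctl2
BackupAddr=10.0.0.2
</pre>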

<p><a name="multi_slurm"><b>13. Can multiple SLURM systems be run in 
parallel for testing purposes?</b></a><br>
Yes, this is a great way to test new versions of SLURM.
Just install the test version in a different location with a different 
<i>slurm.conf</i>. 
The test system's <i>slurm.conf</i> should specify different 
pathnames and port numbers to avoid conflicts.
The only problem is if more than one version of SLURM is configured 
with <i>switch/elan</i> or <i>switch/federation</i>.
In that case, there can be conflicting switch window requests from 
the different SLURM systems. 
This can be avoided by configuring the test system with <i>switch/none</i>.
MPI jobs started on an Elan or Federation switch system without the 
switch windows configured will not execute properly, but other jobs 
will run fine. 
Another option for testing on Elan or Federation systems is to use 
a different set of nodes for the different SLURM systems.
That will permit both systems to allocate switch windows without 
conflicts.

<p><a name="multi_slurmd"><b>14. Can slurm emulate a larger 
cluster?</b></a><br>
Yes, this can be useful for testing purposes. 
It has also been used to partition "fat" nodes into multiple SLURM nodes.
There are two ways to do this.
The best method for most conditions is to run one <i>slurmd</i> 
daemon per emulated node in the cluster as follows.
<ol>
<li>When executing the <i>configure</i> program, use the option 
<i>--enable-multiple-slurmd</i> (or add that option to your <i>~/.rpmmacros</i>
file).</li>
<li>Build and install SLURM in the usual manner.</li>
<li>In <i>slurm.conf</i> define the desired node names (arbitrary 
names used only by SLURM) as <i>NodeName</i> along with the actual
address of the physical node in <i>NodeHostname</i>. Multiple 
<i>NodeName</i> values can be mapped to a single
<i>NodeHostname</i>.  Note that each <i>NodeName</i> on a single
physical node needs to be configured to use a different port number.  You
will also want to use the "%n" symbol in slurmd related path options in
slurm.conf. </li>
<li>When starting the <i>slurmd</i> daemon, include the <i>NodeName</i>
of the node that it is supposed to serve on the execute line.</li> 
</ol>
<p>It is strongly recommended that SLURM version 1.2 or higher be used 
for this due to its improved support for multiple slurmd daemons.
See the
<a href="programmer_guide.html#multiple_slurmd_support">Programmers Guide</a>
for more details about configuring multiple slurmd support.</p>
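<p>A sketch of the relevant <i>slurm.conf</i> lines for this method; the node 
names, host name, ports and paths are only examples:</p>
<pre>
# Give each emulated node its own slurmd paths via the %n symbol
SlurmdSpoolDir=/var/spool/slurmd.%n
SlurmdLogFile=/var/log/slurm/slurmd.%n.log
SlurmdPidFile=/var/run/slurmd.%n.pid
# Four emulated nodes on one physical host, each using a different port
NodeName=node1 NodeHostname=physhost Port=17001
NodeName=node2 NodeHostname=physhost Port=17002
NodeName=node3 NodeHostname=physhost Port=17003
NodeName=node4 NodeHostname=physhost Port=17004
</pre>
<p>Each daemon is then started for a specific emulated node, for example
&quot;slurmd -N node1&quot;.</p>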

<p>In order to emulate a really large cluster, it can be more 
convenient to use a single <i>slurmd</i> daemon. 
That daemon will not be able to launch many tasks, but can 
suffice for developing or testing scheduling software.
Do not run job steps with more than a couple of tasks each
or execute more than a few jobs at any given time.
Doing so may result in the <i>slurmd</i> daemon exhausting its 
memory and failing. 
<b>Use this method with caution.</b>
<ol>
<li>Execute the <i>configure</i> program with your normal options
plus <i>--enable-front-end</i> (this will define HAVE_FRONT_END in
the resulting <i>config.h</i> file).</li>
<li>Build and install SLURM in the usual manner.</li>
<li>In <i>slurm.conf</i> define the desired node names (arbitrary
names used only by SLURM) as <i>NodeName</i> along with the actual
name and address of the <b>one</b> physical node in <i>NodeHostName</i>
and <i>NodeAddr</i>.
Up to 64k nodes can be configured in this virtual cluster.</li>
<li>Start your <i>slurmctld</i> and one <i>slurmd</i> daemon.
It is advisable to use the "-c" option to start the daemons without 
trying to preserve any state files from previous executions. 
Be sure to use the "-c" option when switching from this mode as well.</li>
<li>Create job allocations as desired, but do not run job steps
with more than a couple of tasks.</li>
</ol>

<pre>
$ ./configure --enable-debug --enable-front-end --prefix=... --sysconfdir=...
$ make install
$ grep NodeHostName slurm.conf
<i>NodeName=dummy[1-1200] NodeHostName=localhost NodeAddr=127.0.0.1</i>
$ slurmctld -c
$ slurmd -c
$ sinfo
<i>PARTITION AVAIL  TIMELIMIT NODES  STATE NODELIST</i>
<i>pdebug*      up      30:00  1200   idle dummy[1-1200]</i>
$ cat tmp
<i>#!/bin/bash</i>
<i>sleep 30</i>
$ srun -N200 -b tmp
<i>srun: jobid 65537 submitted</i>
$ srun -N200 -b tmp
<i>srun: jobid 65538 submitted</i>
$ srun -N800 -b tmp
<i>srun: jobid 65539 submitted</i>
$ squeue
<i>JOBID PARTITION  NAME   USER  ST  TIME  NODES NODELIST(REASON)</i>
<i>65537    pdebug   tmp  jette   R  0:03    200 dummy[1-200]</i>
<i>65538    pdebug   tmp  jette   R  0:03    200 dummy[201-400]</i>
<i>65539    pdebug   tmp  jette   R  0:02    800 dummy[401-1200]</i>
</pre>

<p><a name="extra_procs"><b>15. Can SLURM emulate nodes with more
resources than physically exist on the node?</b></a><br>
Yes in SLURM version 1.2 or higher.
In the <i>slurm.conf</i> file, set <i>FastSchedule=2</i> and specify
any desired node resource specifications (<i>Procs</i>, <i>Sockets</i>,
<i>CoresPerSocket</i>, <i>ThreadsPerCore</i>, and/or <i>TmpDisk</i>).
SLURM will use the resource specification for each node that is 
given in <i>slurm.conf</i> and will not check these specifications 
against those actually found on the node.
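<p>For example (the node name and resource values are only illustrative):</p>
<pre>
# slurm.conf
FastSchedule=2
NodeName=tux[0-15] Procs=64 TmpDisk=1024
</pre>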

<p><a name="credential_replayed"><b>16. What does a 
&quot;credential replayed&quot; 
error in the <i>SlurmdLogFile</i> indicate?</b></a><br>
This error is indicative of the <i>slurmd</i> daemon not being able 
to respond to job initiation requests from the <i>srun</i> command
in a timely fashion (a few seconds).
<i>Srun</i> responds by resending the job initiation request.
When the <i>slurmd</i> daemon finally starts to respond, it 
processes both requests.
The second request is rejected and the event is logged with
the "credential replayed" error.
If you check the <i>SlurmdLogFile</i> and <i>SlurmctldLogFile</i>, 
you should see signs of the <i>slurmd</i> daemon's non-responsiveness.
A variety of factors can be responsible for this problem
including
<ul>
<li>Diskless nodes encountering network problems</li>
<li>Very slow Network Information Service (NIS)</li>
<li>The <i>Prolog</i> script taking a long time to complete</li>
</ul>
<p>In Slurm version 1.2, this can be addressed with the 
<i>MessageTimeout</i> configuration parameter by setting a
value higher than the default 5 seconds.
In earlier versions of Slurm, the <i>--msg-timeout</i> option 
of <i>srun</i> serves a similar purpose.
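<p>For example, in <i>slurm.conf</i>:</p>
<pre>
# Raise the timeout above the 5 second default noted above
MessageTimeout=20
</pre>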

<p><a name="large_time"><b>17. What does 
&quot;Warning: Note very large processing time&quot; 
in the <i>SlurmctldLogFile</i> indicate?</b></a><br>
This error is indicative of some operation taking an unexpectedly
long time to complete, over one second to be specific.
Setting the <i>SlurmctldDebug</i> configuration parameter to a value 
of six or higher should identify which operation(s) are 
experiencing long delays.
This message typically indicates long delays in file system access
(writing state information or getting user information). 
Another possibility is that the node on which the slurmctld 
daemon executes has exhausted memory and is paging. 
Try running the program <i>top</i> to check for this possibility.

<p><a name="lightweight_core"><b>18. How can I add support for 
lightweight core files?</b></a><br>
SLURM supports lightweight core files by setting environment variables
based upon the <i>srun --core</i> option. Of particular note, it 
sets the <i>LD_PRELOAD</i> environment variable to load new functions
used to process a core dump. 
First you will need to acquire and install a shared object 
library with the appropriate functions.
Then edit the SLURM code in <i>src/srun/core-format.c</i> to 
specify a name for the core file type, 
add a test for the existence of the library, 
and set environment variables appropriately when it is used.

<p><a name="limit_propagation"><b>19. Is resource limit propagation
useful on a homogeneous cluster?</b></a><br>
Resource limit propagation permits a user to modify resource limits
and submit a job with those limits.
By default, SLURM automatically propagates all resource limits in 
effect at the time of job submission to the tasks spawned as part 
of that job. 
System administrators can utilize the <i>PropagateResourceLimits</i>
and <i>PropagateResourceLimitsExcept</i> configuration parameters to 
change this behavior.
Users can override defaults using the <i>srun --propagate</i> 
option. 
See <i>"man slurm.conf"</i> and <i>"man srun"</i> for more information 
about these options.
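<p>For example (the limit names shown are only illustrative):</p>
<pre>
# slurm.conf: propagate all limits except locked memory
PropagateResourceLimitsExcept=MEMLOCK

# A user explicitly propagating only selected limits for one job
srun --propagate=NOFILE,NPROC -n16 a.out
</pre>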

<p><a name="clock"><b>20. Do I need to maintain synchronized 
clocks on the cluster?</b></a><br>
In general, yes. Having inconsistent clocks may cause nodes to 
be unusable. SLURM log files should contain references to 
expired credentials. For example:
<pre>
error: Munge decode failed: Expired credential
ENCODED: Wed May 12 12:34:56 2008
DECODED: Wed May 12 12:01:12 2008
</pre>

<p><a name="cred_invalid"><b>21. Why are &quot;Invalid job credential&quot; 
errors generated?</b></a><br>
This error is indicative of SLURM's job credential files being inconsistent across 
the cluster. All nodes in the cluster must have the matching public and private 
keys as defined by <b>JobCredPrivateKey</b> and <b>JobCredPublicKey</b> in the 
slurm configuration file <b>slurm.conf</b>.

<p><a name="cred_replay"><b>22. Why are 
&quot;Task launch failed on node ... Job credential replayed&quot; 
errors generated?</b></a><br>
This error indicates that a job credential generated by the slurmctld daemon 
corresponds to a job that the slurmd daemon has already revoked. 
The slurmctld daemon selects job ID values based upon the configured 
value of <b>FirstJobId</b> (the default value is 1) and each job gets 
a value one larger than the previous job.
On job termination, the slurmctld daemon notifies the slurmd on each 
allocated node that all processes associated with that job should be 
terminated. 
The slurmd daemon maintains a list of the jobs which have already been 
terminated to avoid replay of task launch requests. 
If the slurmctld daemon is cold-started (with the &quot;-c&quot; option 
or &quot;/etc/init.d/slurm startclean&quot;), it starts job ID values 
over based upon <b>FirstJobId</b>.
If the slurmd is not also cold-started, it will reject job launch requests 
for jobs that it considers terminated. 
The solution to this problem is to cold-start all slurmd daemons whenever
the slurmctld daemon is cold-started.

<p><a name="globus"><b>23. Can SLURM be used with Globus?</b></a><br>
Yes. Build and install SLURM's Torque/PBS command wrappers along with 
the Perl APIs from SLURM's <i>contribs</i> directory and configure 
<a href="http://www-unix.globus.org/">Globus</a> to use those PBS commands.
Note there are RPMs available for both of these packages, named 
<i>torque</i> and <i>perlapi</i> respectively.

<p><a name="time_format"><b>24. Can SLURM time output format include the 
year?</b></a><br>
The default SLURM time format output is <i>MM/DD-HH:MM:SS</i>. 
Define &quot;ISO8601&quot; at SLURM build time to get the time format
<i>YYYY-MM-DDTHH:MM:SS</i>.
Note that this change in format will break anything that parses 
SLURM output expecting the old format (e.g. LSF, Maui or Moab).

<p><a name="file_limit"><b>25. What causes the error 
&quot;Unable to accept new connection: Too many open files&quot;?</b></a><br>
The srun command automatically increases its open file limit to 
the hard limit in order to process all of the standard input and output
connections to the launched tasks. It is recommended that you set the
open file hard limit to 8192 across the cluster.
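<p>For example, in <i>/etc/security/limits.conf</i> on the cluster nodes:</p>
<pre>
*   hard   nofile   8192
*   soft   nofile   8192
</pre>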

<p><a name="slurmd_log"><b>26. Why does the setting of <i>SlurmdDebug</i> 
fail to log job step information at the appropriate level?</b></a><br>
There are two programs involved here. One is <b>slurmd</b>, which is 
a persistent daemon running at the desired debug level. The second 
program is <b>slurmstepd</b>, which executes the user's job and whose
debug level is controlled by the user. Submitting the job with 
an option of <i>--debug=#</i> will result in the desired level of 
detail being logged in the <i>SlurmdLogFile</i> plus the output 
of the program.

<p><a name="rpm"><b>27. Why isn't the auth_none.so (or other file) in a 
SLURM RPM?</b></a><br>
The auth_none plugin is in a separate RPM and not built by default.
Using the auth_none plugin means that SLURM communications are not 
authenticated, so you probably do not want to run in this mode of operation 
except for testing purposes. If you want to build the auth_none RPM then 
add <i>--with auth_none</i> on the rpmbuild command line or add 
<i>%_with_auth_none</i> to your ~/.rpmmacros file. See the file slurm.spec 
in the SLURM distribution for a list of other options.

<p><a name="slurmdbd"><b>28. Why should I use the slurmdbd instead of the
regular database plugins?</b></a><br>
While the normal storage plugins will work fine without the added
layer of the slurmdbd, there are some great benefits to using the
slurmdbd:</p>
<ol>
<li>Added security. Using the slurmdbd you can have an authenticated
connection to the database.</li>
<li>Off-loading processing from the controller. With the slurmdbd there is no
slow down to the controller due to a slow or overloaded database.</li>
<li>Keeping enterprise-wide accounting from all slurm clusters in one database.
The slurmdbd is multi-threaded and designed to handle all the
accounting for the entire enterprise.</li>
<li>With the new database plugins (1.3+) you can query accounting stats with
sacct from any node slurm is installed on. With the slurmdbd you can also
query any cluster using the slurmdbd from any other cluster's nodes.</li>
</ol>

<p><a name="debug"><b>29. How can I build SLURM with debugging symbols?</b></a><br>
Set your CFLAGS environment variable before building. 
You want the "-g" option to produce debugging information and
"-O0" to set the optimization level to zero (off). For example:</p>
<pre>
CFLAGS="-g -O0" ./configure ...
</pre>

<p><a name="state_preserve"><b>30. How can I easily preserve drained node 
information between major SLURM updates?</b></a><br>
Major SLURM updates generally have changes in the state save files and 
communication protocols, so a cold-start (without state) is generally 
required. If you have nodes in a DRAIN state and want to preserve that
information, you can easily build a script to preserve that information
using the <i>sinfo</i> command. The following command line will report the 
<i>Reason</i> field for every node in a DRAIN state and write the output 
in a form that can be executed later to restore state.
<pre>
sinfo -t drain -h -o "scontrol update nodename='%N' state=drain reason='%E'"
</pre>
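<p>For example, the output can be saved to a file (the name used here is
arbitrary) and executed after the update:</p>
<pre>
sinfo -t drain -h -o "scontrol update nodename='%N' state=drain reason='%E'" > drain_info.sh
# ... perform the SLURM update and cold-start ...
sh drain_info.sh
</pre>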

<p><a name="health_check"><b>31. Why doesn't the <i>HealthCheckProgram</i>
execute on DOWN nodes?</a></b><br>
Hierarchical communications are used for sending this message. If there
are DOWN nodes in the communications hierarchy, messages will need to 
be re-routed. This limits SLURM's ability to tightly synchronize the
execution of the <i>HealthCheckProgram</i> across the cluster, which
could adversely impact performance of parallel applications. 
The use of CRON or node startup scripts may be better suited to ensure
that <i>HealthCheckProgram</i> gets executed on nodes that are DOWN
in SLURM. If you still want to have SLURM try to execute 
<i>HealthCheckProgram</i> on DOWN nodes, apply the following patch:
<pre>
Index: src/slurmctld/ping_nodes.c
===================================================================
--- src/slurmctld/ping_nodes.c  (revision 15166)
+++ src/slurmctld/ping_nodes.c  (working copy)
@@ -283,9 +283,6 @@
                node_ptr   = &node_record_table_ptr[i];
                base_state = node_ptr->node_state & NODE_STATE_BASE;

-               if (base_state == NODE_STATE_DOWN)
-                       continue;
-
 #ifdef HAVE_FRONT_END          /* Operate only on front-end */
                if (i > 0)
                        continue;
</pre>

<p><a name="batch_lost"><b>32. What is the meaning of the error 
&quot;Batch JobId=# missing from master node, killing it&quot;?</b></a><br>
A shell is launched on node zero of a job's allocation to execute
the submitted program. The <i>slurmd</i> daemon executing on each compute
node will periodically report to the <i>slurmctld</i> what programs it
is executing. If a batch program is expected to be running on some
node (i.e. node zero of the job's allocation) and is not found, the
message above will be logged and the job cancelled. This typically is 
associated with exhausting memory on the node or some other critical 
failure that cannot be recovered from. The equivalent message in 
earlier releases of slurm is 
&quot;Master node lost JobId=#, killing it&quot;.

<p><a name="accept_again"><b>33. What does the message
&quot;srun: error: Unable to accept connection: Resources temporarily unavailable&quot; 
indicate?</b></a><br>
This has been reported on some larger clusters running SUSE Linux when
a user's resource limits are reached. You may need to increase limits
for locked memory and stack size to resolve this problem.
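<p>For example, assuming PAM's <i>pam_limits</i> module is in use, the limits
could be raised in /etc/security/limits.conf (the values shown are only
illustrative):</p>
<pre>
*    soft    memlock    unlimited
*    hard    memlock    unlimited
*    soft    stack      unlimited
*    hard    stack      unlimited
</pre>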

<p><a name="task_prolog"><b>34. How could I automatically print a job's 
SLURM job ID to its standard output?</b></a><br>
The configured <i>TaskProlog</i> is the only thing that can write to 
the job's standard output or set extra environment variables for a job
or job step. To write to the job's standard output, precede the message
with "print ". To export environment variables, output a line of this
form "export name=value". The example below will print a job's SLURM
job ID and allocated hosts for a batch job only.

<pre>
#!/bin/sh
#
# Sample TaskProlog script that will print a batch job's
# job ID and node list to the job's stdout
#

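# Run only for the batch script itself (no step ID set) and only for task zero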
if [ X"$SLURM_STEP_ID" = "X" -a X"$SLURM_PROCID" = "X"0 ]
then
  echo "print =========================================="
  echo "print SLURM_JOB_ID = $SLURM_JOB_ID"
  echo "print SLURM_NODELIST = $SLURM_NODELIST"
  echo "print =========================================="
fi
</pre>
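<p>The script is enabled by pointing the <i>TaskProlog</i> parameter in
slurm.conf at it; the path shown here is only illustrative:</p>
<pre>
TaskProlog=/etc/slurm/taskprolog.sh
</pre>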

<p><a name="moab_start"><b>35. I run SLURM with the Moab or Maui scheduler.
How can I start a job under SLURM without the scheduler?</b></a><br>
When SLURM is configured to use the Moab or Maui scheduler, all submitted
jobs have their priority initialized to zero, which SLURM treats as a held
job. The job only begins when Moab or Maui decide where and when to start
the job, setting the required node list and setting the job priority to 
a non-zero value. To circumvent this, submit your job using a SLURM or
Moab command then manually set its priority to a non-zero value (must be
done by user root). For example:</p>
<pre>
$ scontrol update jobid=1234 priority=1000000
</pre>
<p>Note that changes in the configured value of <i>SchedulerType</i> only
take effect when the <i>slurmctld</i> daemon is restarted (reconfiguring
SLURM will not change this parameter). You will also need to manually
modify the priority of every pending job; one way to generate the needed
commands is shown below.
When changing to Moab or Maui scheduling, set every job priority to zero. 
When changing from Moab or Maui scheduling, set every job priority to a
non-zero value (preferably fairly large, say 1000000).</p>
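<p>One possible way to generate the needed <i>scontrol</i> commands for every
pending job is with <i>squeue</i>; review the output before executing it as
user root:</p>
<pre>
squeue -h -t PD -o "scontrol update jobid=%i priority=1000000"
</pre>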

<p><a name="orphan_procs"><b>36. Why are user processes and <i>srun</i>
running even though the job is supposed to be completed?</b></a><br>
SLURM relies upon a configurable process tracking plugin to determine
when all of the processes associated with a job or job step have completed.
Those plugins relying upon a kernel patch can reliably identify every process.
Those plugins dependent upon process group IDs or parent process IDs are not 
reliable. See the <i>ProctrackType</i> description in the <i>slurm.conf</i>
man page for details. We rely upon the sgi_job plugin for most systems.</p>
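<p>The plugin is selected with the <i>ProctrackType</i> parameter in
slurm.conf, for example:</p>
<pre>
ProctrackType=proctrack/sgi_job
</pre>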

<p><a name="slurmd_oom"><b>37. How can I prevent the <i>slurmd</i> and
<i>slurmstepd</i> daemons from being killed when a node's memory 
is exhausted?</b></a><br>
You can set the value of <i>/proc/self/oom_adj</i> for 
<i>slurmd</i> and <i>slurmstepd</i> by starting the <i>slurmd</i>
daemon with the <i>SLURMD_OOM_ADJ</i> and/or <i>SLURMSTEPD_OOM_ADJ</i>
environment variables set to the desired values.
A value of -17 typically will disable killing.</p>
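<p>For example, the daemon could be started directly with both variables set
to the value suggested above:</p>
<pre>
SLURMD_OOM_ADJ=-17 SLURMSTEPD_OOM_ADJ=-17 slurmd
</pre>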

<p><a name="ubuntu"><b>38. Why is the address of my calling node reported as
127.0.1.1 instead of its correct IP address?</b></a><br>
Some systems by default will put your host in the /etc/hosts file with
an entry like
<pre>
127.0.1.1	snowflake.llnl.gov	snowflake
</pre>
This causes srun and other commands to use 127.0.1.1 as the node's
address instead of the correct address, so communications fail.
The solution is either to remove this line or to configure a different
NodeAddr that is known by your other nodes.</p>
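<p>A distinct address can be configured on the node's definition in
slurm.conf; the address shown here is only illustrative:</p>
<pre>
NodeName=snowflake NodeAddr=192.168.1.10
</pre>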

<p class="footer"><a href="#top">top</a></p>

<p style="text-align:center;">Last modified 12 June 2009</p>

<!--#include virtual="footer.txt"-->