Submitting jobs
What is sbatch?
Slurm offers many options for managing the resources of a cluster, so a job can request almost any combination of needs: number of CPUs, number of nodes, memory, time, GPUs, licenses, etc.
The command sbatch is used to submit a batch script, which makes your job
run in the cluster, like this:
$ sbatch <batch_script>
A Slurm batch script is a shell script (usually written in bash) where you
specify all these options to Slurm, create the environment your job needs to
run correctly, and list the commands that run the job.
Thus, we say that a batch script has three parts:
Sbatch parameters:
The idea is to include all the information you think Slurm should know about your job (name, notification mail, partition, std_out, std_err, etc.) and to request all your computational needs, which consist at least of a number of CPUs, the expected computing time and the amount of RAM to use.
All these parameters must start with the comment #SBATCH, one per line, and must be placed at the beginning of the file, just after the shebang (e.g. #!/bin/bash), which should be the first line.
The following table [3] shows important and common options; for further information see man sbatch.
Sbatch options
| Option | Description | Possible value | Mandatory |
|---|---|---|---|
| -J, --job-name | Job’s name | Letters and numbers | no |
| -t, --time | Maximum walltime of the job | Numbers with the format DD-HH:MM:SS | yes |
| --mem | Requested memory per node | Size with units: 64G, 600M | no |
| -n, --ntasks | Number of tasks of the job | Number | no (default 1) |
| --ntasks-per-node | Number of tasks assigned to a node | Number | no (default 1) |
| -N, --nodes | Number of nodes requested | Number | no (default 1) |
| -c, --cpus-per-task | Number of threads per task | Number | no (default 1) |
| -p, --partition | Partition/queue where the job will be submitted | longjobs, bigmem, accel and debug | no (default longjobs) |
| --output | File where the standard output will be written | Letters and numbers | no |
| --error | File where the standard error will be written | Letters and numbers | no |
| --mail-type | Notify user by email when certain event types occur to the job | NONE, ALL, BEGIN, END, FAIL, REQUEUE, TIME_LIMIT, TIME_LIMIT_% | no |
| --mail-user | Email to receive notification of state changes | Valid email | no |
| --exclusive | The job allocation cannot share nodes with other running jobs | Does not have values | no |
| --test-only | Validate the batch script and return an estimate of when a job would be scheduled to run | Does not have values | no |
| --constraint | Some nodes have features associated with them. Use this option to specify which features the nodes associated with your job must have | The name of the feature to use | no |
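The DD-HH:MM:SS form listed for --time can be awkward to write by hand when you only know a duration in seconds. The following is a minimal sketch of a conversion; format_walltime is a hypothetical helper introduced here for illustration and is not part of Slurm.

```shell
#!/bin/bash
# Convert a duration in seconds to the DD-HH:MM:SS form used by --time.
# format_walltime is a hypothetical helper, not a Slurm command.
format_walltime() {
    local total="$1"
    local days=$(( total / 86400 ))
    local hours=$(( (total % 86400) / 3600 ))
    local minutes=$(( (total % 3600) / 60 ))
    local seconds=$(( total % 60 ))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

# One day, two hours, three minutes and four seconds
format_walltime 93784
```

The printed value can be pasted after #SBATCH --time= in a batch script.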
Note
Each option must be included using #SBATCH <option>=<value>
Warning
Some values of the options/parameters may be specific to our clusters.
Note
About the --mail-type option: the value TIME_LIMIT_% refers to the percentage of the walltime that has been reached. Thus, TIME_LIMIT_90 notifies you when 90% of the walltime has been reached, TIME_LIMIT_50 at 50%, etc.
Environment creation
Next, you should create the environment your job needs to run correctly. This often means including in your sbatch script the same steps you follow to run your application locally: exporting environment variables, creating or deleting files and directory structures, etc. Remember that a Slurm script is a shell script.
In case you want to submit a job that uses an application installed in our clusters, you have to load its module. An application module is used to create the specific environment needed by your application.
The following table [1] shows useful commands about modules.
Module useful commands
| Command | Functionality |
|---|---|
| module avail | Check what software packages are available |
| module whatis <module-name> | Find out more about a software package |
| module help <module-name> | A module file may include more detailed help for the software package |
| module show <module-name> | See exactly what effect loading the module will have |
| module list | Check which modules are currently loaded in your environment |
| module load <module-name> | Load a module |
| module unload <module-name> | Unload a module |
| module purge | Remove all loaded modules from your environment |
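The commands in the table can be combined into the environment-creation part of a batch script. The sketch below starts from a clean module set and then loads what the job needs; load_modules is a hypothetical helper name, and the sketch assumes that module load returns a non-zero status on failure, which may not hold for every version of the modules tool.

```shell
#!/bin/bash
# Start from a clean module environment and load the requested modules.
# load_modules is a hypothetical helper, not part of the modules tool.
load_modules() {
    local m
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found; cannot load: $*" >&2
        return 1
    fi
    module purge                      # remove modules inherited from the login shell
    for m in "$@"; do
        # Assumption: a failed load reports a non-zero exit status
        module load "$m" || { echo "could not load module: $m" >&2; return 1; }
    done
    module list                       # record what was loaded in the job output
}

# Intended use in the environment section of a batch script:
#   load_modules impi || exit 1
```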
Warning
By default, Slurm propagates the environment of the current user to the job. This could impact the behavior of the job. If you want a clean environment, add #SBATCH --export=NONE to your sbatch script. This option is particularly important for jobs that are submitted on one cluster and execute on a different cluster (e.g. with different paths).
Job(s) steps
Finally, you put the command(s) that execute your application, including all the parameters. You will often see the command srun calling the executable instead of the application binary being executed directly. For more information see the MPI jobs section.
There are other ways to submit jobs to Slurm beyond sbatch,
like salloc or simply srun. We recommend using sbatch, but
depending on the specific needs of your application those options could be better.
To know more, see: FAQ and Testing my job
Serial jobs
Serial jobs use only one process with one execution thread. This means one core of a CPU, given our configuration without HTT (Hyper-Threading Technology).
This kind of job does not take advantage of our computational resources but is the basic step to create more complex jobs.
In terms of Slurm, this job uses one task (process) and one
cpu-per-task (thread) in one node. In fact, we don’t need to specify any
resource, because the default value for those options in Slurm is 1.
Here is a good article about the differences between Processes and
Threads.
In the template below we specify ntasks=1 to make it explicit.
#!/bin/bash
#SBATCH --job-name=serial_test # Job name
#SBATCH --mail-type=FAIL,END # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-serial.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=slurm-serial.%j.err # Stderr (%j expands to jobId)
#SBATCH --ntasks=1 # Number of tasks (processes)
#SBATCH --time=01:00 # Walltime
#SBATCH --partition=longjobs # Partition
##### ENVIRONMENT CREATION #####
##### JOB COMMANDS ####
hostname
date
sleep 50
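Once a template like the one above is saved to a file, submitting it prints a confirmation line of the form "Submitted batch job <id>". The following sketch pulls the ID out of that line so it can be reused in later commands; job_id_from_output and the file name serial.sh are assumptions made for illustration.

```shell
#!/bin/bash
# Extract the job ID from sbatch's confirmation line.
# job_id_from_output is a hypothetical helper name.
job_id_from_output() {
    # sbatch prints "Submitted batch job <id>"; keep only the last word
    printf '%s\n' "${1##* }"
}

# Intended use on a login node (serial.sh is an assumed file name):
#   out=$(sbatch serial.sh)
#   jobid=$(job_id_from_output "$out")
#   squeue -j "$jobid"
job_id_from_output "Submitted batch job 12345"
```

As an alternative that avoids parsing, sbatch --parsable prints the job ID without the surrounding text.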
MPI jobs
MPI jobs are able to launch multiple processes on multiple nodes. There are many possible workflows using MPI; here we explain a basic one. Starting from this example and modifying its parameters, you can find the configuration for your specific need.
The example was compiled in Cronos using impi as follows:
$ module load impi
$ impicc hello_world_mpi.c -o mpi_hello_world_apolo
We submitted the classic “Hello world” MPI example [6] using 5 processes (--ntasks=5),
each one on a different machine (--ntasks-per-node=1). Just to be clear,
we used 5 machines and 1 CPU on each, leaving the other CPUs
(15, in this specific case) free to be allocated by Slurm to other jobs.
#!/bin/bash
#SBATCH --job-name=mpi_test # Job name
#SBATCH --mail-type=FAIL,END # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-mpi.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=slurm-mpi.%j.err # Stderr (%j expands to jobId)
#SBATCH --time=01:00 # Walltime
#SBATCH --partition=longjobs # Partition
#SBATCH --ntasks=5 # Number of tasks (processes)
#SBATCH --ntasks-per-node=1 # Number of task per node (machine)
##### ENVIRONMENT CREATION #####
module load impi
##### JOB COMMANDS ####
srun --mpi=pmi2 ./mpi_hello_world_apolo
Note
The use of srun is mandatory here. It creates the necessary
environment to launch the MPI processes. There you can also specify other parameters.
See man srun for more information.
Also, the use of --mpi=pmi2 is mandatory: it tells MPI to use Slurm’s pmi2
plugin. This could change when you are using a different implementation of MPI
(e.g. MVAPICH, OpenMPI), but we strongly encourage our users to specify it.
Output
HELLO_MPI - Master process:
C/MPI version
An MPI example program.
Process 3 says 'Hello, world!'
The number of processes is 5.
Process 0 says 'Hello, world!'
Elapsed wall clock time = 0.000019 seconds.
Process 1 says 'Hello, world!'
Process 4 says 'Hello, world!'
Process 2 says 'Hello, world!'
HELLO_MPI - Master process:
Normal end of execution: 'Goodbye, world!'
30 January 2019 09:29:56 AM
Warning
As you can see in that example, we did not specify -N or --nodes to submit
the job to 5 different machines. You can let Slurm decide how many machines your
job needs.
Try to think in terms of “tasks” rather than “nodes”.
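To see why no node count was needed, it helps to work out the smallest number of nodes that a given pair of --ntasks and --ntasks-per-node values can fit in, which is a ceiling division. This is only an illustrative sketch of that arithmetic; min_nodes is a hypothetical helper, and the scheduler makes the real decision.

```shell
#!/bin/bash
# Smallest number of nodes that can hold ntasks tasks when each node
# runs at most per_node of them (ceiling division).
# min_nodes is a hypothetical helper, not a Slurm command.
min_nodes() {
    local ntasks="$1" per_node="$2"
    echo $(( (ntasks + per_node - 1) / per_node ))
}

min_nodes 5 1    # the MPI example: 5 tasks, at most 1 per node
min_nodes 16 2   # 16 tasks, at most 2 per node
```

For the MPI example this gives 5, matching the 5 machines that were used.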
This table shows some other useful cases [2]:
| You want | You ask |
|---|---|
| N CPUs | |
| N CPUs spread across distinct nodes | |
| N CPUs spread across distinct nodes and nobody else around | |
| N CPUs spread across N/2 nodes | |
| N CPUs on the same node | |
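As a hedged sketch, the cases in the "You want" column could be requested with the option combinations below, taking N=4. These combinations are an assumption derived from the option descriptions earlier in this page, not values quoted from reference [2]. The groups are alternatives: copy only one of them into a batch script.

```shell
# 4 CPUs, placed wherever Slurm finds room
#SBATCH --ntasks=4

# 4 CPUs spread across 4 distinct nodes
#SBATCH --ntasks=4
#SBATCH --nodes=4

# 4 CPUs spread across 4 distinct nodes, with nobody else on those nodes
#SBATCH --ntasks=4
#SBATCH --nodes=4
#SBATCH --exclusive

# 4 CPUs spread across 2 nodes
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=2

# 4 CPUs on the same node
#SBATCH --ntasks=4
#SBATCH --ntasks-per-node=4
```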
Array jobs
Also called embarrassingly parallel, this setup is commonly used by users
who do not have a natively parallel application, so they run multiple
instances of their application in parallel, changing its input. Each instance is
independent and does not communicate with the others.
To do this, we specify an array using the sbatch parameter --array.
Multiple values may be specified using a comma-separated list and/or a
range of values with a “-” separator (e.g. --array=1,3,5-10 or --array=1,2,3).
These are the values that the variable SLURM_ARRAY_TASK_ID is
going to take in each array-job.
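To check which values SLURM_ARRAY_TASK_ID will take for a given --array specification before submitting, the list can be expanded by hand. The sketch below handles only comma-separated lists, "-" ranges and an optional "%<limit>" suffix; expand_array_spec is a hypothetical helper, and other forms accepted by Slurm are not covered.

```shell
#!/bin/bash
# Expand an --array specification such as "1,3,5-10" into its task IDs.
# Handles comma lists, "-" ranges and an optional "%<limit>" suffix only.
# expand_array_spec is a hypothetical helper, not a Slurm command.
expand_array_spec() {
    local spec part i end out
    spec="${1%%%*}"                  # drop the "%<limit>" suffix, if any
    out=""
    for part in $(printf '%s' "$spec" | tr ',' ' '); do
        case "$part" in
            *-*)
                i="${part%-*}"       # range start
                end="${part#*-}"     # range end
                while [ "$i" -le "$end" ]; do
                    out="$out $i"
                    i=$(( i + 1 ))
                done
                ;;
            *)
                out="$out $part"
                ;;
        esac
    done
    printf '%s\n' "${out# }"
}

expand_array_spec "1,3,5-10"
```

With the specification from the text, this prints the eight IDs 1 3 5 6 7 8 9 10, one array-job per ID.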
This input usually refers to these cases:
File input
You have multiple files/directories to process.
In the example/template below we made a “parallel copy” of the files contained in the test directory using the cp command.
./test/
├── file1.txt
├── file2.txt
├── file3.txt
├── file4.txt
└── file5.txt
We used one process (called task in Slurm) per array-job. The array goes from 0 to 4, so there were 5 processes copying the 5 files contained in the test directory.
#!/bin/bash
#SBATCH --job-name=array_file_test # Job name
#SBATCH --mail-type=FAIL,END # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-arrayJob%A_%a.out # Stdout (%A expands to the array's master job ID, %a to the array index)
#SBATCH --error=slurm-array%J.err # Stderr (%J expands to jobid.stepid)
#SBATCH --ntasks=1 # Number of tasks (processes) for each array-job
#SBATCH --time=01:00 # Walltime for each array-job
#SBATCH --partition=debug # Partition
#SBATCH --array=0-4 # Array index
##### ENVIRONMENT CREATION #####
##### JOB COMMANDS ####
# Array of files
files=(./test/*)
# Work based on the SLURM_ARRAY_TASK_ID
srun cp "${files[$SLURM_ARRAY_TASK_ID]}" "copy_$SLURM_ARRAY_TASK_ID"
Thus, the generated file copy_0 is the copy of the file test/file1.txt, the file copy_1 is the copy of the file test/file2.txt, and so on. Each copy was done by a different Slurm process in parallel.
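The array range in this template has to match the number of files, so it needs updating whenever the directory changes. One way around this, sketched below, is to compute the range from the directory and pass it on the command line, where it takes precedence over the #SBATCH --array line. array_range_for_dir and the script name array_file.sh are assumptions made for illustration.

```shell
#!/bin/bash
# Print the --array range that covers every entry of a directory, e.g. "0-4".
# array_range_for_dir is a hypothetical helper, not a Slurm command.
array_range_for_dir() {
    local dir="$1"
    set -- "$dir"/*              # same glob and ordering as files=(./test/*)
    if [ ! -e "$1" ]; then       # an unmatched glob stays literal: nothing to process
        echo "no files found in $dir" >&2
        return 1
    fi
    echo "0-$(( $# - 1 ))"
}

# Intended use (array_file.sh is an assumed script name):
#   sbatch --array="$(array_range_for_dir ./test)" array_file.sh
```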
Warning
Except for --array, ALL other #SBATCH options specified in the
submitted Slurm script are used to configure EACH array-job, including
ntasks, ntasks-per-node, time, mem, etc.
Parameters input
You have multiple parameters to process.
Similarly to the last example, we created an array with some values that we wanted to use as parameters of the application. We used one process (task) per array-job. We had 4 parameters (0.05 100 999 1295.5) to process and 4 array-jobs.
Force Slurm to run array-jobs in different nodes
To give another feature to this example, we used 1 node for each array-job. So, even though one node can run up to 16 processes (in the case of Cronos) and the 4 array-jobs could be assigned to 1 node, we forced Slurm to use 4 nodes.
To get this we use the parameter --exclusive. Thus, for each array-job, Slurm makes sure that no other Slurm job runs on the same node, not even another of your array-jobs.
Note
Just to be clear, the use of --exclusive as a SBATCH parameter tells Slurm that the job allocation cannot share nodes with other running jobs [4]. However, it has a slightly different meaning when you use it as a parameter of a job-step (each separate srun execution inside a SBATCH script, e.g. srun --exclusive $COMMAND). For further information see man srun.
#!/bin/bash
#SBATCH --job-name=array_params_test # Job name
#SBATCH --mail-type=FAIL,END # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-arrayJob%A_%a.out # Stdout (%A expands to the array's master job ID, %a to the array index)
#SBATCH --error=slurm-array%J.err # Stderr (%J expands to jobid.stepid)
#SBATCH --ntasks=1 # Number of tasks (processes) for each array-job
#SBATCH --time=01:00 # Walltime for each array-job
#SBATCH --partition=debug # Partition
#SBATCH --array=0-3 # Array index
#SBATCH --exclusive # Force slurm to use 4 different nodes
##### ENVIRONMENT CREATION #####
##### JOB COMMANDS ####
# Array of params
params=(0.05 100 999 1295.5)
# Work based on the SLURM_ARRAY_TASK_ID
srun echo "${params[$SLURM_ARRAY_TASK_ID]}"
Remember that the main idea behind using Array jobs in Slurm is based on the
use of the variable SLURM_ARRAY_TASK_ID.
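Because everything hinges on SLURM_ARRAY_TASK_ID, it can be worth guarding against an index that falls outside the parameter list, for example when the --array range and the list get out of sync. A minimal sketch follows, using the same parameter list as the example; param_for_task is a hypothetical helper, and the default index of 0 exists only so the snippet can be tried outside a job.

```shell
#!/bin/bash
# Same parameter list as the example above
params=(0.05 100 999 1295.5)

# Print the parameter for a given array index, or fail if it is out of range.
# param_for_task is a hypothetical helper name.
param_for_task() {
    local id="$1"
    if [ "$id" -lt 0 ] || [ "$id" -ge "${#params[@]}" ]; then
        echo "index $id is outside 0-$(( ${#params[@]} - 1 ))" >&2
        return 1
    fi
    echo "${params[$id]}"
}

# Inside a job Slurm sets SLURM_ARRAY_TASK_ID; default to 0 for a local dry run
param_for_task "${SLURM_ARRAY_TASK_ID:-0}"
```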
Note
The parameter ntasks specifies the number of processes that EACH
array-job is going to use. So if you want to use more, you
can just specify it. This idea also applies to all other sbatch
parameters.
Note
You can also limit the number of simultaneously running tasks from the job
array using a % separator. For example --array=0-15%4 will limit the
number of simultaneously running tasks from this job array to 4.
Slurm’s environment variables
In the above examples, we often used the environment variables provided by Slurm. Here is a table [3] with the most common variables.
| Variable | Functionality |
|---|---|
| | job Id |
| | Index of the slurm array |
| | Same as |
| | Same as |
| | Number of nodes allocated to job |
| | The directory from which |
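A quick way to see which of these variables a job actually receives is to print them at the start of the job commands. The sketch below filters the environment for a handful of standard Slurm variable names. The selection is an assumption chosen to match the descriptions in the table and is not quoted from it. print_slurm_env is a hypothetical helper.

```shell
#!/bin/bash
# Print a few common Slurm variables, one NAME=value per line.
# Variables that are not set (outside a job, or because the matching
# option was not requested) are simply absent from the output.
# print_slurm_env is a hypothetical helper name.
print_slurm_env() {
    env | grep -E '^SLURM_(JOB_ID|ARRAY_TASK_ID|NTASKS|CPUS_PER_TASK|JOB_NUM_NODES|SUBMIT_DIR)=' | sort
}

print_slurm_env
```

Placed right after the environment-creation section, this leaves a record of the allocation in the job's standard output file.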
Slurm’s file-patterns
sbatch allows filename patterns, which can be useful to name the std_err and
std_out files. Here is a table [3] with some of them.
| File-pattern | Expands to |
|---|---|
| %A | Job array’s master job allocation number |
| %a | Job array ID (index) number |
| %j | jobid of the running job |
| %x | Job name |
| %N | short hostname. This will create a separate IO file per node |
Note
If you need to separate the output of a job per each node requested, %N is
especially useful, for example in array-jobs.
For instance, if you use #SBATCH --output=job-%A.%a in an array-job, the output
files will be something like job-1234.1, job-1234.2, job-1234.3,
where 1234 refers to the job array’s master job allocation number and 1,
2 and 3 refer to the index of each array-job.
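The expansion described in this note can be previewed before submitting. The sketch below substitutes sample values into a pattern the same way the note describes; expand_pattern is a hypothetical helper that covers only %A, %a and %j, and the IDs passed to it are made-up examples.

```shell
#!/bin/bash
# Preview how a filename pattern would expand for given IDs.
# Arguments: pattern, array master job ID (%A), array index (%a), job ID (%j).
# expand_pattern is a hypothetical helper; Slurm performs the real expansion.
expand_pattern() {
    printf '%s\n' "$1" | sed -e "s/%A/$2/g" -e "s/%a/$3/g" -e "s/%j/$4/g"
}

# The pattern from the note, for the first three array-jobs
expand_pattern 'job-%A.%a' 1234 1 1235
expand_pattern 'job-%A.%a' 1234 2 1236
expand_pattern 'job-%A.%a' 1234 3 1237
```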
Constraining Features on a job
In Apolo II, one can specify which CPU instruction set to use. One can choose
between AVX2 and
AVX512. These features can be specified
using the SBATCH option --constraint=<list>, where <list> is the list of features to constrain.
For example, --constraint="AVX2" will allocate only nodes that have AVX2 in their instruction
set. --constraint="AVX2|AVX512" will allocate only nodes that have either AVX512 or AVX2.
One can also have a job requiring some nodes to have AVX2 and some others AVX512. For this
one would use the operators ‘&’ and ‘*’. The ampersand works as an ‘and’ operator, and the
‘*’ is used to specify the number of nodes that must comply with a single feature. For example,
--constraint="[AVX2*2&AVX512*3]" is asking for two nodes with AVX2 and three with AVX512.
The square brackets are mandatory.
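When a job accepts either feature, as with --constraint="AVX2|AVX512", the job commands may need to find out which instruction set the allocated node really offers, for example to pick between two builds of a program. The following is a minimal sketch for Linux x86 nodes that reads the CPU flags from /proc/cpuinfo. It assumes the Slurm feature names AVX2 and AVX512 correspond to the kernel flags avx2 and avx512f, which is not stated in this page. has_cpu_flag is a hypothetical helper.

```shell
#!/bin/bash
# Succeed if the CPU advertises the given flag (Linux x86 only).
# Arguments: flag name, optional path to a cpuinfo file (default /proc/cpuinfo).
# has_cpu_flag is a hypothetical helper name.
has_cpu_flag() {
    local flag="$1" cpuinfo="${2:-/proc/cpuinfo}"
    # The first "flags" line lists the instruction-set extensions of one core
    grep -m1 '^flags' "$cpuinfo" | grep -qw -- "$flag"
}

# Intended use in the job commands (binary names are assumed):
#   if has_cpu_flag avx512f; then srun ./app_avx512; else srun ./app_avx2; fi
```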