Submitting jobs

What is `sbatch`?

Slurm offers many options for managing the resources of a cluster, so you can request almost any combination of needs: number of CPUs, number of nodes, memory, time, GPUs, licenses, etc.

The sbatch command submits a batch script, which makes your job run on the cluster. Like this:

$ sbatch <batch_script>
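When the submission succeeds, sbatch prints a line of the form `Submitted batch job <jobid>`. The following is a minimal sketch of how a wrapper script could capture that ID; the helper name `parse_job_id` and the sample message are illustrative and not part of Slurm.

```shell
# Hypothetical helper: extract the job ID from the message printed by sbatch,
# which has the form "Submitted batch job <jobid>".
parse_job_id() {
    # Keep only the text after the last space
    printf '%s\n' "${1##* }"
}

# On a cluster you would capture the real message, for example:
#   message=$(sbatch <batch_script>)
# A sample message is used here so the snippet runs anywhere.
message="Submitted batch job 12345"
job_id=$(parse_job_id "$message")
echo "Job ID: $job_id"
```

sbatch also has a `--parsable` option that prints only the job ID (followed by the cluster name, if any), which avoids parsing the message at all.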

A Slurm batch script is a shell script (usually written in bash) in which you specify all these options to Slurm, set up the environment your job needs to run correctly, and list the commands that run the job.

Thus, we say that a batch script has three parts:

Sbatch parameters:

The idea is to include all the information you think Slurm should know about your job (name, notification mail, partition, std_out, std_err, etc.) and to request all your computational needs, which consist of at least a number of CPUs, the expected duration of the computation, and the amount of RAM to use.

All these parameters must start with the comment #SBATCH, one per line, and must be placed at the beginning of the file, just after the shebang (e.g. #!/bin/bash), which should be the first line.
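The placement matters in practice: according to man sbatch, sbatch stops processing #SBATCH directives at the first non-comment, non-whitespace line, so a directive written after a command is ignored. The sketch below flags such lines in a script. The function name `check_sbatch_placement` and the demo script are hypothetical, and the check is deliberately simple (it only looks at how each line starts).

```shell
# Hypothetical helper: report #SBATCH lines that appear after the first command.
# Usage: check_sbatch_placement <batch_script>
check_sbatch_placement() {
    seen_command=0
    misplaced=0
    line_no=0
    while IFS= read -r line || [ -n "$line" ]; do
        line_no=$((line_no + 1))
        case "$line" in
            '#SBATCH'*)
                if [ "$seen_command" -eq 1 ]; then
                    echo "line $line_no: directive after first command: $line"
                    misplaced=$((misplaced + 1))
                fi
                ;;
            '#'*) ;;    # shebang and ordinary comments
            *)
                # Anything that is not blank counts as a command
                if [ -n "$(printf '%s' "$line" | tr -d ' \t')" ]; then
                    seen_command=1
                fi
                ;;
        esac
    done < "$1"
    [ "$misplaced" -eq 0 ]
}

# Demo: the last directive comes after "hostname" and is reported.
demo=$(mktemp)
cat > "$demo" <<'EOF'
#!/bin/bash
#SBATCH --job-name=demo
#SBATCH --time=01:00

hostname
#SBATCH --ntasks=4
EOF
check_sbatch_placement "$demo" || echo "Move the reported lines above the first command."
rm -f "$demo"
```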

The following table [3] shows important and common options; for further information see man sbatch.

Sbatch options
Option	Description	Possible value	Mandatory
`-J, --job-name`	Job’s name	Letters and numbers	no
`-t, --time`	Maximum Walltime of the job	Numbers with the format DD-HH:MM:SS	yes
`--mem`	Requested memory per node	size with units: 64G, 600M	no
`-n, --ntasks`	Number of tasks of the job	Number	no (default 1)
`--ntasks-per-node`	Number of tasks assigned to a node	Number	no (default 1)
`-N, --nodes`	Number of nodes requested	Number	no (default 1)
`-c, --cpus-per-task`	Number of threads per task	Number	no (default 1)
`-p, --partition`	Partition/queue where the job will be submitted	longjobs, bigmem, accel and debug	no (default longjobs)
`--output`	File where the standard output will be written	Letters and numbers	no
`--error`	File where the standard error will be written	Letters and numbers	no
`--mail-type`	Notify user by email when certain event types occur to the job	NONE, ALL, BEGIN, END, FAIL, REQUEUE, TIME_LIMIT, TIME_LIMIT_%	no
`--mail-user`	Email address that receives notifications of state changes	Valid email	no
`--exclusive`	The job allocation cannot share nodes with other running jobs	Does not have values	no
`--test-only`	Validate the batch script and return an estimate of when a job would be scheduled to run	Does not have values	no
`--constraint`	Some nodes have features associated with them. Use this option to specify which features the nodes associated with your job must have	The name of the feature to use	no
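
As a small aid for the `--time` option, the sketch below converts a duration in seconds into the D-HH:MM:SS form listed in the table. The helper name `to_slurm_time` is hypothetical. Note that man sbatch also documents shorter forms such as minutes:seconds, so a value like `--time=01:00` means one minute, not one hour.

```shell
# Hypothetical helper: convert a duration in seconds to D-HH:MM:SS.
to_slurm_time() {
    total=$1
    days=$((total / 86400))
    hours=$((total % 86400 / 3600))
    minutes=$((total % 3600 / 60))
    seconds=$((total % 60))
    printf '%d-%02d:%02d:%02d\n' "$days" "$hours" "$minutes" "$seconds"
}

to_slurm_time 60        # one minute
to_slurm_time 5400      # one and a half hours
to_slurm_time 180000    # two days and two hours
```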

Note

Each option must be included using #SBATCH <option>=<value>

Warning

Some values of the options/parameters may be specific for our clusters.

Note

Regarding the --mail-type option, the % in TIME_LIMIT_% stands for the percentage of the walltime that has been reached: TIME_LIMIT_90 sends a notification when the job reaches 90% of its walltime, TIME_LIMIT_50 when it reaches 50%, etc.

Environment creation

Next, you should create the environment your job needs to run correctly. This often means including in your sbatch script the same steps you follow to run your application locally: exporting environment variables, creating or deleting files and directory structures, etc. Remember that a Slurm script is a shell script.

In case you want to submit a job that uses an application that is installed in our clusters you have to load its module.

An application module is used to create the specific environment needed by your application.

The following table [1] shows useful commands for working with modules.

Module useful commands
Command	Functionality
`module avail`	Check what software packages are available
`module whatis <module-name>`	Find out more about a software package
`module help <module-name>`	Show the more detailed help that a module file may include for the software package
`module show <module-name>`	See exactly what effect loading the module will have
`module list`	Check which modules are currently loaded in your environment
`module load <module-name>`	Load a module
`module unload <module-name>`	Unload a module
`module purge`	Remove all loaded modules from your environment
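
The commands in the table can be combined in the environment section of a batch script. The following is a minimal sketch, assuming the `module` command returns a non-zero status when a load fails (this may depend on the module system installed). The function name `load_modules` is hypothetical, and the guard lets the same script run on a machine without modules.

```shell
# Hypothetical helper: start from a clean module set, then load each named module.
load_modules() {
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not available; skipping: $*" >&2
        return 0
    fi
    module purge
    for mod in "$@"; do
        module load "$mod" || { echo "could not load $mod" >&2; return 1; }
    done
    module list
}

# Module name taken from the OpenMP example in this guide
load_modules intel/18.0.1
```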

Warning

Slurm always propagates the environment of the current user to the job. This could impact the behavior of the job. If you want a clean environment, add #SBATCH --export=NONE to your sbatch script. This option is particularly important for jobs that are submitted on one cluster and executed on a different cluster (e.g. with different paths).
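
The difference between a propagated and a clean environment can be illustrated locally, without Slurm. This is only an analogy: it uses `env -i` to start a child process with an empty environment, and the variable `MY_SETTING` is hypothetical.

```shell
# Local analogy only: this does not call Slurm.
export MY_SETTING=from_login_shell   # hypothetical variable

# A child process started normally inherits the variable ...
inherited=$(/bin/sh -c 'echo "${MY_SETTING:-unset}"')

# ... while one started with an empty environment (env -i) does not.
clean=$(env -i /bin/sh -c 'echo "${MY_SETTING:-unset}"')

echo "inherited: $inherited"
echo "clean:     $clean"
```

A job submitted with the default settings behaves like the first case, so variables exported in your login session can silently change what the job does.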

Job(s) steps

Finally, you put the command(s) that execute your application, including all its parameters. You will often see the executable launched through the srun command rather than invoked directly. For more information see the MPI jobs section.

There are other ways to submit jobs to Slurm besides sbatch, like salloc or simply using srun. We recommend using sbatch, but depending on the specific needs of your application those options could be better. To learn more, see: FAQ and Testing my job

Serial jobs 

Serial jobs use only one process with one execution thread. This means one core of a CPU, given that our configuration does not use HTT (Hyper-Threading Technology).

This kind of job does not take advantage of our computational resources but is the basic step to create more complex jobs.

In terms of Slurm, this job uses one task (process) and one cpu-per-task (thread) on one node. In fact, we don’t need to specify any resources, because the default value of those options in Slurm is 1.

Here is a good article about the differences between Processes and Threads.

In the template below we specify ntasks=1 to make it explicit.

serial-template.sh

#!/bin/bash

#SBATCH --job-name=serial_test       # Job name
#SBATCH --mail-type=FAIL,END         # Mail notification
#SBATCH --mail-user=<user>@<domain>  # User Email
#SBATCH --output=slurm-serial.%j.out # Stdout (%j expands to jobId)
#SBATCH --error=slurm-serial.%j.err  # Stderr (%j expands to jobId)
#SBATCH --ntasks=1                   # Number of tasks (processes)
#SBATCH --time=01:00                 # Walltime
#SBATCH --partition=longjobs         # Partition


##### ENVIRONMENT CREATION #####



##### JOB COMMANDS ####
hostname
date
sleep 50

Shared Memory jobs (OpenMP)

This setup creates parallelism using threads on a single machine. OpenMP handles communication between threads (-c in Slurm), but all of them must be on the same machine; it does not provide any communication between processes or threads on different physical machines.

In the example below we launched the classic “Hello world” OpenMP example [5]. It was compiled on Cronos using the Intel compiler 18.0.1 as follows:

$ module load intel/18.0.1
$ icc -fopenmp omp_hello.c -o hello_omp_intel_cronos

We used 16 threads, the maximum number allowed in Cronos’ longjobs partition. In terms of Slurm, we specify cpus-per-task=16 and ntasks=1.

openmp-template.sh

#!/bin/bash

#SBATCH --job-name=openmp_test      # Job name
#SBATCH --mail-type=FAIL,END        # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-omp.%j.out   # Stdout (%j expands to jobId)
#SBATCH --error=slurm-omp.%j.err    # Stderr (%j expands to jobId)
#SBATCH --time=01:00                # Walltime
#SBATCH --partition=longjobs        # Partition
#SBATCH --ntasks=1                  # Number of tasks (processes)
#SBATCH --cpus-per-task=16          # Number of threads per task (Cronos-longjobs)


##### ENVIRONMENT CREATION #####
module load intel/18.0.1


##### JOB COMMANDS ####
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK

./hello_omp_intel_cronos

Output

Hello World from thread = 8
Hello World from thread = 0
Number of threads = 16
Hello World from thread = 4
Hello World from thread = 15
Hello World from thread = 5
Hello World from thread = 3
Hello World from thread = 2
Hello World from thread = 10
Hello World from thread = 9
Hello World from thread = 1
Hello World from thread = 11
Hello World from thread = 6
Hello World from thread = 12
Hello World from thread = 14
Hello World from thread = 7
Hello World from thread = 13

Warning

Remember the maximum number of threads that can run at the same time on a compute node:

Apolo:
- Longjobs queue: 32
- Accel queue: 32
- Bigmem queue: 24
- Debug queue: 2
Cronos:
- Longjobs queue: 16

If you request more threads than this, your job will exceed the maximum degree of multiprocessing of the node, which causes a drastic decrease in the performance of your application. To learn more, see: FAQ

As extra information, our setup does not use HTT (Hyper-Threading Technology).
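
A script can guard against exceeding these limits before launching the application. The following is a sketch that hard-codes the limits from the list above; the function names `max_threads` and `threads_fit` are hypothetical, and the values would need updating if the cluster configuration changes.

```shell
# Hypothetical helper: print the per-node thread limit listed in the warning above.
# Usage: max_threads <cluster> <partition>
max_threads() {
    case "$1/$2" in
        apolo/longjobs|apolo/accel) echo 32 ;;
        apolo/bigmem)               echo 24 ;;
        apolo/debug)                echo 2 ;;
        cronos/longjobs)            echo 16 ;;
        *) echo "unknown cluster/partition: $1/$2" >&2; return 1 ;;
    esac
}

# Succeeds only if the requested thread count fits on one node.
# Usage: threads_fit <cluster> <partition> <threads>
threads_fit() {
    limit=$(max_threads "$1" "$2") || return 1
    [ "$3" -le "$limit" ]
}

if threads_fit cronos longjobs 16; then
    echo "16 threads fit on Cronos longjobs"
fi
if ! threads_fit apolo debug 4; then
    echo "4 threads exceed the Apolo debug limit"
fi
```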

Note

We highly recommend using the Slurm variable $SLURM_CPUS_PER_TASK to specify the number of threads that OpenMP is going to work with. Most applications use the variable OMP_NUM_THREADS to define it.
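
According to man sbatch, SLURM_CPUS_PER_TASK is only set when --cpus-per-task is specified. A slightly more defensive form of the export, shown here as a sketch, falls back to one thread when the variable is missing, which also lets the script run outside Slurm.

```shell
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task is requested,
# so fall back to a single thread otherwise.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "OpenMP will use $OMP_NUM_THREADS thread(s)"
```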

MPI jobs 

MPI jobs can launch multiple processes on multiple nodes. There are many possible workflows using MPI; here we explain a basic one. Starting from this example and modifying its parameters, you can find the configuration for your specific needs.

The example was compiled on Cronos using impi as follows:

$ module load impi
$ impicc hello_world_mpi.c -o mpi_hello_world_apolo

We submitted the classic “Hello world” MPI example [6] using 5 processes (--ntasks=5), each one on a different machine (--ntasks-per-node=1). Just to be clear, we used 5 machines and 1 CPU on each, leaving the other CPUs (15 per machine, in this specific case) free to be allocated by Slurm to other jobs.

mpi-template.sh

#!/bin/bash

#SBATCH --job-name=mpi_test         # Job name
#SBATCH --mail-type=FAIL,END        # Mail notification
#SBATCH --mail-user=<user>@<domain> # User Email
#SBATCH --output=slurm-mpi.%j.out   # Stdout (%j expands to jobId)
#SBATCH --error=slurm-mpi.%j.err    # Stderr (%j expands to jobId)
#SBATCH --time=01:00                # Walltime
#SBATCH --partition=longjobs        # Partition
#SBATCH --ntasks=5                  # Number of tasks (processes)
#SBATCH --ntasks-per-node=1         # Number of task per node (machine)


##### ENVIRONMENT CREATION #####
module load impi


##### JOB COMMANDS ####
srun --mpi=pmi2 ./mpi_hello_world_apolo

Note

The use of srun is mandatory here. It creates the necessary environment to launch the MPI processes. There you can also specify other parameters. See man srun for more information.

Also, the use of --mpi=pmi2 is mandatory; it tells MPI to use Slurm’s pmi2 plugin. This could change when you are using a different implementation of MPI (e.g. MVAPICH, OpenMPI), but we strongly encourage our users to specify it.

Output

HELLO_MPI - Master process:
  C/MPI version
  An MPI example program.

  Process 3 says 'Hello, world!'
  The number of processes is 5.

  Process 0 says 'Hello, world!'
  Elapsed wall clock time = 0.000019 seconds.
  Process 1 says 'Hello, world!'
  Process 4 says 'Hello, world!'
  Process 2 says 'Hello, world!'

HELLO_MPI - Master process:
  Normal end of execution: 'Goodbye, world!'

30 January 2019 09:29:56 AM

Warning

As you can see in that example, we did not specify -N or --nodes to submit the job on 5 different machines. You can let Slurm decide how many machines your job needs.

Try to think in terms of “tasks” rather than “nodes”.

This table shows some other useful cases [2]:

MPI jobs table
You want	You ask
N CPUs	`--ntasks=N`
N CPUs spread across distinct nodes	`--ntasks=N --nodes=N`
N CPUs spread across distinct nodes and nobody else around	`--ntasks=N --nodes=N --exclusive`
N CPUs spread across N/2 nodes	`--ntasks=N --ntasks-per-node=2`
N CPUs on the same node	`--ntasks=N --ntasks-per-node=N`
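
When thinking in tasks, it can still be useful to estimate how many nodes a request will need at minimum. The arithmetic is a ceiling division of the task count by the tasks per node, sketched below; the helper name `nodes_needed` is hypothetical, and the actual allocation is decided by Slurm.

```shell
# Hypothetical helper: minimum nodes for N tasks at K tasks per node (rounded up).
# Usage: nodes_needed <ntasks> <ntasks_per_node>
nodes_needed() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

nodes_needed 5 1     # the MPI example: 5 tasks, 1 per node
nodes_needed 16 2    # N CPUs spread across N/2 nodes
nodes_needed 5 2     # an odd task count still needs a third node
```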

Array jobs 

Also called embarrassingly parallel, this setup is commonly used by users who do not have a natively parallel application, so they run multiple instances of their application in parallel, changing its input. Each instance is independent and does not communicate with the others.

To do this, we specify an array using the sbatch parameter --array. Multiple values may be specified using a comma-separated list and/or a range of values with a “-” separator (e.g. --array=1,3,5-10 or --array=1,2,3). These are the values that the variable SLURM_ARRAY_TASK_ID takes in each array-job.
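
To check which indices a given specification produces before submitting, the list can be expanded locally. The following is a sketch only: the function name `expand_array_spec` is hypothetical, it handles comma-separated values and ranges, it ignores an optional trailing `%limit`, and it does not handle other syntax that Slurm may accept.

```shell
# Hypothetical helper: expand an --array specification such as "1,3,5-10"
# into the list of indices it describes.
# The body runs in a subshell "( ... )" so the IFS change stays local.
expand_array_spec() (
    spec=${1%%\%*}          # drop an optional %limit suffix
    IFS=,                   # split the specification on commas
    result=""
    for part in $spec; do
        case "$part" in
            *-*)
                # A range "first-last": enumerate every index in it
                i=${part%-*}
                last=${part#*-}
                while [ "$i" -le "$last" ]; do
                    result="$result $i"
                    i=$((i + 1))
                done
                ;;
            *)
                result="$result $part"
                ;;
        esac
    done
    echo "${result# }"
)

expand_array_spec 1,3,5-10
expand_array_spec 0-4
```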

This input usually refers to these cases:

File input

You have multiple files/directories to process.

In the example/template below we made a “parallel copy” of the files contained in the test directory using the cp command.

./test/
├── file1.txt
├── file2.txt
├── file3.txt
├── file4.txt
└── file5.txt

We used one process (called task in Slurm) per each array-job. The array goes from 0 to 4, so there were 5 processes copying the 5 files contained in the test directory.

array-file-input-template.sh

#!/bin/bash

#SBATCH --job-name=array_file_test       # Job name
#SBATCH --mail-type=FAIL,END             # Mail notification
#SBATCH --mail-user=<user>@<domain>      # User Email
#SBATCH --output=slurm-arrayJob%A_%a.out # Stdout (%A expands to the array's master job ID, %a to the array index)
#SBATCH --error=slurm-array%J.err        # Stderr (%J expands to jobid.stepid)
#SBATCH --ntasks=1                       # Number of tasks (processes) for each array-job
#SBATCH --time=01:00                     # Walltime for each array-job
#SBATCH --partition=debug                # Partition

#SBATCH --array=0-4    # Array index


##### ENVIRONMENT CREATION #####


##### JOB COMMANDS ####

# Array of files
files=(./test/*)

# Work based on the SLURM_ARRAY_TASK_ID
srun cp "${files[$SLURM_ARRAY_TASK_ID]}" "copy_${SLURM_ARRAY_TASK_ID}"

Thus, the generated file copy_0 is a copy of test/file1.txt, the file copy_1 is a copy of test/file2.txt, and so on. Each copy was made by a different Slurm process, in parallel.
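
Two details of this mapping are easy to miss: the shell sorts the glob lexicographically (so a file10.txt would come before file2.txt), and an array index beyond the number of files selects nothing. The sketch below adds a bounds check and can be run locally. It passes the glob expansion as arguments to a small function instead of indexing a bash array, the name `nth_input` is hypothetical, and the directory is recreated in a temporary location only for the demo.

```shell
# Hypothetical helper: print the input at position <index> (starting at 0)
# among the remaining arguments, or fail if the index is out of range.
nth_input() {
    index=$1
    shift
    if [ "$index" -ge "$#" ]; then
        echo "index $index out of range: only $# input(s)" >&2
        return 1
    fi
    shift "$index"
    printf '%s\n' "$1"
}

# Self-contained demo: recreate the test directory in a temporary location.
workdir=$(mktemp -d)
mkdir "$workdir/test"
for n in 1 2 3 4 5; do
    echo "data $n" > "$workdir/test/file$n.txt"
done

# Outside Slurm the variable is unset, so default to index 0 for a local dry run.
task_id=${SLURM_ARRAY_TASK_ID:-0}
input=$(nth_input "$task_id" "$workdir"/test/*) && cp "$input" "$workdir/copy_$task_id"
ls "$workdir"
rm -rf "$workdir"
```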

Warning

Except for --array, ALL other #SBATCH options specified in the submitted Slurm script are used to configure EACH array-job, including ntasks, ntasks-per-node, time, mem, etc.

Parameters input

You have multiple parameters to process.

Similarly to the last example, we created an array with some values that we wanted to use as parameters of the application. We used one process (task) per array-job. We had 4 parameters (0.05 100 999 1295.5) to process and 4 array-jobs.

Force Slurm to run array-jobs in different nodes

To add another feature to this example, we used 1 node for each array-job. Even though one node can run up to 16 processes (in the case of Cronos) and the 4 array-jobs could be assigned to a single node, we forced Slurm to use 4 nodes.

To achieve this we use the parameter --exclusive: for each array-job, Slurm makes sure that no other Slurm job runs on the same node, not even another array-job of yours.

Note

Just to be clear, the use of --exclusive as an SBATCH parameter tells Slurm that the job allocation cannot share nodes with other running jobs [4]. However, it has a slightly different meaning when you use it as a parameter of a job step (each separate srun execution inside an SBATCH script, e.g. srun --exclusive $COMMAND). For further information see man srun.

array-params-template.sh

#!/bin/bash

#SBATCH --job-name=array_params_test     # Job name
#SBATCH --mail-type=FAIL,END             # Mail notification
#SBATCH --mail-user=<user>@<domain>      # User Email
#SBATCH --output=slurm-arrayJob%A_%a.out # Stdout (%A expands to the array's master job ID, %a to the array index)
#SBATCH --error=slurm-array%J.err        # Stderr (%J expands to jobid.stepid)
#SBATCH --ntasks=1                       # Number of tasks (processes) for each array-job
#SBATCH --time=01:00                     # Walltime for each array-job
#SBATCH --partition=debug                # Partition

#SBATCH --array=0-3    # Array index
#SBATCH --exclusive    # Force slurm to use 4 different nodes

##### ENVIRONMENT CREATION #####


##### JOB COMMANDS ####

# Array of params
params=(0.05 100 999 1295.5)

# Work based on the SLURM_ARRAY_TASK_ID
srun echo "${params[$SLURM_ARRAY_TASK_ID]}"

Remember that the main idea behind using Array jobs in Slurm is based on the use of the variable SLURM_ARRAY_TASK_ID.
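
When the list of parameters is long, a common variation is to keep one parameter per line in a file and let each array-job pick its own line. The following is a sketch under that assumption; the helper name `param_for_task` and the temporary parameter file are hypothetical.

```shell
# Hypothetical helper: print line (<index> + 1) of a parameter file,
# so array index 0 maps to the first line.
# Usage: param_for_task <file> <index>
param_for_task() {
    sed -n "$(( $2 + 1 ))p" "$1"
}

# Self-contained demo with the four parameters used in the example.
params_file=$(mktemp)
printf '%s\n' 0.05 100 999 1295.5 > "$params_file"

task_id=${SLURM_ARRAY_TASK_ID:-0}    # defaults to 0 outside Slurm
echo "task $task_id uses parameter $(param_for_task "$params_file" "$task_id")"
rm -f "$params_file"
```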

Note

The parameter ntasks specifies the number of processes that EACH array-job is going to use, so if you want to use more, just specify a larger value. This idea also applies to all other sbatch parameters.

Note

You can also limit the number of simultaneously running tasks from the job array using a % separator. For example --array=0-15%4 will limit the number of simultaneously running tasks from this job array to 4.

Slurm’s environment variables 

In the above examples, we often used environment variables provided by Slurm. Here is a table [3] with the most common ones.

Output environment variables
Variable	Functionality
`SLURM_JOB_ID`	Job ID
`SLURM_ARRAY_TASK_ID`	Index of the Slurm array
`SLURM_CPUS_PER_TASK`	Same as `--cpus-per-task`
`SLURM_NTASKS`	Same as `-n`, `--ntasks`
`SLURM_JOB_NUM_NODES`	Number of nodes allocated to job
`SLURM_SUBMIT_DIR`	The directory from which `sbatch` was invoked
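
A sketch of how these variables might be used at the start of the job commands follows. Each one gets a fallback value so that the same lines also work in a local test run outside Slurm; the fallback values are assumptions chosen for illustration.

```shell
# Each Slurm variable gets a fallback so the lines also work outside Slurm.
job_id=${SLURM_JOB_ID:-local}
ntasks=${SLURM_NTASKS:-1}
cpus=${SLURM_CPUS_PER_TASK:-1}
submit_dir=${SLURM_SUBMIT_DIR:-$PWD}

# Work from the directory where sbatch was invoked
cd "$submit_dir" || exit 1
echo "job $job_id: $ntasks task(s) x $cpus CPU(s) = $((ntasks * cpus)) CPU(s), working in $submit_dir"
```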

Slurm’s file-patterns 

sbatch allows filename patterns, which can be useful for naming the std_err and std_out files. Here is a table [3] with some of them.

Slurm’s file-patterns
File-pattern	Expands to
`%A`	Job array’s master job allocation number
`%a`	Job array ID (index) number
`%j`	Job ID of the running job
`%x`	Job name
`%N`	Short hostname. This will create a separate IO file per node

Note

If you need to separate the output of a job per requested node, %N is especially useful, for example in array-jobs.

For instance, if you use #SBATCH --output=job-%A.%a in an array-job, the output files will be named something like job-1234.1, job-1234.2, job-1234.3, where 1234 is the job array’s master job allocation number and 1, 2 and 3 are the indices of the individual array-jobs.
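
To preview the file name a pattern will produce, the substitution can be imitated locally. This is a sketch covering only %A, %a, %j and %x; the function name `preview_pattern` and the sample IDs are hypothetical, and the values are assumed not to contain `/` or `&`, which are special to sed.

```shell
# Hypothetical helper: preview a file name by substituting %A, %a, %j and %x.
# Usage: preview_pattern <pattern> <master_job_id> <array_index> <job_id> <job_name>
preview_pattern() {
    printf '%s\n' "$1" |
        sed -e "s/%A/$2/g" -e "s/%a/$3/g" -e "s/%j/$4/g" -e "s/%x/$5/g"
}

preview_pattern 'job-%A.%a' 1234 2 1236 array_test
preview_pattern 'slurm-%x.%j.out' 1234 2 1236 array_test
```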

Constraining Features on a job 

In Apolo II, one can specify which CPU instruction set to use, choosing between AVX2 and AVX512. These features can be specified using the SBATCH option --constraint=<list>, where <list> is the list of features to constrain. For example, --constraint="AVX2" will allocate only nodes that have AVX2 in their instruction set, and --constraint="AVX2|AVX512" will allocate only nodes that have either AVX512 or AVX2.

One can also have a job that requires some nodes with AVX2 and others with AVX512. For this, one uses the operators ‘&’ and ‘*’. The ampersand works as an ‘and’ operator, and the ‘*’ specifies the number of nodes that must have a given feature. For example, --constraint="[AVX2*2&AVX512*3]" asks for two nodes with AVX2 and three with AVX512. The square brackets are mandatory.
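
Because the bracketed expression is easy to mistype, it can be assembled from feature/count pairs. The sketch below only builds the string; the function name `build_constraint` is hypothetical.

```shell
# Hypothetical helper: build "[F1*N1&F2*N2...]" from feature/count pairs.
# Usage: build_constraint <feature> <count> [<feature> <count> ...]
build_constraint() {
    joined=""
    while [ "$#" -ge 2 ]; do
        # Append "&" only between terms, never before the first one
        joined="${joined:+$joined&}$1*$2"
        shift 2
    done
    printf '[%s]\n' "$joined"
}

build_constraint AVX2 2 AVX512 3
```

When the result is passed on the command line, keep it inside double quotes, since `&`, `*` and the brackets are special characters for the shell.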