Job Scheduling & Resource Allocation¶
An HPC cluster needs a way for users to access its computational capacity in a fair and efficient manner. It does this using a scheduler. The scheduler takes user requests in the form of jobs and allocates resources to these jobs based on availability and cluster policy.
The Easley cluster uses the Slurm scheduler. Slurm is a proven job scheduler used by many of the top universities and research institutions in the world. It is open source, fault-tolerant, and highly scalable. Slurm is not the scheduler used on the Hopper cluster, which uses Moab/Torque. Both schedulers do the same job, but each implements it with its own commands and terminology.
Job Submission¶
Job submission is the process of requesting resources from the scheduler. It is the gateway to all the computational horsepower in the cluster. Users submit jobs to tell the scheduler what resources are needed and for how long. The scheduler then evaluates the request according to resource availability and cluster policy to determine when the job will run and which resources to use.
How to Submit a Job¶
Jobs are submitted with the Slurm sbatch command. Its options, also called directives, specify resource requirements and other job attributes. Directives can appear in a job script as header lines (#SBATCH), as command-line options to sbatch, or as a combination of both. When the same option is given in both places, the command-line value takes precedence.
The general form of the sbatch command:
sbatch [OPTIONS(0)...] [ : [OPTIONS(N)...]] script(0) [args(0)...]
Ex.
sbatch -N1 -t 4:00:00 myScript.sh
cat myScript.sh
#!/bin/bash
#SBATCH --job-name=myJob # job name
#SBATCH --ntasks=10 # number of tasks across all nodes
#SBATCH --partition=general # name of partition to submit job
#SBATCH --time=01:00:00 # Run time (D-HH:MM:SS)
#SBATCH --output=job-%j.out # Output file. %j is replaced with job ID
#SBATCH --error=job-%j.err # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu
...
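This job submission requests one node (-N1) and a walltime of 4 hours (-t 4:00:00) as command-line options. Other options are specified in the job script as #SBATCH directives. Because command-line options take precedence, the job runs with a 4-hour limit rather than the 01:00:00 given in the script.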
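The precedence rule can be tried with a small sketch like the one below. The file name, job name, and resource values are illustrative assumptions, not site defaults. The script only calls sbatch when the command is available, so it can be read or dry-run on a machine without Slurm.

```shell
#!/bin/bash
# Sketch only: file name, job name, and resource values are illustrative.

job_script="override_demo.sh"

# Write a job script whose header asks for 1 hour of walltime.
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH --job-name=overrideDemo   # job name
#SBATCH --ntasks=1                # one task
#SBATCH --time=01:00:00           # walltime requested in the script
echo "Running on $(hostname)"
EOF

# Walltime given on the command line; it overrides the #SBATCH --time line.
cli_time="04:00:00"

if command -v sbatch >/dev/null 2>&1; then
    # --parsable makes sbatch print just the job ID (plus ";cluster" if set).
    job_id=$(sbatch --parsable --time="$cli_time" "$job_script")
    echo "Submitted job ${job_id} with a ${cli_time} time limit"
else
    # Slurm is not installed here: show the command instead of running it.
    echo "Would run: sbatch --parsable --time=${cli_time} ${job_script}"
fi
```

Capturing the job ID this way is convenient when later commands such as scontrol show job or scancel need it.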
Common Slurm Job Submission Options¶
Description | Long Option | Short Option | Default | Moab/Torque |
---|---|---|---|---|
Job Name | --job-name | -J | name of job script | -N |
Time limit for job in D-HH:MM:SS | --time | -t | 2-00:00:00 | -l walltime |
Number of Nodes requested | --nodes | -N | 1 | -l nodes |
Number of processors | --ntasks | -n | 1 | |
Partition | --partition | -p | general | -q |
Job Array | --array | -a | | -t |
Output File | --output | -o | | -o |
Error File | --error | -e | | -e |
Memory | --mem=[M,G] | | | -l mem=[MB,GB] |
Default Memory per Partition¶
If you do not specify the amount of memory for a job, the job receives the default memory provided by the scheduler. The default memory for each partition is listed below.
Partition | Default Memory | Total |
---|---|---|
General | 3 GB | 192 GB |
Bigmem2 | 7 GB | 384 GB |
Bigmem4 | 15 GB | 768 GB |
Amd | 1 GB | 256 GB |
Gpu2 | 7 GB | 384 GB |
Gpu4 | 15 GB | 768 GB |
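When the default is not enough, memory can be requested explicitly. The following is a minimal sketch of a job script that does so; the job name, the 8 GB figure, and the short time limit are illustrative assumptions. SLURM_MEM_PER_NODE is the variable Slurm exports when --mem is used, and the script falls back to a placeholder when run outside a job.

```shell
#!/bin/bash
#SBATCH --job-name=memDemo        # job name (illustrative)
#SBATCH --partition=general       # partition from the table above
#SBATCH --ntasks=1                # one task
#SBATCH --mem=8G                  # ask for 8 GB instead of the partition default
#SBATCH --time=00:10:00           # short run time for a test

# Inside a job that used --mem, Slurm exports SLURM_MEM_PER_NODE.
# Outside a job the variable is unset, so fall back to a placeholder.
mem_request="${SLURM_MEM_PER_NODE:-unset}"
echo "Memory request seen by the job: ${mem_request}"
```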
Slurm Job States¶
Jobs pass through several states during submission and execution. The most common job states are listed below with their codes and descriptions.
Job State | Code | Description |
---|---|---|
Pending | PD | Job is awaiting resource allocation |
Running | R | Job currently has an allocation and is running |
Completing | CG | Job is in the process of completing. Some processes may still be active |
Cancelled | CA | Job was cancelled by user or admin |
Failed | F | Job terminated with failure |
Stopped | ST | Job has an allocation but execution has been stopped |
Configuring | CF | Job has been allocated resources but is waiting for them to become ready |
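As a convenience, the codes above can be turned into readable text with a small helper. This is only a sketch: state_description is a hypothetical function name, and the squeue call (with the standard %i and %t format fields for job ID and compact state) runs only when squeue is installed.

```shell
#!/bin/bash
# Map a compact Slurm state code to the description from the table above.
state_description() {
    case "$1" in
        PD) echo "Pending: job is awaiting resource allocation" ;;
        R)  echo "Running: job has an allocation and is running" ;;
        CG) echo "Completing: job is finishing, some processes may still be active" ;;
        CA) echo "Cancelled: job was cancelled by user or admin" ;;
        F)  echo "Failed: job terminated with failure" ;;
        ST) echo "Stopped: job has an allocation but execution has been stopped" ;;
        CF) echo "Configuring: allocated resources are not ready yet" ;;
        *)  echo "Unknown state code: $1" ;;
    esac
}

# List the current user's jobs with a readable state, if Slurm is present.
if command -v squeue >/dev/null 2>&1; then
    me="${USER:-$(id -un)}"
    squeue -h -u "$me" -o "%i %t" | while read -r id code; do
        echo "${id}: $(state_description "$code")"
    done
else
    state_description PD
fi
```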
Partitions¶
In Slurm, partitions play an important role in job submission. A partition logically groups a type of capacity and gives it special functionality. In Easley, there are high-level partitions based on the node type: general, bigmem2, bigmem4, amd, gpu2 and gpu4. The general partition consists of 126 standard nodes, the bigmem2 partition consists of 21 bigmem2 nodes, and so on as defined in the Locations and Resources section. All users can use these high-level partitions on a first-come, first-served basis. However, there is no priority access to these partitions.
There are also partitions based on a PI’s purchased capacity. Only the PI and their sponsored accounts can use these partitions. Not only do they have exclusive access to them, but they also have priority access. Jobs submitted with a PI partition will preempt, if needed, any job running on the same capacity not using that PI partition. Note that the capacity in the PI partitions overlaps the capacity in the high-level partitions in a one-to-one fashion.
To illustrate, let’s say that nodeX is in the general partition. For sake of example, let’s say this same nodeX is also in the PI partition ‘mylab_std’. So both partitions contain, or overlap, nodeX. A user who does not have access to the ‘mylab_std’ partition submits job A using the general partition and it runs on nodeX. Later a user who does have access to the ‘mylab_std’ partition submits job B using the ‘mylab_std’ partition. Since job B uses the ‘mylab_std’ partition that has priority access, it preempts job A and runs on nodeX. Job A is requeued and waits for available resources in order to run.
Partition Types¶
Type | Priority | Availability | Preemption | Example |
---|---|---|---|---|
Dedicated | 1 | Lab group | Cannot be preempted. | hpcadmin_std |
Department | 2 | Department members | Can be preempted. | chen_bg2 |
Investor | 3 | Investors in special capacity | Can be preempted. | investor_amd |
Community | 4 | All Easley users | Can be preempted. | general |
Partition Commands¶
To view all available partitions on the cluster, use the sinfo command. This command also reports details such as the number of available nodes, CPUs per node, and the walltime limit.
User Command | Slurm Command |
---|---|
Show partition information | sinfo |
Show nodes (idle) | sinfo -t idle |
Show nodes (allocated) | sinfo -t alloc |
Show nodes (by partition) | sinfo -p <partition name> |
Show max cpus per node | sinfo -o %c -p <partition name> |
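Several of these fields can be combined into one query. The sketch below assumes a hypothetical helper named partition_summary; the format codes (%P partition, %D node count, %c CPUs per node, %l time limit) are standard sinfo fields. Without Slurm installed it only prints the command it would run.

```shell
#!/bin/bash
# Print a one-line-per-state summary of a partition.
partition_summary() {
    # %P = partition, %D = node count, %c = CPUs per node, %l = time limit
    fmt="%P %D %c %l"
    if command -v sinfo >/dev/null 2>&1; then
        sinfo -p "$1" -o "$fmt"
    else
        echo "Would run: sinfo -p $1 -o \"$fmt\""
    fi
}

# Default to the general partition when no name is given.
partition_summary "${1:-general}"
```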
Monitor Jobs¶
To display information about active, eligible and blocked jobs, use the squeue command:
squeue
Option | Description |
---|---|
squeue -l | Displays all jobs in long format |
squeue -t R | Displays running jobs |
sinfo -t alloc | Displays nodes allocated to jobs |
squeue --start -j <jobid> | Displays the estimated time a job will begin |
To display detailed job state information and diagnostic output for a specified job, use the scontrol show job <job id> command:
scontrol show job <job id>
To cancel a job:
scancel <job id>
To prevent a pending job from starting:
scontrol hold <job id>
To release a previously held job:
scontrol release <job id>
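These commands can be combined in a script that waits for a job to leave the queue. The following is a minimal sketch: job_state and wait_for_job are hypothetical helper names, and NOT_IN_QUEUE is a placeholder string chosen here, not a Slurm state. The squeue options used (-h to drop the header, -j to select a job, -o "%T" for the full state name) are standard.

```shell
#!/bin/bash
# Print the state of one job, or NOT_IN_QUEUE once squeue no longer lists it.
job_state() {
    # -h drops the header, -j selects the job, %T prints the full state name.
    state=$(squeue -h -j "$1" -o "%T" 2>/dev/null)
    # squeue prints nothing when the job is not in the queue.
    echo "${state:-NOT_IN_QUEUE}"
}

# Poll until the job leaves the queue.
# $1 = job ID, $2 = seconds between checks (default 30).
wait_for_job() {
    while [ "$(job_state "$1")" != "NOT_IN_QUEUE" ]; do
        sleep "${2:-30}"
    done
    echo "Job $1 is no longer in the queue"
}
```

For example, wait_for_job 123456 60 checks job 123456 once a minute. Keep the interval generous so the polling does not load the scheduler.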
Monitor Resources¶
All jobs require resources to run. This includes memory and cores on compute nodes as well as resources like file system space for output files. These commands help determine what resources are available for your jobs.
To check the status of your dedicated capacity.
my_capacity
To display idle capacity by partition.
sinfo -t idle
To display pending jobs on a specific partition.
squeue -t PD -p <partition>
To check your disk space usage.
checkquota
To see if you have files that are scheduled to expire soon
expiredfiles
Testing¶
Interactive Job Submission¶
Interactive jobs may assist with troubleshooting and testing performance. Typing in the following will log you into a shell on a compute node:
srun --pty /bin/bash
You can also specify the resources needed:
srun -N1 -n1 --time=01:00:00 --pty bash
Here we are requesting one core on a single node for one hour to run our job interactively. Slurm reads a two-field value such as 01:00 as minutes:seconds, so a one-hour limit needs all three fields. Next, check that the modules needed for the job are loaded:
module list
Load any additional modules needed before running the program
module load samtools
You can exit the interactive session by typing in the following
exit
Job Submission Examples¶
Command-Line Examples¶
Example 1:
This job submission requests 40 processors on two nodes for the job ‘test.sh’ and 20 hours of walltime. It will also email ‘nouser’ when the job begins, ends, or fails. Since no partition is specified, the job runs in the general partition, which is the default.
sbatch -N2 -n40 -t20:00:00 --mail-type=begin,end,fail --mail-user=nouser@auburn.edu test.sh
Example 2:
This job requests a node with 200MB of available memory in the general partition. Since no walltime is indicated, the job will get the default walltime.
sbatch -pgeneral --mem=200M <job script>
SBATCH Examples¶
Serial Job Submission
For jobs that require only one CPU-core…
#!/bin/bash
#SBATCH --job-name=testJob # job name
#SBATCH --nodes=1 # node(s) required for job
#SBATCH --ntasks=1 # number of tasks across all nodes
#SBATCH --partition=general # name of partition
#SBATCH --time=01:00:00 # Run time (D-HH:MM:SS)
#SBATCH --output=test-%j.out # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu
Multithread Job Submission
For jobs that require the use of multiple cores
Ex.
#!/bin/bash
#SBATCH --job-name=testJob # job name
#SBATCH --nodes=1 # node(s) required for job
#SBATCH --ntasks=10 # number of tasks across all nodes
#SBATCH --partition=general # name of partition to submit job
#SBATCH --time=01:00:00 # Run time (D-HH:MM:SS)
#SBATCH --output=test-%j.out # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err # Error file. %j is replaced with job ID
#SBATCH --mail-type=ALL # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu
In this case, 10 cores will be allocated on one node. Note: if you do not set the node count to 1, the cores may be allocated across multiple nodes, especially if the request exceeds the number of cores available on one node.
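For a single program that runs many threads, a commonly used alternative is to request one task with several cores through --cpus-per-task. This is not the form used in the example above; it is a sketch under the assumption that the program is OpenMP-based and reads OMP_NUM_THREADS. The job name and the commented-out executable are hypothetical.

```shell
#!/bin/bash
#SBATCH --job-name=threadDemo     # job name (illustrative)
#SBATCH --nodes=1                 # threads must share one node
#SBATCH --ntasks=1                # a single process...
#SBATCH --cpus-per-task=10        # ...with 10 cores for its threads
#SBATCH --partition=general       # name of partition to submit job
#SBATCH --time=01:00:00           # Run time (D-HH:MM:SS)

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given.
# Fall back to 1 so the script also runs outside a job.
threads="${SLURM_CPUS_PER_TASK:-1}"

# OpenMP programs read OMP_NUM_THREADS to size their thread pool.
export OMP_NUM_THREADS="$threads"
echo "Starting with ${OMP_NUM_THREADS} threads"

# ./my_threaded_program    # hypothetical executable, replace with your own
```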
Multinode Job Submission
For jobs that require the use of multiple nodes and multiple cores.
#!/bin/bash
#SBATCH --job-name=testJob # job name
#SBATCH --nodes=2 # node(s) required for job
#SBATCH --ntasks-per-node=10 # number of tasks per node
#SBATCH --partition=general # name of partition to submit job
#SBATCH --output=test-%j.out # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err # Error file. %j is replaced with job ID
#SBATCH --time=01:00:00 # Run time (D-HH:MM:SS)
#SBATCH --mail-type=ALL # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu
In this case, 20 cores will be allocated, 10 tasks per node.
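One way to confirm the task layout is to launch a trivial command with srun inside the job. The sketch below assumes it is submitted with sbatch; the short time limit is illustrative. Outside a Slurm job it only prints a hint, so it is safe to run on a login node.

```shell
#!/bin/bash
#SBATCH --nodes=2                 # node(s) required for job
#SBATCH --ntasks-per-node=10      # number of tasks per node
#SBATCH --partition=general       # name of partition to submit job
#SBATCH --time=00:10:00           # short run time for a test

if [ -n "${SLURM_JOB_ID:-}" ] && command -v srun >/dev/null 2>&1; then
    # Inside the allocation, srun starts one copy of the command per task.
    # Counting host names shows how many tasks landed on each node.
    srun hostname | sort | uniq -c
    launch_mode="srun"
else
    echo "Not inside a Slurm job; submit this script with sbatch to see the layout"
    launch_mode="dry-run"
fi
```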
GPU Job Submission
#!/bin/bash
#SBATCH --job-name=testJob # job name
#SBATCH --nodes=1 # node(s) required for job
#SBATCH --ntasks=1 # number of tasks across all nodes
#SBATCH --partition=gpu2 # name of partition to submit job(gpu2 or gpu4)
#SBATCH --gres=gpu:tesla:1 # specifies the number of gpu devices needed
#SBATCH --output=test-%j.out # Output file. %j is replaced with job ID
#SBATCH --error=test_error-%j.err # Error file. %j is replaced with job ID
#SBATCH --time=01:00:00 # Run time (D-HH:MM:SS)
#SBATCH --mail-type=ALL # will send email for begin,end,fail
#SBATCH --mail-user=user@auburn.edu
Note: For more information about GPU jobs, consult the GPU Quick Start section of the documentation.
Job Arrays¶
Job arrays are useful for submitting and managing a large number of similar jobs. As an example, job arrays are convenient if a user wishes to run the same analysis on 100 different files. Slurm provides job array environment variables that allow multiple versions of input files to be easily referenced.
A job array can be submitted by adding the --array option to an sbatch submission:
sbatch --array=0-4 job_script.sh
Here 0-4 specifies the task indices, so the array has five tasks numbered 0 through 4. You can also set the array range within your script:
#!/bin/bash
#SBATCH --job-name=Array
#SBATCH --output=array-%A.txt
#SBATCH --error=array-%A.txt
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --array=0-4
Then submit the script:
sbatch job_script.sh
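Each array task can pick its own input through the SLURM_ARRAY_TASK_ID environment variable, which Slurm sets to the task's index. The following is a minimal sketch; the input_N.txt naming scheme and the helper name input_for_task are assumptions made for illustration.

```shell
#!/bin/bash
#SBATCH --job-name=ArrayInput     # job name (illustrative)
#SBATCH --nodes=1                 # node(s) required for job
#SBATCH --ntasks=1                # one task per array element
#SBATCH --array=0-4               # task indices 0 through 4

# Hypothetical naming scheme: input_0.txt ... input_4.txt
input_for_task() {
    printf 'input_%s.txt\n' "$1"
}

# SLURM_ARRAY_TASK_ID is set by Slurm for each array task.
# Default to 0 so the script can also be tried outside a job.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
input_file=$(input_for_task "$task_id")

echo "Task ${task_id} would process ${input_file}"
```

With this pattern one script covers all five inputs, and changing the --array range is enough to cover more files.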
Naming output and error files¶
In order to produce output and error files for each array task, you will need to specify both the job ID and task ID. Slurm uses %A for the master job ID and %a for the task ID.
#SBATCH --output=Array-%A_%a.out
#SBATCH --error=Array-%A_%a.error
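For an array submitted with --array=0-4, the result will be one output file per task, with a matching .error file for each:
Array-JOBID_0.out
Array-JOBID_1.out
Array-JOBID_2.out
Array-JOBID_3.out
Array-JOBID_4.out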
Note: If you only use %A, all array tasks will write to a single file.
Deleting job arrays and tasks¶
To delete all array tasks, use scancel with the job ID:
scancel JOBID
To delete a single array task, specify the task ID:
scancel JOBID_1