Slurm batch jobs – part I
Week 7 – Lecture A
1 Introduction
1.1 Learning goals
This week
As mentioned last week, when you learned about shell scripts:
The end goal is to be able to submit shell scripts as “batch jobs” at OSC, which e.g. allows you to run them simultaneously many times! This is extremely useful because with omics analysis, it’s common to have to run the same step for many samples in parallel.
This week, we’ll cover the remaining piece in being able to run bioinformatics tools efficiently at OSC and beyond: how to submit your shell scripts as batch jobs.
This session
- Different ways to start “compute jobs”: via OnDemand, with interactive jobs, and with Slurm batch jobs
- Strategies around requesting appropriate resources for your compute jobs
- Slurm commands like `sbatch` to submit, monitor, and manage batch jobs
- How to use Slurm options to request specific resources for your compute jobs
1.2 Getting ready
- At https://ondemand.osc.edu, start a VS Code session in `/fs/ess/PAS2880/users/$USER`
- Create a `week07` dir and navigate there in the terminal
- Create a `scripts` dir within `week07`
- You’ll need two scripts you made last week. Copy those (or if you somehow don’t have these files, create new files and copy the code from the boxes below):

```bash
cp ../../week06/scripts/printname.sh scripts/
cp ../../week06/scripts/fastqc.sh scripts/
```
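In case you haven’t created those dirs yet, the commands for the steps above might look as follows (a sketch, assuming you start in your personal dir `/fs/ess/PAS2880/users/$USER`):

```bash
mkdir week07        # Create the week07 dir
cd week07           # Navigate there in the terminal
mkdir scripts       # Create a scripts dir within week07
```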
printname.sh script (Click to expand)
The code below should be saved in scripts/printname.sh:
```bash
#!/bin/bash
set -euo pipefail

first_name=$1
last_name=$2

echo "This script will print a first and a last name"
echo "First name: $first_name"
echo "Last name: $last_name"
```

fastqc.sh script (Click to expand)
The code below should be saved in scripts/fastqc.sh:
```bash
#!/bin/bash
set -euo pipefail

# Load the OSC module for FastQC
module load fastqc/0.12.1

# Copy the placeholder variables
fastq="$1"
outdir="$2"

# Initial logging
echo "# Starting script fastqc.sh"
date
echo "# Input FASTQ file: $fastq"
echo "# Output dir: $outdir"
echo

# Create the output dir if needed
mkdir -p "$outdir"

# Run FastQC
fastqc --outdir "$outdir" "$fastq"

# Final logging
echo
echo "# Used FastQC version:"
fastqc --version
echo
echo "# Successfully finished script fastqc.sh"
date
```

2 Compute jobs overview
Automated scheduling software allows hundreds of people with different requirements to effectively and fairly access compute nodes at supercomputers. OSC uses Slurm (Simple Linux Utility for Resource Management) for this.
As you’ve learned, a reservation of resources on compute nodes is called a compute job. Here are the main ways to start a compute job at OSC:
- “Interactive Apps” — Run programs with GUIs (e.g. VS Code or RStudio) in your browser through the OnDemand website.
- Interactive shell jobs — Start an interactive shell on a compute node.
- Batch (non-interactive) jobs — Run a script on a compute node “remotely”: without going to that node yourself.
We’ve already worked a lot with the VS Code Interactive App, and the self-study material at the bottom of this page will cover interactive shell jobs. What we’ll focus on in this session are batch jobs.
3 Basics of Slurm batch jobs
When you submit a batch job, you ask the Slurm scheduler to run a script “out of sight” on a compute node. While that script runs on a compute node, you will stay in your current shell at your current node. After you submit a batch job, it will continue to run even if you log off from OSC and shut down your computer.
3.1 The sbatch command
You can use Slurm’s sbatch command to submit a batch job. But first, let’s recall how we’ve run shell scripts so far:
```bash
bash scripts/printname.sh Jane Doe
```
```
This script will print a first and a last name
First name: Jane
Last name: Doe
```
The above command ran the script on whatever node you are on, and printed output to the screen. To instead submit the script to the Slurm queue, start by simply replacing bash with sbatch:
```bash
sbatch scripts/printname.sh Jane Doe
```
```
srun: error: ERROR: Job invalid: Must specify account for job
srun: error: Unable to allocate resources: Unspecified error
```
However, as the above error message (“Must specify account for job”) informs us, you need to indicate which OSC Project (or, as Slurm puts it, “account”) you want to use for this compute job. Use the `--account=` option to do this:
```bash
sbatch --account=PAS2880 scripts/printname.sh Jane Doe
```
```
Submitted batch job 12431935
```
This output line means your job was successfully submitted (no further job output will be printed to your screen — more about that below). The job has a unique identifier among all compute jobs by all users at OSC, and you can use this number to monitor and manage it. Each of us will therefore see a different job number pop up.
After submitting a batch job, you immediately get your prompt back. The job will run outside of your immediate view, and you can continue doing other things in the shell while it does, or even log off. This behavior allows you to submit many jobs at the same time: you don’t have to wait for other jobs to finish, or even to start!
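As a preview of where this is going: once we run a tool like FastQC for many samples, submitting one batch job per file could look something like the sketch below (the `data/fastq` and `results/fastqc` dir names are hypothetical):

```bash
# Submit a separate batch job for each FASTQ file in data/fastq
for fastq_file in data/fastq/*.fastq.gz; do
    sbatch --account=PAS2880 scripts/fastqc.sh "$fastq_file" results/fastqc
done
```

All of these jobs can be pending and running at the same time, each writing to its own Slurm log file.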
sbatch options and script arguments
As was implicit in the command above, we can use sbatch options and script arguments in one command like so:
```bash
sbatch [sbatch-options] myscript.sh [script-arguments]
```

Depending on the details of the script itself, any combination of sbatch options and script arguments is possible:
```bash
# [Don't run this - hypothetical examples]
sbatch scripts/printname.sh                             # No options/arguments for either
sbatch scripts/printname.sh Jane Doe                    # Script arguments but no sbatch options
sbatch --account=PAS2880 scripts/printname.sh           # sbatch option but no script arguments
sbatch --account=PAS2880 scripts/printname.sh Jane Doe  # Both sbatch option and script arguments
```

Just make sure you use the correct order: e.g., don’t type sbatch options after the name of the script. (Also, it is possible to omit the `--account` option, as shown above, when you specify this option inside the script. We’ll see this later.)
3.2 Where does the script’s output go?
Above, we saw that when you ran printname.sh “directly” with bash, the script’s output was printed to the screen, whereas when you submitted it as a batch job, only Submitted batch job <job-number> was printed to screen. In the latter case, where did this output go?
It ended up in a file called slurm-<job-number>.out (e.g., slurm-12431935.out; since each job number is unique to a given job, each file has a different number). We’ll call this type of file a Slurm log file.
Getting this output in log files rather than printed to screen may seem inconvenient. Can you think of any reasons why we might not want batch job output printed to screen, even if that were possible? (Click for the answer)
There are several reasons, such as:
- If you log off after submitting a batch job, any output printed to screen would be lost.
- The power of submitting batch jobs is that you can submit many at once — e.g. one per sample, running the same script. If the output from all those scripts ends up on your screen, things become a big mess, and you have no lasting record of what happened.
If you run ls, you should see a Slurm log file for the job you just submitted:
```bash
ls
```
```
scripts  slurm-12431935.out
```
Let’s take a look at its contents:
```bash
cat slurm*
```
```
This script will print a first and a last name
First name: Jane
Last name: Doe
```
This file contains the script’s output that was printed to screen when we ran it with bash – nothing more and nothing less.¹
Scripts run as batch jobs start in the directory that they were submitted from: that is, the working directory remains the same, and you shouldn’t have to make special adjustments to paths.
Additionally, as you’ve seen, Slurm log files will (by default) be created in the dir you submitted the job from.
Two types of output files
There is an important distinction between two general types of output that scripts, commands, and programs may have:
- Output that is printed to screen. The technical terms for such output are “standard output” for non-error output and “standard error” for error output.
  - We’ve seen that most Unix commands print their output to the screen by default.
  - Other programs, like bioinformatics tools, commonly print “logging”-type output to the screen, as you’ve seen with FastQC. But some programs will by default also print their main results to the screen. (This can sometimes be changed with a program’s options, and otherwise, you can always redirect (`>`) output to a file.)
- Output that is written to files.
To summarize what happens to these when you submit a script as a batch job instead of running it directly:
- A script’s standard output and standard error will be written to a Slurm log file when you submit the script with `sbatch`.
- Output that commands inside the script write to files (via redirection or otherwise) will end up in the exact same files regardless of how you run the script.
Your printname.sh script only had the first type of output, but scripts typically have both, and we’ll see examples of that below.
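To make the distinction concrete right away, here is a minimal sketch of a script with both types of output (the file name `results.txt` is just an example):

```bash
#!/bin/bash
# Standard output: printed to screen when run with 'bash',
# but written to the Slurm log file when submitted with 'sbatch'
echo "This is logging-type output"

# File output: ends up in results.txt no matter how the script is run
echo "This is a main result" > results.txt
```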
Cleaning up the Slurm logs
When submitting batch jobs, your working dir can easily become a confusing mess of anonymous-looking Slurm log files. These two strategies help to prevent this:
- Changing the default Slurm log file name to include a one- or two-word description of the job/script (see below).
- Cleaning up your Slurm log files, by:
  - Removing them when no longer needed, as is e.g. appropriate for our current Slurm log file.
  - Moving them to the same location as other outputs by that script. This is often appropriate after you’ve run a bioinformatics tool, since the Slurm log file may contain some info you’d like to keep. For example, you can move Slurm log files for jobs that ran FastQC, and produced outputs in `results/fastqc`, to a dir `results/fastqc/logs` (see the sketch below).
In this case, we’ll simply remove the Slurm log file, as it has no information that we need to keep:
```bash
rm -v slurm*
```
```
removed slurm-12431935.out
```
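And as a sketch of the “moving” strategy from the list above, assuming you had Slurm log files from jobs that ran FastQC with outputs in `results/fastqc`:

```bash
# Create a logs dir among the FastQC results and move the Slurm logs there
mkdir -p results/fastqc/logs
mv slurm-*.out results/fastqc/logs/
```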
3.3 Adding sbatch options in scripts
The --account= option is just one of many options you can use when submitting a compute job, but is the only required one. This is because defaults exist for all other options, such as the amount of time (1 hour) and the number of cores (1 core).
Instead of adding these options after the sbatch command when submitting the script, you can also add them inside the script. This is a useful alternative because:
- You’ll often want to specify several options, which could otherwise lead to very long `sbatch` commands.
- It allows you to store a script’s typical Slurm options as part of the script, so you don’t have to remember them.
These options are added in the script using another type of special comment line (akin to the shebang #!/bin/bash line) that is marked by #SBATCH. Just like the shebang line, #SBATCH line(s) should be located at the top of the script.
Let’s add one such line to the printname.sh script, such that the first few lines read:
```bash
#!/bin/bash
#SBATCH --account=PAS2880

set -euo pipefail
```

So, the equivalent of adding `--account=PAS2880` after sbatch on the command line is a line in your script that reads `#SBATCH --account=PAS2880`.
Now, you are able to run the sbatch command without options (which failed earlier):
```bash
sbatch scripts/printname.sh Jane Doe
```
```
Submitted batch job 12431942
```
sbatch option precedence!
Any sbatch option provided on the command line will override the equivalent option provided inside the script. This is sensible because it allows you to provide “defaults” inside the script, and change one or more of those when needed “on the go” on the command line.
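For example, the following hypothetical command would override the `#SBATCH --account=PAS2880` line we just added to the script (PAS0471 here stands in for another project you might have access to):

```bash
# The command-line option takes precedence over the in-script #SBATCH line,
# so this job would be charged to project PAS0471 rather than PAS2880
sbatch --account=PAS0471 scripts/printname.sh Jane Doe
```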
#SBATCH lines elsewhere
Because #SBATCH lines are special comment lines, they will simply be ignored (and not throw any errors) when you run a script with such lines in other contexts: for example, when not running it as a batch job at OSC, or even when running it on a computer without Slurm installed.
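To give you an idea of what a fuller script header can look like, here is a sketch with several #SBATCH lines, including the log-renaming option mentioned earlier. The specific values are just examples, and options like `--time` will be covered properly in the next lecture:

```bash
#!/bin/bash
#SBATCH --account=PAS2880
#SBATCH --time=00:30:00                   # Example: reserve 30 minutes instead of the 1-hour default
#SBATCH --job-name=printname              # Example: a descriptive job name, shown in squeue
#SBATCH --output=slurm-printname-%j.out   # Example: rename the log file ('%j' becomes the job number)

set -euo pipefail
```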
4 Monitoring batch jobs
Real batch jobs for your research projects may run for a while, and you may also be submitting many jobs at once. Finally, longer-running jobs and those that ask for many cores sometimes remain queued for a while before they start. For these reasons, it’s important to know how you can monitor your batch jobs.
4.1 A sleepy script for practice
We’ll use another short shell script to practice monitoring and managing batch jobs. First create a new file:
```bash
touch scripts/sleep.sh
```

Open the file in the VS Code editor and copy the following into it:
```bash
#!/bin/bash
#SBATCH --account=PAS2880

echo "I will sleep for 30 seconds" > sleep.txt
sleep 30s
echo "I'm awake! Successfully finished script sleep.sh"
```

Exercise: Batch job output recap
Predict what would happen if you submit the sleep.sh script as a batch job using sbatch scripts/sleep.sh:
- How many output files will this batch job produce?
- What will be in each of those files?
- In which directory will the file(s) appear?
- In terms of output files, what would be different if we instead ran the script using `bash scripts/sleep.sh`?
Then, test your predictions by running the script.
Click for the solutions
The job will produce 2 files:

- `slurm-<job-number>.out`: The Slurm log file, containing output normally printed to screen.
- `sleep.txt`: Containing output that was redirected to this file in the script.

Those files will contain the following:

- `slurm-<job-number>.out`: “I’m awake! Successfully finished script sleep.sh”
- `sleep.txt`: “I will sleep for 30 seconds”

Both files will end up in your current working directory. Slurm log files always go to the directory from which you submitted the job. Slurm jobs also run from the directory from which you submitted your job, and since we redirected the output simply to `sleep.txt`, that file was created in our working directory.

If we had run the script directly with `bash`, `sleep.txt` would have also been created with the same content, but “I’m awake! Successfully finished script sleep.sh” would have been printed to screen instead of ending up in a Slurm log file.
Run the script and check the outputs:
```bash
sbatch scripts/sleep.sh
```
```
Submitted batch job 27935840
```

```bash
cat sleep.txt
```
```
I will sleep for 30 seconds
```

```bash
cat slurm-27935840.out
```
```
I'm awake! Successfully finished script sleep.sh
```
4.2 Checking a job’s status
Batch job behavior
After you submit a job, it may initially be waiting to get resources allocated to it: in other words, the job may be pending. Eventually, and often very quickly, the job will start running. You’ve seen this process with the VS Code Interactive App job as well.
Whereas Interactive App jobs will keep running until they’ve reached the end of the allocated time,² batch jobs will stop as soon as the script has finished. And if the script is still running when the job runs out of its allocated time, it will be killed (stopped) right away.
The squeue command
You can check the status of your batch jobs using the squeue Slurm command – try the following:
```bash
squeue -u $USER -l
```
```
Thu Apr 4 15:47:51 2025
JOBID    PARTITION  NAME      USER    STATE    TIME  TIME_LIMI  NODES  NODELIST(REASON)
23640814 condo-osu  ondemand  jelmer  RUNNING  6:34  2:00:00    1      p0133
```
In the command above:
- You specify your username with the `-u` option (without this, you’d see everyone’s jobs!). In this example, I used the environment variable `$USER`, which contains your username, but you can simply type your username if that’s easier.
- The option `-l` (lowercase L, “long”) produces the more verbose output shown. Without it, the output will be a bit more cryptic.
In the squeue output, following a line with the date & time and a header line, you should see information about a single compute job, as shown above: this is the Interactive App job that runs VS Code – that’s not a batch job, but it is a compute job, and all compute jobs are listed.
The following pieces of information about each job are listed:
| Column | Explanation |
|---|---|
| JOBID | The job ID number |
| PARTITION | The type of queue (usually auto-assigned and not of interest) |
| NAME | The name of the job (by default the name of the script when submitting a script) |
| USER | The username of the person who submitted the job |
| STATE | The job’s state, usually either PENDING or RUNNING; finished jobs do not appear |
| TIME | For how long the job has been running (here in “minutes:seconds” format) |
| TIME_LIMIT | The amount of time you reserved for the job (here in “hours:minutes:seconds” format) |
| NODES | The number of nodes reserved for the job |
| NODELIST(REASON) | When running: the ID of the node on which it is running; when pending: why the job is pending |
squeue example
Now, let’s see a batch job in the squeue listing. Start by submitting the sleep.sh script as a batch job:
```bash
sbatch scripts/sleep.sh
```
```
Submitted batch job 12431945
```
If you’re quick enough, you may be able to catch the job’s STATE as PENDING before it starts:
```bash
squeue -u $USER -l
```
```
Thu Apr 4 15:48:26 2025
JOBID    PARTITION  NAME      USER    STATE    TIME  TIME_LIMI  NODES  NODELIST(REASON)
12520046 serial-40  sleep.sh  jelmer  PENDING  0:00  1:00:00    1      (None)
23640814 condo-osu  ondemand  jelmer  RUNNING  7:12  2:00:00    1      p0133
```
But soon enough it should read RUNNING:
```bash
squeue -u $USER -l
```
```
Thu Apr 4 15:48:39 2025
JOBID    PARTITION  NAME      USER    STATE    TIME  TIME_LIMI  NODES  NODELIST(REASON)
12520046 condo-osu  sleep.sh  jelmer  RUNNING  0:12  1:00:00    1      p0133
23640814 condo-osu  ondemand  jelmer  RUNNING  8:15  2:00:00    1      p0133
```
The script should finish after 30 seconds (because your command was sleep 30s), after which the job will immediately disappear from the squeue listing, because only pending and running jobs are shown:
```bash
squeue -u $USER -l
```
```
Thu Apr 4 15:49:26 2025
JOBID    PARTITION  NAME      USER    STATE    TIME  TIME_LIMI  NODES  NODELIST(REASON)
23640814 condo-osu  ondemand  jelmer  RUNNING  9:02  2:00:00    1      p0133
```
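If you don’t want to keep rerunning squeue by hand while waiting for a job, one convenience (not part of Slurm itself, but available on most Linux systems) is the standard `watch` command, which reruns a command at a regular interval:

```bash
# Rerun the squeue command every 10 seconds (press Ctrl-C to quit)
watch -n 10 squeue -u $USER -l
```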
Checking a job’s output files
Whenever you’ve run a script as a batch job, even if you’ve been monitoring it with squeue, you should also make sure it ran successfully. You can do this by checking the output file(s). As mentioned above, you’ll usually have two types of output from a batch job:
- A Slurm log file with the script’s standard output and standard error, which would have been printed to screen if the job hadn’t been submitted with `sbatch` (typically: logging-type output and errors)
- Output file(s) created inside the script (typically: the main results)
And as you saw in the exercise above, this is also the case for the output of our sleepy script:
The output file that the code in the script directly produced:
```bash
cat sleep.txt
```
```
I will sleep for 30 seconds
```

The Slurm log file:

```bash
cat slurm-12520046.out
```
```
I'm awake! Successfully finished script sleep.sh
```
Let’s keep things tidy and remove the script’s outputs:
```bash
rm slurm* sleep.txt
```

If you delete a Slurm log file for a job that is still running, the file will not be recreated when the job produces more logging output later on. That means that if you accidentally do this, and the logging output is of key importance for interpreting other outputs or making sure the job ran successfully, you are better off canceling the job entirely and trying again. 😕
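Relatedly, if you want to follow a running job’s log file as it grows, rather than cat-ing it over and over, the standard `tail` command with its `-f` (“follow”) option is one way to do so:

```bash
# Print the log file and keep printing new lines as they are appended
# (press Ctrl-C to stop following)
tail -f slurm-<job-number>.out
```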
4.3 Canceling jobs
Sometimes, you want to cancel one or more jobs. For example, you may realize you made a mistake in the script or used the wrong input files as arguments. You can cancel jobs that are either pending or running using scancel:
```bash
# [Examples - DON'T run this: the second command would cancel your VS Code job]
# Cancel a specific job:
scancel 2979968

# Cancel all your running and queued jobs (careful with this!):
scancel -u $USER
```

Use `squeue`’s `-t` option to restrict the type of jobs you want to show. For example, to only show running and not pending jobs:

```bash
squeue -u $USER -t RUNNING
```

You can see more details about any running or finished job, including the amount of time it ran for:
```bash
scontrol show job <jobID>
```
```
UserId=jelmer(33227) GroupId=PAS0471(3773) MCS_label=N/A
Priority=200005206 Nice=0 Account=PAS2880 QOS=pitzer-default
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:02:00 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2020-12-14T14:32:44 EligibleTime=2020-12-14T14:32:44 AccrueTime=2020-12-14T14:32:44
StartTime=2020-12-14T14:32:47 EndTime=2020-12-14T15:32:47 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-12-14T14:32:47
Partition=serial-40core AllocNode:Sid=pitzer-login01:57954
[...]
```

Update directives for a job that has already been submitted (this can only be done before the job has started running):

```bash
scontrol update job=<jobID> timeLimit=5:00:00
```

Hold and release a pending (queued) job; this could e.g. be useful when you need to update an input file before the job starts running:

```bash
scontrol hold <jobID>      # Job won't start running until released
scontrol release <jobID>   # Job is free to start
```
5 Recap and what’s next
In this lecture, you’ve learned the basics of submitting scripts as Slurm batch jobs with the sbatch command, including:
- How to specify `sbatch` options on the command line or inside the script
- The basic behavior and outputs associated with batch jobs
- How to check the job queue and monitor your jobs
In the next lecture, you will learn other commonly-used sbatch options so you can reserve more time, cores, etc. for your job. You will also see some more practical examples of running batch jobs.
Footnotes
1. Unless you explicitly instruct Slurm to print regular output (“standard output”) and error messages (“standard error”) to separate files — see the box in the section on Slurm log files for more. ↩︎
2. Unless you actively “Delete” the job on the OnDemand website. ↩︎