Shell Scripting
Shell scripts (or to be slightly more precise, Bash scripts) enable us to run sets of commands non-interactively. This is especially beneficial or necessary when a set of commands:
- Takes a long time to run and/or
- Should be run many times, e.g. for different samples
Scripts form the basis for analysis pipelines and if we code things cleverly, it should be straightforward to rerun much of our project workflow:
- After removing or adding a sample
- For different parameter settings
- And possibly even for an entirely different dataset.
1 Setup
2 Script header lines and zombie scripts
2.1 Shebang line
We use a so-called “shebang” line as the first line of a script to indicate which language our script uses. More specifically, this line tells the computer where to find the binary (executable) that will run our script.
Such a line starts with #!, basically marking it as a special type of comment. After that, we provide the location of the relevant program: in our case Bash, which is located at /bin/bash on Linux and Mac computers.
#!/bin/bash
Adding a shebang line is good practice in general, and is necessary when we want to submit our script to OSC’s Slurm queue, which we’ll do tomorrow.
2.2 Bash script settings
Another line that is good practice to add to your Bash scripts changes some default settings to safer alternatives. The following two Bash default settings are bad ideas inside scripts:
First, and as we’ve seen in the previous module, Bash does not complain when you reference a variable that does not exist (in other words, it does not consider that an error).
In scripts, this can lead to all sorts of downstream problems, because you very likely tried and failed to do something with an actual variable. Even more problematically, it can lead to potentially very destructive file removal [1]:
# Using a variable, we try to remove some temporary files whose names start with temp_
temp_prefix="temp_"
rm "$tmp_prefix"* # DON'T TRY THIS! Typo: $tmp_prefix is unset, so this becomes `rm *`
# Using a variable, we try to remove a temporary directory
tempdir=output/tmp
rm -rf $tmpdir/* # DON'T TRY THIS! Typo: $tmpdir is unset, so this becomes `rm -rf /*`
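To see why this is dangerous without risking any files, here is a minimal sketch that uses echo instead of rm, so it only lists the files that the command would have targeted. The temporary directory and the file names are made up for illustration, and each command runs in a child shell (bash -c) with default settings:

```shell
# Hypothetical demo: 'echo' stands in for 'rm', so nothing is deleted
demo_dir=$(mktemp -d)
touch "$demo_dir/temp_1.txt" "$demo_dir/results.txt"

# Intended: the correctly spelled variable matches only the temporary file
intended=$(cd "$demo_dir" && bash -c 'temp_prefix="temp_"; echo "$temp_prefix"*')

# With the typo: $tmp_prefix is unset and expands to nothing, leaving just '*'
with_typo=$(cd "$demo_dir" && bash -c 'temp_prefix="temp_"; echo "$tmp_prefix"*')

echo "Intended targets:  $intended"
echo "Targets with typo: $with_typo"
```

With the typo, every file in the directory is matched, including results.txt, which we never meant to touch.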
Second, a Bash script keeps running after encountering errors. That is, if an error is encountered when running line 2 of a script, any remaining lines in the script will nevertheless be executed.
In the best case, this is a waste of computer resources, and in worse cases, it can lead to all kinds of unintended consequences. Additionally, if your script prints a lot of output, you might not notice an error somewhere in the middle if it doesn’t produce more errors downstream. But the downstream results from what we at that point might call a “zombie script” may still be completely wrong.
The following three settings will make your Bash scripts more robust and safer. With these settings, the script terminates, with an appropriate error message, if:
- set -u: An unset (non-existent) variable is referenced.
- set -e: Almost any error occurs.
- set -o pipefail: An error occurs in a shell “pipeline” (e.g., sort | uniq).
We can change all of these settings in one line in a script:
set -u -e -o pipefail # (For use in a script; don't run this in the terminal)
Or even more concisely:
set -ueo pipefail # (For use in a script; don't run this in the terminal)
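The effect of each setting can be checked safely with a small sketch like the one below. It is only an illustration: every snippet runs in its own child shell via bash -c, so the settings never touch your terminal session, and the variable name unset_demo_var is made up.

```shell
# Each snippet runs in its own child shell via 'bash -c', so the settings
# do not affect the current terminal session.

# set -u: referencing an unset variable stops the script
u_status=0
u_output=$(bash -c 'set -u; echo "Value: $unset_demo_var"; echo "still running"' 2>/dev/null) || u_status=$?

# set -e: a failing command ('false' always fails) stops the script
e_status=0
e_output=$(bash -c 'set -e; false; echo "still running"') || e_status=$?

# Default behavior: the script carries on after the failure
default_output=$(bash -c 'false; echo "still running"')

# set -o pipefail: a pipeline counts as failed if any command in it fails
pipe_default=$(bash -c 'false | true; echo $?')
pipe_strict=$(bash -c 'set -o pipefail; false | true; echo $?')

echo "set -u:   exit status $u_status, output '$u_output'"
echo "set -e:   exit status $e_status, output '$e_output'"
echo "default:  output '$default_output'"
echo "pipeline: status $pipe_default by default, $pipe_strict with pipefail"
```

In the first two cases, "still running" is never printed because the child shell stops at the error. Only the default run keeps going, which is the “zombie script” behavior described above.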
2.3 Our header lines as a rudimentary script
Let’s go ahead and start a script with the header lines that we have so far discussed.
- Inside your personal directory within /fs/ess/scratch/PAS2250/participants, make a directory called scripts and one called sandbox (e.g. mkdir scripts sandbox, or use the VS Code menus).
- Open a new file in the VS Code editor (File => New File) and save it as printname.sh within the newly created scripts dir.
- Type the following lines in that script (not in your terminal!):
#!/bin/bash
set -ueo pipefail
Already now, we could run (execute) the script. One way of doing this is calling the bash command followed by the name of the script [2]:
bash scripts/printname.sh
Doing this won’t print anything to screen (or file). Since our script doesn’t have any output, that makes sense — no output can be a good sign, because it means that no errors were encountered.
3 Command-line arguments for scripts
3.1 Calling a script with arguments
When you call a script, you can pass it command-line arguments, such as a file to operate on.
This is much like when you provide a command like ls with arguments:
# Run ls without arguments:
ls
# Pass 1 filename as an argument to ls:
ls data/sampleA.fastq.gz
# Pass 2 filenames as arguments to ls, separated by spaces:
ls data/sampleA.fastq.gz data/sampleB.fastq.gz
Let’s see what this would look like with our printname.sh script and a fictional script fastqc.sh:
# Run scripts without any arguments:
bash fastqc.sh # (Fictional script)
bash scripts/printname.sh
# Run scripts with 1 or 2 arguments:
bash fastqc.sh data/sampleA.fastq.gz # 1 argument, a filename
bash scripts/printname.sh John Doe # 2 arguments, strings representing names
In the next section, we’ll see what happens when we pass arguments to a script on the command line.
3.2 Placeholder variables
Inside the script, any command-line arguments are automatically available in placeholder variables.
A first argument will be assigned to the variable $1, any second argument will be assigned to $2, any third argument will be assigned to $3, and so on.
Let’s add code to our printname.sh script to “process” any first and last name that are passed to the script as command-line arguments. For now, our script will simply echo the placeholder variables, so that we can see what happens:
#!/bin/bash
set -ueo pipefail
echo "First name: $1"
echo "Last name: $2"
# (Note: this is a script. Don't enter this directly in your terminal.)
Next, we’ll run the script, passing the arguments John and Doe:
bash scripts/printname.sh John Doe
First name: John
Last name: Doe
On Your Own: Command-line arguments
In each case below, think about what might happen before you run the script. Then, run it, and if you didn’t make a successful prediction, try to figure out what happened instead.
1. Run the script (scripts/printname.sh) without passing arguments to it.
2. Deactivate (“comment out”) the line with set settings by inserting a # as the first character. Then, run the script again without passing arguments to it.
3. Double-quote John Doe when you run the script, i.e. run bash scripts/printname.sh "John Doe"
To get back to where we were, remove the # you inserted in the script in step 2 above.
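If you want to check your predictions afterwards, the three cases can be reproduced with the sketch below. It works on throwaway copies of the script in a temporary directory (an assumption made so that your real scripts/printname.sh is left alone), and it captures each run's output and exit status in variables.

```shell
# Hypothetical demo: work on throwaway copies in a temporary directory
demo_dir=$(mktemp -d)
cat > "$demo_dir/printname.sh" <<'EOF'
#!/bin/bash
set -ueo pipefail
echo "First name: $1"
echo "Last name: $2"
EOF

# Case 1: no arguments. With 'set -u', the unset $1 is an error
no_args_status=0
no_args_out=$(bash "$demo_dir/printname.sh" 2>&1) || no_args_status=$?

# Case 2: same call with the 'set' line commented out. No error, empty names
sed 's/^set /#set /' "$demo_dir/printname.sh" > "$demo_dir/printname_noset.sh"
noset_out=$(bash "$demo_dir/printname_noset.sh")

# Case 3: "John Doe" in quotes is ONE argument, so $2 is unset
quoted_status=0
quoted_out=$(bash "$demo_dir/printname.sh" "John Doe" 2>&1) || quoted_status=$?

echo "Case 1 (exit status $no_args_status): $no_args_out"
echo "Case 2: $noset_out"
echo "Case 3 (exit status $quoted_status): $quoted_out"
```

In case 3, the first line is printed as "First name: John Doe" before the script stops at the unset $2.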
3.3 Descriptive variable names
While you can use the $1-style placeholder variables throughout your script, I find it very useful to copy them to more descriptively named variables as follows:
#!/bin/bash
set -ueo pipefail
first_name=$1
last_name=$2
echo "First name: $first_name"
echo "Last name: $last_name"
# (Note: this is a script. Don't enter this directly in your terminal.)
Using descriptively named variables in your scripts has several advantages. It will make your script easier to understand for others and for yourself. It will also make it less likely that you make errors in your script in which you use the wrong variable in the wrong place.
On Your Own: A script to print a specific line
Write a script that prints a specific line (identified by line number) from a file.
- Open a new file and save it as scripts/printline.sh
- Start with the shebang and set lines
- Your script takes two arguments: a file name ($1) and a line number ($2)
- Copy the $1 and $2 variables to descriptively named variables
- To print a specific line, think how you might combine head and tail to do this. If you’re at a loss, feel free to check out the top solution box.
- Test the script by printing line 4 from data/meta/meta.tsv.
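One possible way to approach this exercise is sketched below. This is not necessarily the intended solution: the variable names input_file and line_nr are my own choices, and the sketch creates a small made-up test file in a temporary directory rather than using data/meta/meta.tsv.

```shell
# Hypothetical demo file with five numbered lines
demo_dir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$demo_dir/demo.txt"

# Write a sketch of printline.sh to the temporary directory
cat > "$demo_dir/printline.sh" <<'EOF'
#!/bin/bash
set -ueo pipefail

input_file=$1
line_nr=$2

# Take the first $line_nr lines, then keep only the last of those
head -n "$line_nr" "$input_file" | tail -n 1
EOF

# Print line 4 of the demo file
line_out=$(bash "$demo_dir/printline.sh" "$demo_dir/demo.txt" 4)
echo "$line_out"
```

The idea is that head trims the file down to the first N lines, and tail -n 1 then picks the last of those, which is line N.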
4 Script variations and improvements
4.1 A script to serve as a starting point
We’ve learned that the head command prints the first lines of a file, whereas the tail command prints the last lines. Sometimes it’s nice to be able to quickly see both ends of a file, so let’s write a little script that can do that for us, as a starting point for the next few modifications.
Open a new file, save it as scripts/headtail.sh, and add the following code to it:
#!/bin/bash
set -ueo pipefail
input_file=$1
head -n 2 "$input_file"
echo "---"
tail -n 2 "$input_file"
# (Note: this is a script. Don't enter this directly in your terminal.)
Next, let’s run our headtail.sh script:
bash scripts/headtail.sh data/meta/meta.tsv
accession location treatment replicate nreads_raw pct_mapped
SRR7609473 beach control 1 45285752 76.01
---
SRR7609467 inland treatment 2 47303936 79.25
SRR7609474 inland treatment 3 55728624 78.80
4.2 Redirecting output to a file
So far, the output of our scripts was printed to screen, e.g.:
- In printname.sh, we simply echo’d, inside sentences, the arguments passed to the script.
- In headtail.sh, we printed the first and last few lines of a file.
All this output was printed to screen because that is the default output mode of Unix commands, and this works the same way regardless of whether those commands are run directly on the command line, or are run inside a script.
Along those same lines, we have already learned that we can “redirect” output to a file using > (write/overwrite) and >> (append) when we run shell commands, and this, too, works exactly the same way inside a script.
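As a quick reminder of the difference between the two operators, here is a minimal sketch that writes to a throwaway file created with mktemp (the file and its contents are made up for illustration):

```shell
# Hypothetical demo file
demo_file=$(mktemp)

echo "first" > "$demo_file"    # Creates/overwrites the file
echo "second" > "$demo_file"   # Overwrites again: "first" is gone
echo "third" >> "$demo_file"   # Appends: "second" is kept

demo_contents=$(cat "$demo_file")
echo "$demo_contents"
```

The file ends up with two lines, second and third, because the second > replaced what the first one had written.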
When working with genomics data, we commonly have files as input, and new/modified files as output. Let’s practice with this and modify our headtail.sh script so that it writes output to a file.
We’ll make the following changes:
- We will have the script accept a second argument: the output file name [3].
- We will redirect the output of our head, echo, and tail commands to the output file. We’ll have to append (using >>) in the last two commands.
#!/bin/bash
set -ueo pipefail
input_file=$1
output_file=$2
head -n 2 "$input_file" > "$output_file"
echo "---" >> "$output_file"
tail -n 2 "$input_file" >> "$output_file"
# (Note: this is a script. Don't enter this directly in your terminal.)
Now we run the script again, this time also passing the name of an output file:
bash scripts/headtail.sh data/meta/meta.tsv sandbox/samples_headtail.txt
The script will no longer print any output to screen, and our output should instead be in sandbox/samples_headtail.txt:
# Check that the file exists and was just modified:
ls -lh sandbox/samples_headtail.txt
-rw-rw-r-- 1 jelmer jelmer 197 Aug 24 20:58 sandbox/samples_headtail.txt
# Print the contents of the file to screen
cat sandbox/samples_headtail.txt
accession location treatment replicate nreads_raw pct_mapped
SRR7609473 beach control 1 45285752 76.01
---
SRR7609467 inland treatment 2 47303936 79.25
SRR7609474 inland treatment 3 55728624 78.80
4.3 Report what’s happening
It is often useful to have your scripts “report” or “log” what is going on. Let’s keep thinking about a script that has file(s) as the main output, but instead of having no output printed to screen at all, we’ll print some logging output to screen. For instance:
- What is the date and time
- Which arguments were passed to the script
- What are the output files
- Perhaps even summaries of the output.
All of this can help with troubleshooting and record-keeping [4]. Let’s try this with our headtail.sh script.
#!/bin/bash
set -ueo pipefail
## Copy placeholder variables
input_file=$1
output_file=$2
## Initial logging
echo "Starting script $0" # Print name of script
date # Print date & time
echo "Input file: $input_file"
echo "Output file: $output_file"
echo # Print empty line to separate initial & final logging
## Print the first and last two lines to a separate file
head -n 2 "$input_file" > "$output_file"
echo "---" >> "$output_file"
tail -n 2 "$input_file" >> "$output_file"
## Final logging
echo "Listing the output file:"
ls -lh "$output_file"
echo "Done with script $0"
date
# (Note: this is a script. Don't enter this directly in your terminal.)
A couple of notes about the lines that were added to the script above:
- Printing the date at the end of the script as well will allow you to check for how long the script ran, which can be informative for longer-running scripts.
- Printing the input and output files (and the command-line arguments more generally) can be particularly useful for troubleshooting.
- We printed a “marker line” like Done with script, indicating that the end of the script was reached. This is handy due to our set settings: seeing this line printed means that no errors were encountered.
- I also added some comment headers like “Initial logging” to make the script easier to read, and such comments can be made more extensive to really explain what is being done.
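If you would rather have the script report its running time directly instead of comparing two date printouts by eye, one option is sketched below. This goes slightly beyond what the script above does: it assumes that date +%s (seconds since 1970) is available, and it uses sleep 1 as a stand-in for the actual work.

```shell
# Record the start time in seconds
start_time=$(date +%s)

# Stand-in for the real work of the script
sleep 1

# Record the end time and compute the difference
end_time=$(date +%s)
elapsed=$(( end_time - start_time ))
echo "Elapsed time: ${elapsed} seconds"
```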
Let’s run the script again:
bash scripts/headtail.sh data/meta/meta.tsv sandbox/tmp.txt
Starting script scripts/headtail.sh
Wed Aug 24 08:58:53 PM CEST 2022
Input file: data/meta/meta.tsv
Output file: sandbox/tmp.txt
Listing the output file:
-rw-rw-r-- 1 jelmer jelmer 197 Aug 24 20:58 sandbox/tmp.txt
Done with script scripts/headtail.sh
Wed Aug 24 08:58:53 PM CEST 2022
The script printed some details for the output file, but not its contents (that would have worked here, but is usually not sensible when working with genomics data). Let’s take a look, though, to make sure the script worked:
cat sandbox/tmp.txt # "cat" prints all of a file's contents
accession location treatment replicate nreads_raw pct_mapped
SRR7609473 beach control 1 45285752 76.01
---
SRR7609467 inland treatment 2 47303936 79.25
SRR7609474 inland treatment 3 55728624 78.80
On Your Own: A fanciful script
Modify your printline.sh script to:
- Redirect output to a file
- This output file should not be “hardcoded” in the script, but its name should be passed as an argument to the script, like we did above with headtail.sh
- Add a bit of reporting: echo statements, date, etc, along the lines of what we did above with headtail.sh
- Add some comments to describe what the code in the script is doing
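A possible version of such a script is sketched below, again as one option rather than the definitive answer. The argument order (input file, output file, line number), the variable names, and the made-up test file in a temporary directory are all assumptions of this sketch.

```shell
# Hypothetical demo file with five numbered lines
demo_dir=$(mktemp -d)
printf 'line1\nline2\nline3\nline4\nline5\n' > "$demo_dir/demo.txt"

# Write a sketch of the modified printline.sh to the temporary directory
cat > "$demo_dir/printline.sh" <<'EOF'
#!/bin/bash
set -ueo pipefail

## Copy placeholder variables
input_file=$1
output_file=$2
line_nr=$3

## Initial logging
echo "Starting script $0"
date
echo "Input file: $input_file"
echo "Output file: $output_file"
echo "Line number: $line_nr"
echo

## Write the requested line to the output file
head -n "$line_nr" "$input_file" | tail -n 1 > "$output_file"

## Final logging
echo "Listing the output file:"
ls -lh "$output_file"
echo "Done with script $0"
date
EOF

# Run the script: the logging goes to screen, line 4 goes to the output file
log_out=$(bash "$demo_dir/printline.sh" "$demo_dir/demo.txt" "$demo_dir/line.txt" 4)
echo "$log_out"
file_out=$(cat "$demo_dir/line.txt")
echo "Output file contents: $file_out"
```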
Footnotes
1. But note that at OSC, you would not be able to remove anything you’re not supposed to, since you don’t have the permissions to do so. On your own computer, this could be more genuinely dangerous, though even there, you would not be able to remove the operating system without specifically requesting “admin” rights.
2. Because our script has a shebang line, we could also execute the script without the bash command using ./printname.sh. However, this would also require us to “make the script executable”, which is beyond the scope of this workshop.
3. Of course, we could also simply write the output to a predefined (“hardcoded”) file name such as out.txt, but in general, it’s better practice to keep this flexible via an argument.
4. We’ll see in the upcoming SLURM module that when we submit scripts to the OSC queue (rather than running them directly), the output of scripts that is normally printed to screen will instead go to a sort of “log” file. So, your script’s reporting will end up in this file.