Shell Scripting

Author

Jelmer Poelstra

Shell scripts (or to be slightly more precise, Bash scripts) enable us to run sets of commands non-interactively. This is especially beneficial or necessary when a set of commands:

Takes a long time to run and/or
Should be run many times, e.g. for different samples

Scripts form the basis for analysis pipelines and if we code things cleverly, it should be straightforward to rerun much of our project workflow:

After removing or adding a sample
For different parameter settings
And possibly even for an entirely different dataset.

1 Setup

Starting a VS Code session with an active terminal (click here)

Log in to OSC at https://ondemand.osc.edu.
In the blue top bar, select Interactive Apps and then Code Server.
In the form that appears:
- Enter 4 or more in the box Number of hours
- To avoid having to switch folders within VS Code, enter /fs/ess/scratch/PAS2250/participants/<your-folder> in the box Working Directory (replace <your-folder> by the actual name of your folder).
- Click Launch.
On the next page, once the top bar of the box is green and says Runnning, click Connect to VS Code.
Open a terminal: => Terminal => New Terminal.
In the terminal, type bash and press Enter.
Type pwd in the termain to check you are in /fs/ess/scratch/PAS2250.

If not, click => File => Open Folder and enter /fs/ess/scratch/PAS2250/<your-folder>.

2 Script header lines and zombie scripts

2.1 Shebang line

We use a so-called “shebang” line as the first line of a script to indicate which language our script uses. More specifically, this line tell the computer where to find the binary (executable) that will run our script.

Such a line starts with #!, basically marking it as a special type of comment. After that, we provide the location to the relevant program: in our case Bash, which is located at /bin/bash on Linux and Mac computers.

#!/bin/bash

Adding a shebang line is good practice in general, and is necessary when we want to submit our script to OSC’s Slurm queue, which we’ll do tomorrow.

2.2 Bash script settings

Another line that is good practice to add to your Bash scripts changes some default settings to safer alternatives. The following two Bash default settings are bad ideas inside scripts:

First, and as we’ve seen in the previous module, Bash does not complain when you reference a variable that does not exist (in other words, it does not consider that an error).

In scripts, this can lead to all sorts of downstream problems, because you very likely tried and failed to do something with an actual variable. Even more problematically, it can lead to potentially very destructive file removal:

# Using a variable, we try to remove some temporary files whose names start with tmp_
temp_prefix="temp_"
rm "$tmp_prefix"*     # DON'T TRY THIS!

# Using a variable, we try to remove a temporary directory
tempdir=output/tmp
rm -rf $tmpdir/*      # DON'T TRY THIS!

The comments above specified the intent we had. What would have actually happened?

In both examples, there is a similar typo: temp vs. tmp, which means that we are referencing a (likely) non-existent variable.

In the first example, rm "$tmp_prefix"* would have been interpreted as rm *, because the non-existent variable is simply ignored. Therefore, we would have removed all files in the current working directory.
In the second example, along similar lines, rm -rf $tmpdir/* would have been interpreted as rm -rf /*. Horrifyingly, this would attempt to remove the entire filesystem (recall that a leading / in a path is a computer’s root directory).¹ (-r makes the removal recursive and -f makes forces removal).

Second, a Bash script keeps running after encountering errors. That is, if an error is encountered when running line 2 of a script, any remaining lines in the script will nevertheless be executed.

In the best case, this is a waste of computer resources, and in worse cases, it can lead to all kinds of unintended consequences. Additionally, if your script prints a lot of output, you might not notice an error somewhere in the middle if it doesn’t produce more errors downstream. But the downstream results from what we at that point might call a “zombie script” may still be completely wrong.

The following three settings will make your Bash scripts more robust and safer. With these settings, the script terminates, with an appropriate error message, if:

set -u — An unset (non-existent) variable is referenced.
set -e — Almost any error occurs.
set -o pipefail — An error occurs in a shell “pipeline” (e.g., sort | uniq).

We can change all of these settings in one line in a script:

set -u -e -o pipefail     # (For in a script - don't run in the terminal)

Or even more concisely:

set -ueo pipefail         # (For in a script - don't run in the terminal)

2.3 Our header lines as a rudimentary script

Let’s go ahead and start a script with the header lines that we have so far discussed.

Inside your personal directory within /fs/ess/scratch/PAS2250/participants, make a directory called scripts and one called sandbox (e.g. mkdir scripts sandbox, or use the VS Code menus.
Open a new file in the VS Code editor ( => File => New File) and save it as printname.sh within the newly created scripts dir.

Shell scripts, including Bash scripts, most commonly have the extension .sh
Type the following lines in that script (not in your terminal!):
```
#!/bin/bash
set -ueo pipefail
```

Already now, we could run (execute) the script. One way of doing this is calling the bash command followed by the name of the script²:

bash scripts/printname.sh

Doing this won’t print anything to screen (or file). Since our script doesn’t have any output, that makes sense — no output can be a good sign, because it means that no errors were encountered.

3 Command-line arguments for scripts

3.1 Calling a script with arguments

When you call a script, you can pass it command-line arguments, such as a file to operate on.

This is much like when you provide a command like ls with arguments:

# Run ls without arguments:
ls

# Pass 1 filename as an argument to ls:
ls data/sampleA.fastq.gz

# Pass 2 filenames as arguments to ls, separated by spaces:
ls data/sampleA.fastq.gz data/sampleB.fastq.gz

Let’s see what this would look like with our printname.sh script and a fictional script fastqc.sh:

# Run scripts without any arguments:
bash fastqc.sh                            # (Fictional script)
bash scripts/printname.sh

# Run scripts with 1 or 2 arguments:
bash fastqc.sh data/sampleA.fastq.gz      # 1 argument, a filename
bash scripts/printname.sh John Doe        # 2 arguments, strings representing names

In the next section, we’ll see what happens when we pass arguments to a script on the command line.

3.2 Placeholder variables

Inside the script, any command-line arguments are automatically available in placeholder variables.

A first argument will be assigned to the variable $1, any second argument will be assigned to $2, any third argument will be assigned to $3, and so on.

In the calls to fastqc.sh and printname.sh above, what are the placeholder variables and their values?

In bash fastqc.sh data/sampleA.fastq.gz, a single argument, data/sampleA.fastq.gz, is passed to the script, and will be assigned to $1.

In bash scripts/printname.sh John Doe, two arguments are passed to the script: the first one (John) will be stored in $1, and the second one (Doe) in $2.

Placeholder variables are not automagically used

Arguments passed to a script are merely made available in placeholder variables — unless we explicitly include code in the script to do something with those variables, nothing else happens.

Let’s add code to our printname.sh script to “process” any first and last name that are passed to the script as command-line arguments. For now, our script will simply echo the placeholder variables, so that we can see what happens:

#!/bin/bash
set -ueo pipefail

echo "First name: $1"
echo "Last name: $2"

# (Note: this is a script. Don't enter this directly in your terminal.)

Next, we’ll run the script, passing the arguments John and Doe:

bash scripts/printname.sh John Doe

First name: John
Last name: Doe

On Your Own: Command-line arguments

In each case below, think about what might happen before you run the script. Then, run it, and if you didn’t make a successful prediction, try to figure out what happened instead.

Run the script (scripts/printname.sh) without passing arguments to it.
Deactivate (“comment out”) the line with set settings by inserting a # as the first character. Then, run the script again without passing arguments to it.
Double-quote John Doe when you run the script, i.e. run bash scripts/printname.sh "John Doe"

To get back to where we were, remove the # you inserted in the script in step 2 above.

Solutions

The script will error out because we are referencing variables that don’t exist: since we didn’t pass command-line arguments to the script, the $1 and $2 have not been set.

bash scripts/printname.sh

printname.sh: line 4: $1: unbound variable

The script will run in its entirety and not throw any errors, because we are now using default Bash settings such that referencing non-existent variables does not throw an error. Of course, no names are printed either, since we didn’t specify any:

bash scripts/printname.sh

First name: 
Last name:

Being commented out, the set line should read:

#set -ueo pipefail

Because we are quoting "John Doe", both names are passed as a single argument and both names end up in $1, the “first name”:

bash scripts/printname.sh "John Doe"

First name: John Doe
Last name:

3.3 Descriptive variable names

While you can use the $1-style placeholder variables throughout your script, I find it very useful to copy them to more descriptively named variables as follows:

#!/bin/bash
set -ueo pipefail

first_name=$1
last_name=$2
  
echo "First name: $first_name"
echo "Last name: $last_name"

# (Note: this is a script. Don't enter this directly in your terminal.)

Using descriptively named variables in your scripts has several advantages. It will make your script easier to understand for others and for yourself. It will also make it less likely that you make errors in your script in which you use the wrong variable in the wrong place.

Other variables that are automatically available inside scripts

$0 contains the script’s file name
$# contains the number of command-line arguments passed

On Your Own: A script to print a specific line

Write a script that prints a specific line (identified by line number) from a file.

Open a new file and save it as scripts/printline.sh
Start with the shebang and set lines
Your script takes two arguments: a file name ($1) and a line number ($2)
Copy the $1 and $2 variables to descriptively named variables
To print a specific line, think how you might combine head and tail to do this. If you’re at a loss, feel free to check out the top solution box.
Test the script by printing line 4 from data/meta/meta.tsv.

Solution: how to print a specific line number

For example, to print line 4 of data/meta/meta.tsv directly:

head -n 4 data/meta/meta.tsv | tail -n 1

Just note that in the script, you’ll be using variables instead of the “hardcode values” 4 and data/meta/meta.tsv.

How this command works:

head -n 4 data/meta/meta.tsv will print the first 4 lines of data/meta/meta.tsv
We pipe those 4 lines into the tail command
We ask tail to just print the last line of its input, which will in this case be line 4 of the original input file.

Full solution

#!/bin/bash
set -ueo pipefail
  
input_file=$1
line_nr=$2

head -n "$line_nr" "$input_file" | tail -n 1

To run the script and make it print the 4th line of meta.tsv:

bash scripts/printline.sh data/meta/meta.tsv 4

SRR7609471  beach   control 3   40982374    78.70

4 Script variations and improvements

4.1 A script to serve as a starting point

We’ve learned that the head command prints the first lines of a file, whereas the tail command prints the last lines. Sometimes it’s nice to be able to quickly see both ends of a file, so let’s write a little script that can do that for us, as a starting point for the next few modifications.

Open a new file, save it as scripts/headtail.sh, and add the following code to it:

#!/bin/bash
set -ueo pipefail

input_file=$1

head -n 2 "$input_file"
echo "---"
tail -n 2 "$input_file"

# (Note: this is a script. Don't enter this directly in your terminal.)

Next, let’s run our headtail.sh script:

bash scripts/headtail.sh data/meta/meta.tsv

accession   location    treatment   replicate   nreads_raw  pct_mapped
SRR7609473  beach   control 1   45285752    76.01
---
SRR7609467  inland  treatment   2   47303936    79.25
SRR7609474  inland  treatment   3   55728624    78.80

4.2 Redirecting output to a file

So far, the output of our scripts was printed to screen, e.g.:

In printnames.sh, we simply echo’d, inside sentences, the arguments passed to the script.
In headtail.sh, we printed the first and last few lines of a file.

All this output was printed to screen because that is the default output mode of Unix commands, and this works the same way regardless of whether those commands are run directly on the command line, or are run inside a script.

Along those same lines, we have already learned that we can “redirect” output to a file using > (write/overwrite) and >> (append) when we run shell commands — and this, too, works exactly the same way inside a script.

When working with genomics data, we commonly have files as input, and new/modified files as output. Let’s practice with this and modify our headtail.sh script so that it writes output to a file.

We’ll make the following changes:

We will have the script accept a second argument: the output file name³.
We will redirect the output of our head, echo, and tail commands to the output file. We’ll have to append (using >>) in the last two commands.

#!/bin/bash
set -ueo pipefail

input_file=$1
output_file=$2

head -n 2 "$input_file" > "$output_file"
echo "---" >> "$output_file"
tail -n 2 "$input_file" >> "$output_file"

# (Note: this is a script. Don't enter this directly in your terminal.)

Now we run the script again, this time also passing the name of an output file:

bash scripts/headtail.sh data/meta/meta.tsv sandbox/samples_headtail.txt

The script will no longer print any output to screen, and our output should instead be in sandbox/samples_headtail.txt:

# Check that the file exists and was just modified:
ls -lh sandbox/samples_headtail.txt

-rw-rw-r-- 1 jelmer jelmer 197 Aug 24 20:58 sandbox/samples_headtail.txt

# Print the contents of the file to screen
cat sandbox/samples_headtail.txt

accession   location    treatment   replicate   nreads_raw  pct_mapped
SRR7609473  beach   control 1   45285752    76.01
---
SRR7609467  inland  treatment   2   47303936    79.25
SRR7609474  inland  treatment   3   55728624    78.80

4.3 Report what’s happening

It is often useful to have your scripts “report” or “log” what is going on. Let’s keep thinking about a script that has file(s) as the main output, but instead of having no output printed to screen at all, we’ll print some logging output to screen. For instance:

What is the date and time
Which arguments were passed to the script
What are the output files
Perhaps even summaries of the output.

All of this can help with troubleshooting and record-keeping.⁴ Let’s try this with our headtail.sh script.

#!/bin/bash
set -ueo pipefail

## Copy placeholder variables
input_file=$1
output_file=$2

## Initial logging 
echo "Starting script $0"           # Print name of script
date                                # Print date & time
echo "Input file:   $input_file"
echo "Output file:  $output_file" 
echo                                # Print empty line to separate initial & final logging

## Print the first and last two lines to a separate file
head -n 2 "$input_file" > "$output_file"
echo "---" >> "$output_file"
tail -n 2 "$input_file" >> "$output_file"

## Final logging
echo "Listing the output file:"
ls -lh "$output_file"
echo "Done with script $0"
date

# (Note: this is a script. Don't enter this directly in your terminal.)

A couple of notes about the lines that were added to the script above:

Printing the date at the end of the script as well will allow you to check for how long the script ran, which can be informative for longer-running scripts.
Printing the input and output files (and the command-line arguments more generally) can be particularly useful for troubleshooting
We printed a “marker line” like Done with script, indicating that the end of the script was reached. This is handy due to our set settings: seeing this line printed means that no errors were encountered.
I also added some comment headers like “Initial logging” to make the script easier to read, and such comments can be made more extensive to really explain what is being done.

Let’s run the script again:

bash scripts/headtail.sh data/meta/meta.tsv sandbox/tmp.txt

Starting script scripts/headtail.sh
Wed Aug 24 08:58:53 PM CEST 2022
Input file:   data/meta/meta.tsv
Output file:  sandbox/tmp.txt

Listing the output file:
-rw-rw-r-- 1 jelmer jelmer 197 Aug 24 20:58 sandbox/tmp.txt
Done with script scripts/headtail.sh
Wed Aug 24 08:58:53 PM CEST 2022

The script printed some details for the output file, but not its contents (that would have worked here, but is usually not sensible when working with genomics data). Let’s take a look, though, to make sure the script worked:

cat sandbox/tmp.txt      # "cat" prints all of a file's contents

accession   location    treatment   replicate   nreads_raw  pct_mapped
SRR7609473  beach   control 1   45285752    76.01
---
SRR7609467  inland  treatment   2   47303936    79.25
SRR7609474  inland  treatment   3   55728624    78.80

echo, echo

The extensive reporting (echo-ing) may have seemed silly for our little script, but fairly extensive reporting (as well as testing, but that’s outside the scope of this workshop) can be very useful — and will be eventually a time-saver.

This is especially true for long-running scripts, or scripts that you often reuse and perhaps share with others.

On Your Own: A fanciful script

Modify your printline.sh script to:

Redirect output to a file
This output file should not be “hardcoded” in the script, but its name should be passed as an argument to the script, like we did above with headtail.sh
Add a bit of reporting — echo statements, date, etc, along the lines of what we did above with headtail.sh
Add some comments to describe what the code in the script is doing

The original printline.sh script

#!/bin/bash
set -ueo pipefail
  
input_file=$1
line_nr=$2

head -n "$line_nr" "$input_file" | tail -n 1

(One possible) solution

#!/bin/bash
set -ueo pipefail

## Copy placeholder variables
input_file=$1
output_file=$2
line_nr=$3

## Initial logging 
echo "Starting script $0"           # Print name of script
date                                # Print date & time
echo "Input file:   $input_file"
echo "Output file:  $output_file"
echo "Line number:  $line_nr"
echo                                # Print empty line to separate initial & final logging

## Print 1 specific line from the input file and redirect to an output file
head -n "$line_nr" "$input_file" | tail -n 1 > $output_file

## Final logging
echo "Listing the output file:"
ls -lh "$output_file"
echo "Done with script $0"
date

To run the script with the additional argument:

bash scripts/printline.sh data/meta/meta.tsv sandbox/meta_line.tsv 4

Starting script scripts/printline.sh
Wed Aug 24 08:58:53 PM CEST 2022
Input file:   data/meta/meta.tsv
Output file:  sandbox/meta_line.tsv
Line number:  4

Listing the output file:
-rw-rw-r-- 1 jelmer jelmer 42 Aug 24 20:58 sandbox/meta_line.tsv
Done with script scripts/printline.sh
Wed Aug 24 08:58:53 PM CEST 2022

Footnotes

But note that at OSC, you would not be able to remove anything you’re not supposed to, since you don’t have the permissions to do so. On your own computer, this could be more genuinely dangerous, though even there, you would not be able to remove the operating system without specifically requesting “admin” rights.↩︎
Because our script has a shebang line, we could also execute the script without the bash command using ./printname.sh. However, this would also require us to “make the script executable”, which is beyond the scope of this workshop.↩︎
Of course, we could also simply write the output to a predefined (“hardcoded”) file name such as out.txt, but in general, it’s better practice to keep this flexible via an argument.↩︎
We’ll see in the upcoming SLURM module that we when submit scripts to the OSC queue (rather than running them directly), the output of scripts that is normally printed to screen, will instead go to a sort of “log” file. So, your script’s reporting will end up in this file.↩︎