Variables, Globbing, and Loops
In this module, we will cover a few topics that are good to know about before you start writing and running shell scripts:
- Using variables will allow you to run scripts flexibly, with different input files and settings.
- for loops will allow you to repeat operations: specifically, we will later use them to submit many scripts at the same time, one per input file or sample.
- We’ll be selecting files with wildcards (“globbing”) to loop over FASTQ files.
These are valuable skills in general: globbing is an essential technique in the Unix shell, and variables and for loops are ubiquitous programming concepts.
1 Setup
Starting a VS Code session with an active terminal:
- Log in to OSC at https://ondemand.osc.edu.
- In the blue top bar, select Interactive Apps and then Code Server.
- In the form that appears:
  - Enter 4 or more in the box Number of hours.
  - To avoid having to switch folders within VS Code, enter /fs/ess/scratch/PAS2250/participants/<your-folder> in the box Working Directory (replace <your-folder> by the actual name of your folder).
  - Click Launch.
- On the next page, once the top bar of the box is green and says Running, click Connect to VS Code.
- Open a terminal: => Terminal => New Terminal.
- In the terminal, type bash and press Enter.
- Type pwd in the terminal to check you are in /fs/ess/scratch/PAS2250. If not, click => File => Open Folder and enter /fs/ess/scratch/PAS2250/<your-folder>.
2 Variables
In programming, we use variables for things that:
- We refer to repeatedly and/or
- Are subject to change.
These tend to be settings like the paths to input and output files, and parameter values for programs.
Using variables makes it easier to change such settings. We also need to understand variables to work with loops and with scripts.
2.1 Assigning and referencing variables
To assign a value to a variable in Bash (in short: to assign a variable), use the syntax variable=value:
# Assign the value "beach" to the variable "location":
location=beach
# Assign the value "200" to the variable "nlines":
nlines=200
To reference a variable (i.e., to access its value), you need to put a dollar sign $ in front of its name. We’ll use the echo command to review the values that our variables contain:
echo $location
beach
echo $nlines
200
Conveniently, we can directly use variables in lots of contexts, as if we had instead typed their values:
input_file=data/fastq/SRR7609467.fastq.gz
ls -lh $input_file
-rw-r--r-- 1 jelmer jelmer 8.3M Aug 16 13:45 data/fastq/SRR7609467.fastq.gz
ls_options="-lh" # (We'll talk about the quotes that are used here later)
ls $ls_options data/meta
total 4.0K
-rw-r--r-- 1 jelmer jelmer 583 Aug 16 10:36 meta.tsv
2.2 Rules and tips for naming variables
Variable names:
- Can contain letters, numbers, and underscores
- Cannot contain spaces, periods, or other special symbols
- Cannot start with a number
Try to make your variable names descriptive, like $input_file and $ls_options above, as opposed to, say, $x and $bla.
There are multiple ways of distinguishing words in the absence of spaces, such as $inputFile and $input_file: I prefer the latter, which is called “snake case”, and I always use lowercase.
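As a minimal sketch of the naming rules above (the variable names and the file name are hypothetical, chosen only for illustration): a valid name is assigned as expected, whereas a name that starts with a number is not recognized as an assignment at all, so Bash tries to run the whole word as a command and fails.

```bash
# Valid name: letters, numbers, and underscores, not starting with a number
input_file_2=sample2.txt        # hypothetical file name
echo "$input_file_2"

# Invalid name (starts with a number): Bash does not treat this as an
# assignment but tries to run it as a command, which fails.
# The error message is hidden here, and we print our own message instead.
2nd_file=sample2.txt 2>/dev/null || echo "2nd_file is not a valid variable name"
```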
2.3 Quoting variables
Above, we learned that a variable name cannot contain spaces. But what happens if our variable’s value contains spaces? First off, when we try to assign the variable without using quotes, we get an error:
today=Thu, Aug 18
Aug: command not found
But it works when we quote (with double quotes, "...") the entire string that makes up the value:
today="Thu, Aug 18"
echo $today
Thu, Aug 18
Now, let’s try to reference this variable in another context. Note that the touch command can create new files, e.g. touch a.txt creates the file a.txt. So let’s try to make a new file with today’s date:
touch README_$today.txt
ls
18.txt
Aug
README_Thu,
Like with assignment, our problems can be avoided by quoting a variable when we reference it:
touch README_"$today".txt
# This will list the most recently modified file (ls -t sorts by last modified date):
ls -t | head -n 1
README_Thu, Aug 18.txt
It is good practice to quote variables when you reference them: it rarely hurts, and it avoids surprises like the one above.
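To make visible what goes wrong without quotes, here is a small sketch that uses printf to print each argument it receives on its own line, wrapped in square brackets (printf repeats its format string once per argument). The brackets are only there for illustration.

```bash
today="Thu, Aug 18"

# Unquoted: the shell splits the value on spaces into three separate arguments
printf '[%s]\n' $today
# [Thu,]
# [Aug]
# [18]

# Quoted: the value is passed along as one single argument
printf '[%s]\n' "$today"
# [Thu, Aug 18]
```

This is the same splitting that made touch create three files instead of one.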
2.4 Command substitution
If you want to store the result of a command in a variable, you can use a construct called “command substitution” by wrapping the command inside $().
Let’s see an example. The date command will print the current date and time:
date
Wed Aug 24 08:59:02 PM CEST 2022
If we try to store the date in a variable directly, it doesn’t work: the literal string “date” is stored, not the output of the command:
today=date
echo "$today"
date
That’s why we need command substitution with $():
today=$(date)
echo "$today"
Wed Aug 24 08:59:02 PM CEST 2022
In practice, you might use command substitution with date to include the current date in file names. To do so, first, note that we can use date +%F to print the date in YYYY-MM-DD format, and omit the time:
date +%F
2022-08-24
Let’s use that in a command substitution, but a bit differently than before: we use the command substitution $(date +%F) directly in our touch command, rather than first assigning it to a variable:
# Create a file with the current date in its name:
touch README_"$(date +%F)".txt
# Check the name of our newly created file:
ls -t | head -n 1
README_2022-08-24.txt
Among many other uses, command substitution is handy when you want your script to report some results, or when a next step in the script depends on a previous result.
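As a rough sketch of reporting a result from within a script, the example below counts the files in a directory and stores that count in a variable. The scratch directory (made with mktemp -d) and the file names are hypothetical and only serve to keep the example self-contained.

```bash
# Hypothetical scratch directory with two empty files, for illustration only
demo_dir=$(mktemp -d)
touch "$demo_dir/sample_A.txt" "$demo_dir/sample_B.txt"

# Store the output of a small pipeline (the number of files) in a variable
n_files=$(ls "$demo_dir" | wc -l)

# Report the result
echo "Found $n_files files in $demo_dir"

# Clean up the scratch directory
rm -r "$demo_dir"
```

Note that the whole pipeline, not just a single command, can go inside $().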
On Your Own: Command substitution
Say we wanted to store and report the number of lines in a file, which can be a good QC measure for FASTQ and other genomic data files.
wc -l gets you the number of lines, and you can use a trick to omit the filename:
wc -l data/fastq/SRR7609472.fastq.gz
30387 data/fastq/SRR7609472.fastq.gz
# Use `<` (input redirection) to omit the filename:
wc -l < data/fastq/SRR7609472.fastq.gz
30387
Use command substitution to store the output of the last command in a variable, and then use an echo command to print:
The file has 30387 lines
2.5 At-home reading: Environment variables
3 Globbing with Shell wildcard expansion
Shell wildcard expansion is a very useful technique to select files. Selecting files with wildcard expansion is called globbing.
3.1 Shell wildcards
In the term “wildcard expansion”, wildcard refers to a few symbols that have a special meaning: specifically, they match certain characters in file names. We’ll see below what expansion refers to.
Here, we’ll only talk about the most-used wildcard, *, in detail. But for the sake of completeness, I list them all below:
Wildcard | Matches |
---|---|
* | Any number of any character, including nothing |
? | Any single character |
[] and [^] | One ([]) or everything except one ([^]) of the “character set” within brackets |
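Since only * is covered in detail, here is a minimal sketch of all the wildcards in the table side by side. It assumes a hypothetical scratch directory with made-up file names, so it can be run without touching the FASTQ files.

```bash
# Hypothetical scratch directory and file names, for illustration only
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch sampleA.txt sampleB.txt sampleC.txt sample10.txt

echo sample*.txt       # sample10.txt sampleA.txt sampleB.txt sampleC.txt
echo sample?.txt       # sampleA.txt sampleB.txt sampleC.txt
echo sample[AB].txt    # sampleA.txt sampleB.txt
echo sample[^AB].txt   # sampleC.txt

cd - > /dev/null       # Return to the previous directory
```

Note that sample10.txt is not matched by sample?.txt, because ? stands for exactly one character. The scratch directory can be removed afterwards with rm -r "$demo_dir".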
3.2 The * wildcard and wildcard expansion
A first example of using *, to match all files in a directory:
ls data/fastq/*
data/fastq/SRR7609467.fastq.gz
data/fastq/SRR7609468.fastq.gz
data/fastq/SRR7609469.fastq.gz
data/fastq/SRR7609470.fastq.gz
data/fastq/SRR7609471.fastq.gz
data/fastq/SRR7609472.fastq.gz
data/fastq/SRR7609473.fastq.gz
data/fastq/SRR7609474.fastq.gz
data/fastq/SRR7609475.fastq.gz
data/fastq/SRR7609476.fastq.gz
data/fastq/SRR7609477.fastq.gz
data/fastq/SRR7609478.fastq.gz
Of course ls data/fastq would have shown the same files, but what happens under the hood is different:
- ls data/fastq: The ls command detects and lists all files in the directory.
- ls data/fastq/*: The wildcard * is expanded to all matching files (in this case, all the files in this directory), and then that list of files is passed to ls. This command is therefore equivalent to running:
ls data/fastq/SRR7609467.fastq.gz data/fastq/SRR7609468.fastq.gz data/fastq/SRR7609469.fastq.gz data/fastq/SRR7609470.fastq.gz data/fastq/SRR7609471.fastq.gz data/fastq/SRR7609472.fastq.gz data/fastq/SRR7609473.fastq.gz data/fastq/SRR7609474.fastq.gz data/fastq/SRR7609475.fastq.gz data/fastq/SRR7609476.fastq.gz data/fastq/SRR7609477.fastq.gz data/fastq/SRR7609478.fastq.gz
To see this, note that we don’t need to use ls at all to get a listing of these files!
echo data/fastq/*
data/fastq/SRR7609467.fastq.gz data/fastq/SRR7609468.fastq.gz data/fastq/SRR7609469.fastq.gz data/fastq/SRR7609470.fastq.gz data/fastq/SRR7609471.fastq.gz data/fastq/SRR7609472.fastq.gz data/fastq/SRR7609473.fastq.gz data/fastq/SRR7609474.fastq.gz data/fastq/SRR7609475.fastq.gz data/fastq/SRR7609476.fastq.gz data/fastq/SRR7609477.fastq.gz data/fastq/SRR7609478.fastq.gz
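Because the expansion is done by the shell and not by the command, it can be switched off by quoting the pattern, and a pattern that matches nothing is (with Bash’s default settings) passed along unchanged. A small sketch, again assuming a hypothetical scratch directory with made-up files:

```bash
# Hypothetical scratch directory and file names, for illustration only
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch a.txt b.txt

echo *.txt       # Expanded by the shell before echo runs: a.txt b.txt
echo "*.txt"     # Quoted, so no expansion takes place: *.txt
echo *.fastq     # Nothing matches, so the pattern is left as is: *.fastq

cd - > /dev/null # Return to the previous directory
```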
A few more examples:
# This will still list all 12 FASTQ files --
# can be a good pattern to use to make sure you're not selecting other types of files
ls data/fastq/*fastq.gz
data/fastq/SRR7609467.fastq.gz
data/fastq/SRR7609468.fastq.gz
data/fastq/SRR7609469.fastq.gz
data/fastq/SRR7609470.fastq.gz
data/fastq/SRR7609471.fastq.gz
data/fastq/SRR7609472.fastq.gz
data/fastq/SRR7609473.fastq.gz
data/fastq/SRR7609474.fastq.gz
data/fastq/SRR7609475.fastq.gz
data/fastq/SRR7609476.fastq.gz
data/fastq/SRR7609477.fastq.gz
data/fastq/SRR7609478.fastq.gz
# Only select the ...67.fastq.gz, ...68.fastq.gz, and ...69.fastq.gz files
ls data/fastq/SRR760946*fastq.gz
data/fastq/SRR7609467.fastq.gz
data/fastq/SRR7609468.fastq.gz
data/fastq/SRR7609469.fastq.gz
3.3 Common uses of globbing
What can we use this for, other than listing matching files? Below, we’ll use globbing to select files to loop over. Even more commonly, we can use this to move (mv), copy (cp), or remove (rm) multiple files at once. For example:
cp data/fastq/SRR760946* . # Copy 3 FASTQ files to your working dir
ls *fastq.gz # Check if they're here
SRR7609467.fastq.gz
SRR7609468.fastq.gz
SRR7609469.fastq.gz
rm *fastq.gz # Remove all FASTQ files in your working dir
ls *fastq.gz # Check if they're here
ls: cannot access '*fastq.gz': No such file or directory
Finally, let’s use globbing to remove the mess of files we made when learning about variables:
rm README_*
rm Aug 18.txt
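Because rm deletes files permanently, one cautious habit (a suggestion, not a requirement) is to preview a pattern with ls before handing it to rm. The sketch below assumes a hypothetical scratch directory with made-up file names:

```bash
# Hypothetical scratch directory and file names, for illustration only
demo_dir=$(mktemp -d)
cd "$demo_dir"
touch README_a.txt README_b.txt notes.txt

# First preview which files the pattern matches...
ls README_*

# ...and only then remove them
rm README_*

ls               # Only notes.txt is left
cd - > /dev/null # Return to the previous directory
```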
4 For loops
Loops are a universal element of programming languages, and are used to repeat operations, such as when you want to run the same script or command for multiple files.
Here, we’ll only cover what is by far the most common type of loop: the for loop.
for loops iterate over a collection, such as a list of files: that is, they allow you to perform one or more actions for each element in the collection, one element at a time.
4.1 for loop syntax and mechanics
Let’s see a first example, where our “collection” is just a very short list of numbers (1, 2, and 3):
for a_number in 1 2 3; do
    echo "In this iteration of the loop, the number is $a_number"
    echo "--------"
done
In this iteration of the loop, the number is 1
--------
In this iteration of the loop, the number is 2
--------
In this iteration of the loop, the number is 3
--------
for loops contain the following mandatory keywords:
Keyword | Purpose |
---|---|
for | After for, we set the variable name |
in | After in, we specify the collection we are looping over |
do | After do, we have one or more lines specifying what to do with each item |
done | Tells the shell we are done with the loop |
The aspect that is perhaps most difficult to understand is that in each iteration of the loop, one element in the collection (in the example above, either 1, 2, or 3) is being assigned to the variable specified after for (in the example above, a_number).
It is also important to realize that the loop runs sequentially for each item in the collection, and will run exactly as many times as there are items in the collection.
The following example, where we let the computer sleep for 1 second before printing the date and time with the date command, demonstrates that the loop is being executed sequentially:
for a_number in 1 2 3; do
    echo "In this iteration of the loop, the number is $a_number"
    sleep 1s         # Let the computer sleep for 1 second
    date             # Print the date and time
    echo "--------"
done
In this iteration of the loop, the number is 1
Wed Aug 24 08:59:03 PM CEST 2022
--------
In this iteration of the loop, the number is 2
Wed Aug 24 08:59:04 PM CEST 2022
--------
In this iteration of the loop, the number is 3
Wed Aug 24 08:59:05 PM CEST 2022
--------
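The two points made above (one item is assigned to the variable per iteration, and the loop runs once per item) can be checked with a small counter. This is only a sketch: the variable names are made up, and the $(( )) syntax for arithmetic is not otherwise covered in this module.

```bash
n_iterations=0

for color in red green blue; do
    # Add 1 to the counter in every iteration
    n_iterations=$((n_iterations + 1))
    echo "Iteration $n_iterations: the variable 'color' now holds '$color'"
done

echo "The loop ran $n_iterations times"
echo "After the loop, 'color' still holds the last item: $color"
```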
On Your Own: A simple loop
Create a loop that will print:
morel is an Ohio mushroom
destroying_angel is an Ohio mushroom
eyelash_cup is an Ohio mushroom
4.2 Looping over files with globbing
In practice, we rarely manually list the collection of items we want to loop over. Instead, we commonly loop over files directly using globbing:
# We make sure we only select gzipped FASTQ files using the `*fastq.gz` glob
for fastq_file in data/raw/*fastq.gz; do
    echo "File $fastq_file has $(wc -l < "$fastq_file") lines."
    # More processing...
done
This technique is extremely useful, and I use it all the time. Take a moment to realize that we’re not doing a separate ls and storing the results: as mentioned, we can directly use a globbing pattern to select our files.
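To try this pattern without the FASTQ files at hand, here is a self-contained sketch that combines a glob, a for loop, quoting, and command substitution. The scratch directory and the two small text files are hypothetical and created only for this example.

```bash
# Hypothetical scratch directory with two small text files, for illustration only
demo_dir=$(mktemp -d)
printf 'line1\nline2\n' > "$demo_dir/sampleA.txt"
printf 'line1\n' > "$demo_dir/sampleB.txt"

# Loop over all .txt files in the directory, selected with a glob
for text_file in "$demo_dir"/*.txt; do
    # Store the number of lines in the current file
    n_lines=$(wc -l < "$text_file")
    echo "File $text_file has $n_lines lines."
done
```

Only the variable part of the path is quoted in "$demo_dir"/*.txt, because a quoted * would not be expanded.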
If needed, you can use your globbing / wildcard skills to narrow down the file selection:
# Perhaps we only want to select R1 files (forward reads):
for fastq_file in data/raw/*R1*fastq.gz; do
    # Some file processing, for example:
    echo "$fastq_file"
done
# Or only filenames starting with A or B:
for fastq_file in data/raw/[AB]*fastq.gz; do
    # Some file processing, for example:
    echo "$fastq_file"
done