Variables, Globbing, and Loops
In this module, we will cover a few topics that are good to know about before you start writing and running shell scripts:
Using variables will allow you to run scripts flexibly, with different input files and settings.
for loops will allow you to repeat operations: specifically, we will later use them to submit many scripts at the same time, one per input file or sample.
We’ll be selecting files with wildcards (“globbing”) to loop over FASTQ files.
These are valuable skills in general: globbing is an essential technique in the Unix shell, and variables and for loops are ubiquitous programming concepts.
1 Setup
Starting a VS Code session with an active terminal:
Log in to OSC at https://ondemand.osc.edu.
In the blue top bar, select Interactive Apps and then Code Server.
In the form that appears:
- Enter 4 or more in the box Number of hours
- To avoid having to switch folders within VS Code, enter /fs/ess/scratch/PAS2250/participants/<your-folder> in the box Working Directory (replace <your-folder> by the actual name of your folder).
- Click Launch.
On the next page, once the top bar of the box is green and says Running, click Connect to VS Code.
Open a terminal: => Terminal => New Terminal.
In the terminal, type bash and press Enter.
Type pwd in the terminal to check you are in /fs/ess/scratch/PAS2250.
If not, click => File => Open Folder and enter /fs/ess/scratch/PAS2250/<your-folder>.
2 Variables
In programming, we use variables for things that:
- We refer to repeatedly and/or
- Are subject to change.
These tend to be settings like the paths to input and output files, and parameter values for programs.
Using variables makes it easier to change such settings. We also need to understand variables to work with loops and with scripts.
2.1 Assigning and referencing variables
To assign a value to a variable in Bash (in short: to assign a variable), use the syntax variable=value:
# Assign the value "beach" to the variable "location":
location=beach
# Assign the value "200" to the variable "nlines":
nlines=200
Be aware: do not put spaces around the equals sign (=)!
To reference a variable (i.e., to access its value), you need to put a dollar sign $ in front of its name. We’ll use the echo command to review the values that our variables contain:
echo simply prints back (“echoes”) whatever you tell it to:
echo Hello!
Hello!
echo $location
beach
echo $nlines
200
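As a small sketch of the no-spaces rule for assignment (reusing the illustrative variable names from above), compare a correct assignment with the incorrect variant, which is left commented out because it would fail:

```shell
# Correct: no spaces around the equals sign
location=beach
nlines=200
echo "$location" "$nlines"

# Incorrect (commented out): with spaces, the shell would try to run a
# command called "location" with the arguments "=" and "beach"
# location = beach
```

Uncommenting the last line should produce a "command not found" error, because the shell no longer recognizes the line as an assignment.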
Conveniently, we can directly use variables in lots of contexts, as if we had instead typed their values:
input_file=data/fastq/SRR7609467.fastq.gz
ls -lh $input_file
-rw-r--r-- 1 jelmer jelmer 8.3M Aug 16 13:45 data/fastq/SRR7609467.fastq.gz
ls_options="-lh" # (We'll talk about the quotes that are used here later)
ls $ls_options data/meta
total 4.0K
-rw-r--r-- 1 jelmer jelmer 583 Aug 16 10:36 meta.tsv
2.2 Rules and tips for naming variables
Variable names:
- Can contain letters, numbers, and underscores
- Cannot contain spaces, periods, or other special symbols
- Cannot start with a number
Try to make your variable names descriptive, like $input_file and $ls_options above, as opposed to say $x and $bla.
There are multiple ways of distinguishing words in the absence of spaces, such as $inputFile and $input_file: I prefer the latter, which is called “snake case”, and I always use lowercase.
2.3 Quoting variables
Above, we learned that a variable name cannot contain spaces. But what happens if our variable’s value contains spaces? First off, when we try to assign the variable without using quotes, we get an error:
today=Thu, Aug 18
Aug: command not found
Bash tried to assign everything up to the first space (i.e., Thu,) to today. After that, since we used a space, it assumed the next word (Aug) was something else: specifically, another command.
But it works when we quote (with double quotes, "...") the entire string that makes up the value:
today="Thu, Aug 18"
echo $today
Thu, Aug 18
Now, let’s try to reference this variable in another context. Note that the touch command can create new files, e.g. touch a.txt creates the file a.txt. So let’s try to make a new file with today’s date:
touch README_$today.txt
ls
18.txt
Aug
README_Thu,
The shell performed so-called field splitting using spaces as a separator, splitting the value into three separate units – as a result, three files were created.
Like with assignment, our problems can be avoided by quoting a variable when we reference it:
touch README_"$today".txt
# This will list the most recently modified file (ls -t sorts by last modified date):
ls -t | head -n 1
README_Thu, Aug 18.txt
It is good practice to quote variables when you reference them: it rarely hurts, and it avoids unpleasant surprises.
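To make field splitting visible without creating any files, here is a minimal sketch that counts how many separate arguments a command receives. The helper function count_args is hypothetical and defined only for this illustration:

```shell
# Hypothetical helper: print the number of arguments it was given
count_args() {
    echo "$#"
}

today="Thu, Aug 18"

n_unquoted=$(count_args $today)     # Unquoted: the value is split on spaces
n_quoted=$(count_args "$today")     # Quoted: the value stays one unit

echo "Unquoted: $n_unquoted arguments"   # Unquoted: 3 arguments
echo "Quoted: $n_quoted argument"        # Quoted: 1 argument
```

This mirrors what happened with touch: the unquoted variable was passed on as three separate words, and the quoted one as a single word.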
Another issue we can run into when we don’t quote variables is that we can’t explicitly define where a variable name ends within a longer string of text:
echo README_$today_final.txt
README_.txt
Following a $, the shell will stop interpreting characters as being part of the variable name only when it encounters a character that cannot be part of a variable name, such as a space or a period. Since variable names can contain underscores, it will look for the variable $today_final, which does not exist. Importantly, the shell does not error out when you reference a non-existing variable: it basically ignores it, such that README_$today_final.txt becomes README_.txt, as if we hadn’t referenced any variable.
Quoting solves this issue, too:
echo README_"$today"_final.txt
README_Thu, Aug 18_final.txt
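An alternative you will often see, shown here as a brief sketch, is to wrap the variable name in curly braces, as in ${today}, which marks exactly where the name ends. Braces alone do not protect against field splitting, so the example below still uses double quotes around the whole string (the variable file_name is just an illustrative name):

```shell
today="Thu, Aug 18"

# The braces tell the shell that the variable name is "today", not "today_final"
file_name="README_${today}_final.txt"
echo "$file_name"    # README_Thu, Aug 18_final.txt
```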
By double-quoting a variable, we are essentially escaping (or “turning off”) the default special meaning of the space as a separator, and are asking the shell to interpret it as a literal space.
Similarly, we are escaping other “special characters”, such as globbing wildcards, with double quotes. Compare:
echo * # This will echo/list all files in the current working dir (!)
18.txt Aug data README_Thu, README_Thu, Aug 18.txt sandbox scripts
echo "*" # This will simply print the "*" character
*
However, as we saw above, double quotes do not turn off the special meaning of $ (denoting a string as a variable):
echo "$today"
Thu, Aug 18
…but single quotes will:
echo '$today'
$today
2.4 Command substitution
If you want to store the result of a command in a variable, you can use a construct called “command substitution” by wrapping the command inside $().
Let’s see an example. The date command will print the current date and time:
date
Wed Aug 24 08:59:02 PM CEST 2022
If we try to store the date in a variable directly, it doesn’t work: the literal string “date” is stored, not the output of the command:
today=date
echo "$today"
date
That’s why we need command substitution with $():
today=$(date)
echo "$today"
Wed Aug 24 08:59:02 PM CEST 2022
In practice, you might use command substitution with date to include the current date in file names. To do so, first note that we can use date +%F to print the date in YYYY-MM-DD format, and omit the time:
date +%F
2022-08-24
Let’s use that in a command substitution, but a bit differently than before: we use the command substitution $(date +%F) directly in our touch command, rather than first assigning it to a variable:
# Create a file with the current date in its name:
touch README_"$(date +%F)".txt
# Check the name of our newly created file:
ls -t | head -n 1
README_2022-08-24.txt
Among many other uses, command substitution is handy when you want your script to report some results, or when a next step in the script depends on a previous result.
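As a sketch of how a later step can build on an earlier result, the example below stores the date once and reuses it to construct a file name. The names run_date and log_file, and the analysis_ prefix, are assumptions made up for this illustration:

```shell
# Store the current date (YYYY-MM-DD) once...
run_date=$(date +%F)

# ...and reuse it later; the period ends the variable name, so no braces are needed
log_file="analysis_$run_date.log"

echo "Results will be written to $log_file"
```

Storing the value in a variable, rather than calling date again, guarantees that every later use refers to the same date even if the script runs past midnight.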
On Your Own: Command substitution
Say we wanted to store and report the number of lines in a file, which can be a good QC measure for FASTQ and other genomic data files.
wc -l gets you the number of lines, and you can use a trick to omit the filename:
wc -l data/fastq/SRR7609472.fastq.gz
30387 data/fastq/SRR7609472.fastq.gz
# Use `<` (input redirection) to omit the filename:
wc -l < data/fastq/SRR7609472.fastq.gz
30387
Use command substitution to store the output of the last command in a variable, and then use an echo command to print:
The file has 30387 lines
nlines=$(wc -l < data/fastq/SRR7609472.fastq.gz)
echo "The file has $nlines lines"
The file has 30387 lines
Note: You don’t have to quote variables inside a quoted echo call, since they are, well, already quoted. If you also quote the variables, you will in fact unquote them, although that shouldn’t pose a problem inside echo statements.
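One caveat, offered as a side note rather than as part of the exercise: wc -l on a .gz file counts newline characters in the compressed data, not lines of the original text. Below is a minimal sketch of counting the lines of the decompressed content instead. The temporary directory and the tiny made-up FASTQ record are assumptions for illustration only:

```shell
# Create a small example FASTQ file (one read = 4 lines) in a temporary directory
tmp_dir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n' > "$tmp_dir"/example.fastq
gzip "$tmp_dir"/example.fastq        # Creates example.fastq.gz

# Decompress to standard output (-c) and count the lines of the actual text
nlines=$(gzip -cd "$tmp_dir"/example.fastq.gz | wc -l)
echo "The decompressed file has $nlines lines"
```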
2.5 At-home reading: Environment variables
There are also predefined variables in the Unix shell: that is, variables that exist in your environment by default. These so-called “environment variables” are, by convention, spelled in all-caps:
# Environment variable $USER contains your user name
echo $USER
jelmer
# Environment variable $HOME contains the path to your home directory
echo $HOME
/users/PAS0471/jelmer
Environment variables can provide useful information. They can especially come in handy in scripts submitted to the Slurm compute job scheduler.
3 Globbing with Shell wildcard expansion
Shell wildcard expansion is a very useful technique to select files. Selecting files with wildcard expansion is called globbing.
3.1 Shell wildcards
In the term “wildcard expansion”, wildcard refers to a few symbols that have a special meaning: specifically, they match certain characters in file names. We’ll see below what expansion refers to.
Here, we’ll only talk about the most-used wildcard, *, in detail. But for the sake of completeness, I list them all below:
Wildcard | Matches |
---|---|
* | Any number of any character, including nothing |
? | Any single character |
[] and [^] | One ([]) or everything except one ([^]) of the “character set” within brackets |
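Since only * is covered in detail, here is a brief sketch of the other wildcards in action. It runs in a temporary directory with made-up file names (both are assumptions for illustration), so it does not touch your own files:

```shell
# Work in a temporary directory with a few hypothetical files
tmp_dir=$(mktemp -d)
cd "$tmp_dir"
touch sampleA1.txt sampleA2.txt sampleB1.txt sampleC10.txt

# ? matches exactly one character
single_char=$(echo sample?1.txt)
echo "$single_char"    # sampleA1.txt sampleB1.txt

# [AB] matches one character that is either A or B
set_ab=$(echo sample[AB]*.txt)
echo "$set_ab"         # sampleA1.txt sampleA2.txt sampleB1.txt

# [^A] matches one character that is anything except A
not_a=$(echo sample[^A]*.txt)
echo "$not_a"          # sampleB1.txt sampleC10.txt
```

Note that sampleC10.txt is not matched by sample?1.txt: the ? stands for the C, after which the pattern requires 1.txt, but the file name continues with 10.txt.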
3.2 The * wildcard and wildcard expansion
A first example of using * is to match all files in a directory:
ls data/fastq/*
data/fastq/SRR7609467.fastq.gz
data/fastq/SRR7609468.fastq.gz
data/fastq/SRR7609469.fastq.gz
data/fastq/SRR7609470.fastq.gz
data/fastq/SRR7609471.fastq.gz
data/fastq/SRR7609472.fastq.gz
data/fastq/SRR7609473.fastq.gz
data/fastq/SRR7609474.fastq.gz
data/fastq/SRR7609475.fastq.gz
data/fastq/SRR7609476.fastq.gz
data/fastq/SRR7609477.fastq.gz
data/fastq/SRR7609478.fastq.gz
Of course ls data/fastq would have shown the same files, but what happens under the hood is different:
- ls data/fastq: The ls command detects and lists all files in the directory.
- ls data/fastq/*: The wildcard * is expanded to all matching files (in this case, all the files in this directory), and then that list of files is passed to ls. This command is therefore equivalent to running:
ls data/fastq/SRR7609467.fastq.gz data/fastq/SRR7609468.fastq.gz data/fastq/SRR7609469.fastq.gz data/fastq/SRR7609470.fastq.gz data/fastq/SRR7609471.fastq.gz data/fastq/SRR7609472.fastq.gz data/fastq/SRR7609473.fastq.gz data/fastq/SRR7609474.fastq.gz data/fastq/SRR7609475.fastq.gz data/fastq/SRR7609476.fastq.gz data/fastq/SRR7609477.fastq.gz data/fastq/SRR7609478.fastq.gz
To see this, note that we don’t need to use ls at all to get a listing of these files!
echo data/fastq/*
data/fastq/SRR7609467.fastq.gz data/fastq/SRR7609468.fastq.gz data/fastq/SRR7609469.fastq.gz data/fastq/SRR7609470.fastq.gz data/fastq/SRR7609471.fastq.gz data/fastq/SRR7609472.fastq.gz data/fastq/SRR7609473.fastq.gz data/fastq/SRR7609474.fastq.gz data/fastq/SRR7609475.fastq.gz data/fastq/SRR7609476.fastq.gz data/fastq/SRR7609477.fastq.gz data/fastq/SRR7609478.fastq.gz
A few more examples:
# This will still list all 12 FASTQ files --
# can be a good pattern to use to make sure you're not selecting other types of files
ls data/fastq/*fastq.gz
data/fastq/SRR7609467.fastq.gz
data/fastq/SRR7609468.fastq.gz
data/fastq/SRR7609469.fastq.gz
data/fastq/SRR7609470.fastq.gz
data/fastq/SRR7609471.fastq.gz
data/fastq/SRR7609472.fastq.gz
data/fastq/SRR7609473.fastq.gz
data/fastq/SRR7609474.fastq.gz
data/fastq/SRR7609475.fastq.gz
data/fastq/SRR7609476.fastq.gz
data/fastq/SRR7609477.fastq.gz
data/fastq/SRR7609478.fastq.gz
# Only select the ...67.fastq.gz, ...68.fastq.gz, and ...69.fastq.gz files
ls data/fastq/SRR760946*fastq.gz
data/fastq/SRR7609467.fastq.gz
data/fastq/SRR7609468.fastq.gz
data/fastq/SRR7609469.fastq.gz
What pattern would you use to select gzipped FASTQ files (.fastq.gz) and plain FASTQ files (.fastq) at the same time?
ls data/fastq/SRR760946*.fastq*
The second * will match filenames with nothing after .fastq as well as filenames with characters after .fastq, such as .gz.
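The tutorial's data directory only contains gzipped files, so here is a small sketch that demonstrates this pattern on made-up files in a temporary directory (the directory and file names are assumptions for illustration):

```shell
tmp_dir=$(mktemp -d)
cd "$tmp_dir"
touch sample1.fastq sample2.fastq.gz sample3.txt

# The trailing * matches "nothing" for sample1.fastq and ".gz" for sample2.fastq.gz
matched=$(echo sample*.fastq*)
echo "$matched"    # sample1.fastq sample2.fastq.gz
```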
3.3 Common uses of globbing
What can we use this for, other than listing matching files? Below, we’ll use globbing to select files to loop over. Even more commonly, we can use this to move (mv), copy (cp), or remove (rm) multiple files at once. For example:
cp data/fastq/SRR760946* . # Copy 3 FASTQ files to your working dir
ls *fastq.gz # Check if they're here
SRR7609467.fastq.gz
SRR7609468.fastq.gz
SRR7609469.fastq.gz
rm *fastq.gz # Remove all FASTQ files in your working dir
ls *fastq.gz # Check if they're here
ls: cannot access ’*fastq.gz’: No such file or directory
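The error message above also reveals what happens when a pattern matches nothing: by default, Bash leaves the pattern as it is and passes it on literally, which is why ls complained about a file named *fastq.gz. A minimal sketch in an empty temporary directory (an assumption for illustration):

```shell
tmp_dir=$(mktemp -d)
cd "$tmp_dir"

# No files match, so with Bash's default settings the pattern is not expanded
no_match=$(echo *fastq.gz)
echo "$no_match"    # *fastq.gz
```

This is worth keeping in mind for loops over globbing patterns: if nothing matches, the loop would run once, with the literal pattern as its only item.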
Finally, let’s use globbing to remove the mess of files we made when learning about variables:
rm README_*
rm Aug 18.txt
For those of you who know some regular expressions: these are conceptually similar to wildcards, but the * and ? symbols don’t have the same meaning, and there are way fewer shell wildcards than regular expression symbols.
In particular, note that . is not a shell wildcard and thus represents a literal period.
4 For loops
Loops are a universal element of programming languages, and are used to repeat operations, such as when you want to run the same script or command for multiple files.
Here, we’ll only cover what is by far the most common type of loop: the for loop.
for loops iterate over a collection, such as a list of files: that is, they allow you to perform one or more actions for each element in the collection, one element at a time.
4.1 for loop syntax and mechanics
Let’s see a first example, where our “collection” is just a very short list of numbers (1, 2, and 3):
for a_number in 1 2 3; do
    echo "In this iteration of the loop, the number is $a_number"
    echo "--------"
done
In this iteration of the loop, the number is 1
--------
In this iteration of the loop, the number is 2
--------
In this iteration of the loop, the number is 3
--------
for loops contain the following mandatory keywords:
Keyword | Purpose |
---|---|
for | After for, we set the variable name |
in | After in, we specify the collection we are looping over |
do | After do, we have one or more lines specifying what to do with each item |
done | Tells the shell we are done with the loop |
; (as used before do) separates two commands on a single line
A semicolon separates two commands written on a single line – for instance, instead of:
mkdir results
cd results
…you could equivalently type:
mkdir results; cd results
The ; in the for loop syntax has the same function, and as such, an alternative way to format a for loop is:
for a_number in 1 2 3
do
    echo "In this iteration of the loop, the number is $a_number"
done
But that’s one line longer and a bit awkwardly asymmetric.
The aspect that is perhaps most difficult to understand is that in each iteration of the loop, one element in the collection (in the example above, either 1, 2, or 3) is being assigned to the variable specified after for (in the example above, a_number).
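One way to convince yourself that the loop really performs ordinary variable assignments is to check the variable after the loop has finished: as this small sketch shows, it still holds the value from the final iteration.

```shell
for a_number in 1 2 3; do
    echo "In this iteration of the loop, the number is $a_number"
done

# The loop variable is a regular variable, so it keeps its last assigned value
echo "After the loop, a_number is still set to: $a_number"    # ...set to: 3
```

In other words, the loop behaves as if we had typed a_number=1, then the echo command, then a_number=2, and so on.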
It is also important to realize that the loop runs sequentially for each item in the collection, and will run exactly as many times as there are items in the collection.
The following example, where we let the computer sleep for 1 second before printing the date and time with the date command, demonstrates that the loop is being executed sequentially:
for a_number in 1 2 3; do
    echo "In this iteration of the loop, the number is $a_number"
    sleep 1s          # Let the computer sleep for 1 second
    date              # Print the date and time
    echo "--------"
done
In this iteration of the loop, the number is 1
Wed Aug 24 08:59:03 PM CEST 2022
--------
In this iteration of the loop, the number is 2
Wed Aug 24 08:59:04 PM CEST 2022
--------
In this iteration of the loop, the number is 3
Wed Aug 24 08:59:05 PM CEST 2022
--------
On Your Own: A simple loop
Create a loop that will print:
morel is an Ohio mushroom
destroying_angel is an Ohio mushroom
eyelash_cup is an Ohio mushroom
Just like we looped over 3 numbers above (1, 2, and 3), you want to loop over the three mushroom names, morel, destroying_angel, and eyelash_cup. Notice that when we specify the collection “manually”, like we did above with numbers, the elements are simply separated by a space.
for mushroom in morel destroying_angel eyelash_cup; do
    echo "$mushroom is an Ohio mushroom"
done
morel is an Ohio mushroom
destroying_angel is an Ohio mushroom
eyelash_cup is an Ohio mushroom
4.2 Looping over files with globbing
In practice, we rarely manually list the collection of items we want to loop over. Instead, we commonly loop over files directly using globbing:
# We make sure we only select gzipped FASTQ files using the `*fastq.gz` glob
for fastq_file in data/raw/*fastq.gz; do
    echo "File $fastq_file has $(wc -l < "$fastq_file") lines."
    # More processing...
done
This technique is extremely useful, and I use it all the time. Take a moment to realize that we’re not doing a separate ls and storing the results: as mentioned, we can directly use a globbing pattern to select our files.
If needed, you can use your globbing / wildcard skills to narrow down the file selection:
# Perhaps we only want to select R1 files (forward reads):
for fastq_file in data/raw/*R1*fastq.gz; do
    echo "$fastq_file"  # Placeholder for some file processing...
done
# Or only filenames starting with A or B:
for fastq_file in data/raw/[AB]*fastq.gz; do
    echo "$fastq_file"  # Placeholder for some file processing...
done
With genomics data, the routine of looping over an entire directory of files, or selections made with simple globbing patterns, should serve you very well.
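Inside such a loop, it is often useful to derive a sample ID from each file name, for instance to name output files. Here is a hedged sketch using command substitution with basename, which strips the directory and, optionally, a given suffix. The temporary directory and the file names A1.fastq.gz and B6.fastq.gz are made up for illustration:

```shell
# Set up two hypothetical (empty) FASTQ files in a temporary directory
tmp_dir=$(mktemp -d)
touch "$tmp_dir"/A1.fastq.gz "$tmp_dir"/B6.fastq.gz

for fastq_file in "$tmp_dir"/*fastq.gz; do
    # Strip the directory and the .fastq.gz suffix to get the sample ID
    sample_id=$(basename "$fastq_file" .fastq.gz)
    echo "File $fastq_file belongs to sample $sample_id"
done
```

Because the files are processed in sorted order, sample_id is A1 in the first iteration and B6 in the second.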
But in some cases, you may want to iterate only over a specific list of filenames (or partial filenames such as sample IDs) that represent a complex selection.
If this is a short list, you could directly specify it in the loop:
for sample in A1 B6 D3; do
    R1=data/fastq/"$sample"_R1.fastq.gz
    R2=data/fastq/"$sample"_R2.fastq.gz
    # Some file processing...
done
If it is a longer list, you could create a simple text file with one line per sample ID / filename, and use command substitution as follows:
for fastq_file in $(cat file_of_filenames.txt); do
    echo "$fastq_file"  # Placeholder for some file processing...
done
In cases like this, Bash arrays (basically, variables that consist of multiple values, like a vector in R) or while loops may provide more elegant solutions, but those are outside the scope of this introduction.
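To try the text-file approach without any real data, here is a self-contained sketch. The file sample_ids.txt, its contents, and the temporary directory are assumptions for illustration, and the FASTQ paths are only printed, not accessed:

```shell
# Create a hypothetical text file with one sample ID per line
tmp_dir=$(mktemp -d)
printf 'A1\nB6\nD3\n' > "$tmp_dir"/sample_ids.txt

# The unquoted command substitution is split into separate items on whitespace
for sample in $(cat "$tmp_dir"/sample_ids.txt); do
    R1=data/fastq/"$sample"_R1.fastq.gz
    echo "Sample $sample: forward reads expected at $R1"
done
```

This relies on the field splitting discussed earlier, so it only works as intended when the entries in the text file contain no spaces.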