Variables, Globbing, and Loops

Author

Jelmer Poelstra


In this module, we will cover three topics that are good to know about before you start writing and running shell scripts: variables, globbing, and for loops.

These are valuable skills in general: globbing is an essential technique in the Unix shell, and variables and for loops are ubiquitous programming concepts.


1 Setup

Starting a VS Code session with an active terminal:

  1. Log in to OSC at https://ondemand.osc.edu.

  2. In the blue top bar, select Interactive Apps and then Code Server.

  3. In the form that appears:

    • Enter 4 or more in the box Number of hours
    • To avoid having to switch folders within VS Code, enter /fs/ess/scratch/PAS2250/participants/<your-folder> in the box Working Directory (replace <your-folder> by the actual name of your folder).
    • Click Launch.
  4. On the next page, once the top bar of the box is green and says Running, click Connect to VS Code.

  5. Open a terminal:   => Terminal => New Terminal.

  6. In the terminal, type bash and press Enter.

  7. Type pwd in the terminal to check that you are in /fs/ess/scratch/PAS2250.

    If not, click   =>   File   =>   Open Folder and enter /fs/ess/scratch/PAS2250/<your-folder>.


2 Variables

In programming, we use variables for things that:

  • We refer to repeatedly and/or
  • Are subject to change.

These tend to be settings like the paths to input and output files, and parameter values for programs.

Using variables makes it easier to change such settings. We also need to understand variables to work with loops and with scripts.

2.1 Assigning and referencing variables

To assign a value to a variable in Bash (in short: to assign a variable), use the syntax variable=value:

# Assign the value "beach" to the variable "location":
location=beach

# Assign the value "200" to the variable "nlines":
nlines=200
Be aware: don’t put spaces around the equals sign (=)!
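
To see why the spaces matter, here is a minimal sketch (the variable name outdir and its values are hypothetical, chosen only for illustration). With spaces around the equals sign, Bash reads the first word as the name of a command and the rest as its arguments, so no assignment takes place:

```bash
# Correct: no spaces around "="
outdir=results
echo "$outdir"

# Incorrect: Bash looks for a command called "outdir" and fails.
# The part after "||" runs only because that attempt failed.
outdir = other_results || echo "The assignment with spaces failed"

# The variable still holds its original value
echo "$outdir"
```

The second command should produce a "command not found" error, much like the one we will see below when assigning a value that contains spaces.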

To reference a variable (i.e., to access its value), you need to put a dollar sign $ in front of its name. We’ll use the echo command to review the values that our variables contain:

echo simply prints back (“echoes”) whatever you tell it to:
echo Hello!
Hello!
echo $location
beach
echo $nlines
200

Conveniently, we can directly use variables in lots of contexts, as if we had instead typed their values:

input_file=data/fastq/SRR7609467.fastq.gz

ls -lh $input_file 
-rw-r--r-- 1 jelmer jelmer 8.3M Aug 16 13:45 data/fastq/SRR7609467.fastq.gz
ls_options="-lh"            # (We'll talk about the quotes that are used here later)

ls $ls_options data/meta
total 4.0K
-rw-r--r-- 1 jelmer jelmer 583 Aug 16 10:36 meta.tsv


2.2 Rules and tips for naming variables

Variable names:

  • Can contain letters, numbers, and underscores
  • Cannot contain spaces, periods, or other special symbols
  • Cannot start with a number

Try to make your variable names descriptive, like $input_file and $ls_options above, as opposed to say $x and $bla.

There are multiple ways of distinguishing words in the absence of spaces, such as $inputFile and $input_file: I prefer the latter, which is called “snake case”, and I always use lowercase.


2.3 Quoting variables

Above, we learned that a variable name cannot contain spaces. But what happens if our variable’s value contains spaces? First off, when we try to assign the variable without using quotes, we get an error:

today=Thu, Aug 18

Aug: command not found

Bash tried to assign everything up to the first space (i.e., Thu,) to today. After that, since we used a space, it assumed the next word (Aug) was something else: specifically, another command.

But it works when we quote (with double quotes, "...") the entire string that makes up the value:

today="Thu, Aug 18"
echo $today
Thu, Aug 18

Now, let’s try to reference this variable in another context. Note that the touch command can create new files, e.g. touch a.txt creates the file a.txt. So let’s try to make a new file with today’s date:

touch README_$today.txt
ls

18.txt
Aug
README_Thu,

The shell performed so-called field splitting using spaces as a separator, splitting the value into three separate units – as a result, three files were created.

Like with assignment, our problems can be avoided by quoting a variable when we reference it:

touch README_"$today".txt

# This will list the most recently modified file (ls -t sorts by last modified date):
ls -t | head -n 1
README_Thu, Aug 18.txt

It is good practice to quote variables when you reference them: it never hurts, and avoids unexpected surprises.
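
One way to make field splitting visible is to count how many separate arguments a command actually receives. The sketch below assumes a small helper function, count_args, which is hypothetical and defined here only for this demonstration:

```bash
# Hypothetical helper function: prints how many arguments it received
count_args() {
    echo "$#"
}

today="Thu, Aug 18"

# Unquoted: the value is split at the spaces into Thu, / Aug / 18
count_args $today       # prints 3

# Quoted: the value stays together as a single argument
count_args "$today"     # prints 1
```

This is exactly what happened with touch above: it received three arguments and therefore created three files.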

Another issue we can run into when we don’t quote variables is that we can’t explicitly define where a variable name ends within a longer string of text:

echo README_$today_final.txt
README_.txt
  • Following a $, the shell will stop interpreting characters as being part of the variable name only when it encounters a character that cannot be part of a variable name, such as a space or a period.

  • Since variable names can contain underscores, it will look for the variable $today_final, which does not exist.

  • Importantly, the shell does not error out when you reference a non-existing variable – it basically ignores it, such that README_$today_final.txt becomes README_.txt, as if we hadn’t referenced any variable.

Quoting solves this issue, too:

echo README_"$today"_final.txt
README_Thu, Aug 18_final.txt

By double-quoting a variable, we are essentially escaping (or “turning off”) the default special meaning of the space as a separator, and are asking the shell to interpret it as a literal space.

Similarly, we are escaping other “special characters”, such as globbing wildcards, with double quotes. Compare:

echo *     # This will echo/list all files in the current working dir (!)
18.txt Aug data README_Thu, README_Thu, Aug 18.txt sandbox scripts
echo "*"   # This will simply print the "*" character 
*

However, as we saw above, double quotes do not turn off the special meaning of $ (denoting a string as a variable):

echo "$today"
Thu, Aug 18

…but single quotes will:

echo '$today'
$today
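
Returning to the problem of marking where a variable name ends: besides quoting, Bash also accepts curly braces around the name, as in ${today}. This is standard Bash syntax that was not used above, so treat the following as a supplementary sketch (the variable file_name is made up for the example):

```bash
today="Thu, Aug 18"

# The braces mark exactly where the variable name ends,
# so "_final" is no longer read as part of the name
file_name="README_${today}_final.txt"

echo "$file_name"
```

Braces only delimit the name. They do not prevent field splitting, so double quotes are still needed whenever the value may contain spaces.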


2.4 Command substitution

If you want to store the result of a command in a variable, you can use a construct called “command substitution” by wrapping the command inside $().

Let’s see an example. The date command will print the current date and time:

date
Wed Aug 24 08:59:02 PM CEST 2022

If we try to store the date in a variable directly, it doesn’t work: the literal string “date” is stored, not the output of the command:

today=date
echo "$today"
date

That’s why we need command substitution with $():

today=$(date)
echo "$today"
Wed Aug 24 08:59:02 PM CEST 2022

In practice, you might use command substitution with date to include the current date in files. To do so, first, note that we can use date +%F to print the date in YYYY-MM-DD format, and omit the time:

date +%F
2022-08-24

Let’s use that in a command substitution — but a bit differently than before: we use the command substitution $(date +%F) directly in our touch command, rather than first assigning it to a variable:

# Create a file whose name includes the current date:
touch README_"$(date +%F)".txt

# Check the name of our newly created file:
ls -t | head -n 1
README_2022-08-24.txt

Among many other uses, command substitution is handy when you want your script to report some results, or when a next step in the script depends on a previous result.
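
As a small sketch of such reporting, the example below stores two command outputs in variables and combines them in one message. The temporary directory and the file names are assumptions made purely so that the example can run anywhere:

```bash
# Create a temporary practice directory with three empty files
workdir=$(mktemp -d)
touch "$workdir"/a.txt "$workdir"/b.txt "$workdir"/c.txt

# Store the number of files and the current date in variables
n_files=$(ls "$workdir" | wc -l)
run_date=$(date +%F)

# Report both results in a single message
echo "On $run_date, the directory contained $n_files files"
```

Because the results are stored in variables, they can be reused later in a script without rerunning the commands.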

On Your Own: Command substitution

Say we wanted to store and report the number of lines in a file, which can be a good QC measure for FASTQ and other genomic data files.

wc -l gets you the number of lines, and you can use a trick to omit the filename:

wc -l data/fastq/SRR7609472.fastq.gz
30387 data/fastq/SRR7609472.fastq.gz
# Use `<` (input redirection) to omit the filename:
wc -l < data/fastq/SRR7609472.fastq.gz
30387

Use command substitution to store the output of the last command in a variable, and then use an echo command to print:

The file has 30387 lines
nlines=$(wc -l < data/fastq/SRR7609472.fastq.gz)

echo "The file has $nlines lines"
The file has 30387 lines

Note: You don’t have to quote variables separately inside an already quoted echo string, since they are, well, already quoted. If you do add another pair of double quotes around a variable there, you will in fact unquote it, although that shouldn’t pose a problem inside echo statements.


2.5 At-home reading: Environment variables

There are also predefined variables in the Unix shell: that is, variables that exist in your environment by default. These so-called “environment variables” are, by convention, spelled in all-caps:

# Environment variable $USER contains your user name 
echo $USER
jelmer
# Environment variable $HOME contains the path to your home directory
echo $HOME

/users/PAS0471/jelmer

Environment variables can provide useful information. They can especially come in handy in scripts submitted to the Slurm compute job scheduler.


3 Globbing with Shell wildcard expansion

Shell wildcard expansion is a very useful technique to select files. Selecting files with wildcard expansion is called globbing.

3.1 Shell wildcards

In the term “wildcard expansion”, wildcard refers to a few symbols that have a special meaning: specifically, they match certain characters in file names. We’ll see below what expansion refers to.

Here, we’ll only talk about the most-used wildcard, *, in detail. But for the sake of completeness, I list them all below:

Wildcard      Matches
*             Any number of any character, including nothing
?             Any single character
[] and [^]    One character from ([]), or one character not in ([^]), the “character set” within the brackets
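
Since only * is covered in detail here, the following sketch tries out the other wildcards on a few throwaway files. The file names and the temporary directory are made up for the example:

```bash
# Work in a temporary directory with four practice files
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch sampleA.txt sampleB.txt sampleC.txt sample10.txt

# "?" matches exactly one character, so sample10.txt is not selected
echo sample?.txt        # sampleA.txt sampleB.txt sampleC.txt

# "[AB]" matches one character that is either A or B
echo sample[AB].txt     # sampleA.txt sampleB.txt

# "[^AB]" matches one character that is neither A nor B
echo sample[^AB].txt    # sampleC.txt

# "*" matches any number of characters, so all four files are selected
echo sample*.txt
```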


3.2 The * wildcard and wildcard expansion

As a first example of using *, let’s match all files in a directory:

ls data/fastq/*
data/fastq/SRR7609467.fastq.gz
data/fastq/SRR7609468.fastq.gz
data/fastq/SRR7609469.fastq.gz
data/fastq/SRR7609470.fastq.gz
data/fastq/SRR7609471.fastq.gz
data/fastq/SRR7609472.fastq.gz
data/fastq/SRR7609473.fastq.gz
data/fastq/SRR7609474.fastq.gz
data/fastq/SRR7609475.fastq.gz
data/fastq/SRR7609476.fastq.gz
data/fastq/SRR7609477.fastq.gz
data/fastq/SRR7609478.fastq.gz

Of course ls data/fastq would have shown the same files, but what happens under the hood is different:

  • ls data/fastq: The ls command detects and lists all files in the directory

  • ls data/fastq/*: The wildcard * is expanded to all matching files (in this case, all the files in this directory), and then that list of files is passed to ls. This command is therefore equivalent to running:

    ls data/fastq/SRR7609467.fastq.gz data/fastq/SRR7609468.fastq.gz data/fastq/SRR7609469.fastq.gz data/fastq/SRR7609470.fastq.gz data/fastq/SRR7609471.fastq.gz data/fastq/SRR7609472.fastq.gz data/fastq/SRR7609473.fastq.gz data/fastq/SRR7609474.fastq.gz data/fastq/SRR7609475.fastq.gz data/fastq/SRR7609476.fastq.gz data/fastq/SRR7609477.fastq.gz data/fastq/SRR7609478.fastq.gz

To see this, note that we don’t need to use ls at all to get a listing of these files!

echo data/fastq/*
data/fastq/SRR7609467.fastq.gz data/fastq/SRR7609468.fastq.gz data/fastq/SRR7609469.fastq.gz data/fastq/SRR7609470.fastq.gz data/fastq/SRR7609471.fastq.gz data/fastq/SRR7609472.fastq.gz data/fastq/SRR7609473.fastq.gz data/fastq/SRR7609474.fastq.gz data/fastq/SRR7609475.fastq.gz data/fastq/SRR7609476.fastq.gz data/fastq/SRR7609477.fastq.gz data/fastq/SRR7609478.fastq.gz
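
A related detail worth knowing is what happens when a pattern matches nothing. The sketch below uses a temporary directory and made-up file names, and assumes Bash's default settings:

```bash
# Create a temporary directory with two practice files
demo_dir=$(mktemp -d)
touch "$demo_dir"/reads_1.fastq.gz "$demo_dir"/reads_2.fastq.gz

# The shell replaces the pattern with the matching file names,
# so echo receives two separate arguments here
echo "$demo_dir"/*.fastq.gz

# If nothing matches, Bash by default leaves the pattern unchanged,
# so echo simply prints the pattern itself, asterisk included
echo "$demo_dir"/*.bam
```

This is why a command like ls can end up complaining about a file name that literally contains an asterisk: it was handed the unexpanded pattern.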

A few more examples:

# This will still list all 12 FASTQ files --
# can be a good pattern to use to make sure you're not selecting other types of files 
ls data/fastq/*fastq.gz
data/fastq/SRR7609467.fastq.gz
data/fastq/SRR7609468.fastq.gz
data/fastq/SRR7609469.fastq.gz
data/fastq/SRR7609470.fastq.gz
data/fastq/SRR7609471.fastq.gz
data/fastq/SRR7609472.fastq.gz
data/fastq/SRR7609473.fastq.gz
data/fastq/SRR7609474.fastq.gz
data/fastq/SRR7609475.fastq.gz
data/fastq/SRR7609476.fastq.gz
data/fastq/SRR7609477.fastq.gz
data/fastq/SRR7609478.fastq.gz
# Only select the ...67.fastq.gz, ...68.fastq.gz, and ...69.fastq.gz files 
ls data/fastq/SRR760946*fastq.gz
data/fastq/SRR7609467.fastq.gz
data/fastq/SRR7609468.fastq.gz
data/fastq/SRR7609469.fastq.gz
ls data/fastq/SRR760946*.fastq*

The second * will match filenames with nothing after .fastq as well as file names with characters after .fastq, such as .gz.


3.3 Common uses of globbing

What can we use this for, other than listing matching files? Below, we’ll use globbing to select files to loop over. Even more commonly, we can use this to move (mv), copy (cp), or remove (rm) multiple files at once. For example:

cp data/fastq/SRR760946* .     # Copy 3 FASTQ files to your working dir 
ls *fastq.gz                   # Check if they're here
SRR7609467.fastq.gz
SRR7609468.fastq.gz
SRR7609469.fastq.gz
rm *fastq.gz                  # Remove all FASTQ files in your working dir
ls *fastq.gz                  # Check if they're here

ls: cannot access '*fastq.gz': No such file or directory

Finally, let’s use globbing to remove the mess of files we made when learning about variables:

rm README_*
rm Aug 18.txt
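
Because rm deletes files permanently, it can be worth previewing what a glob will expand to before running the real command. One possible habit, sketched here with made-up file names in a temporary directory, is to put echo in front of the command first:

```bash
# Work in a temporary directory with three practice files
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch README_a.txt README_b.txt notes.txt

# Preview: echo prints the command as it would look after expansion
echo rm README_*        # rm README_a.txt README_b.txt

# That looks right, so now remove the files for real
rm README_*

# Only notes.txt should be left
ls
```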

For those of you who know some regular expressions: these are conceptually similar to wildcards, but the * and ? symbols don’t have the same meaning, and there are way fewer shell wildcards than regular expression symbols.

In particular, note that . is not a shell wildcard and thus represents a literal period.


4 For loops

Loops are a universal element of programming languages, and are used to repeat operations, such as when you want to run the same script or command for multiple files.

Here, we’ll only cover what is by far the most common type of loop: the for loop.

for loops iterate over a collection, such as a list of files: that is, they allow you to perform one or more actions for each element in the collection, one element at a time.

4.1 for loop syntax and mechanics

Let’s see a first example, where our “collection” is just a very short list of numbers (1, 2, and 3):

for a_number in 1 2 3; do
    echo "In this iteration of the loop, the number is $a_number"
    echo "--------"
done
In this iteration of the loop, the number is 1
--------
In this iteration of the loop, the number is 2
--------
In this iteration of the loop, the number is 3
--------

for loops contain the following mandatory keywords:

Keyword   Purpose
for       After for, we set the variable name
in        After in, we specify the collection we are looping over
do        After do, we have one or more lines specifying what to do with each item
done      Tells the shell we are done with the loop

A semicolon separates two commands written on a single line – for instance, instead of:

mkdir results
cd results

…you could equivalently type:

mkdir results; cd results

The ; in the for loop syntax has the same function, and as such, an alternative way to format a for loop is:

for a_number in 1 2 3
do
    echo "In this iteration of the loop, the number is $a_number"
done

But that’s one line longer and a bit awkwardly asymmetric.

The aspect that is perhaps most difficult to understand is that in each iteration of the loop, one element in the collection (in the example above, either 1, 2, or 3) is being assigned to the variable specified after for (in the example above, a_number).
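
It may help to write out by hand what the loop does. In the sketch below, the first part performs the assignments manually, and the loop after it produces the same three lines of output:

```bash
# Written out by hand, the loop does the equivalent of this:
a_number=1
echo "The number is $a_number"
a_number=2
echo "The number is $a_number"
a_number=3
echo "The number is $a_number"

# The same thing as a for loop:
for a_number in 1 2 3; do
    echo "The number is $a_number"
done

# After the loop has finished, the variable still holds the last item
echo "After the loop: $a_number"
```

The last line prints 3, which shows that the loop variable is an ordinary variable that was simply reassigned in each iteration.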


It is also important to realize that the loop runs sequentially for each item in the collection, and will run exactly as many times as there are items in the collection.

The following example, where we let the computer sleep for 1 second before printing the date and time with the date command, demonstrates that the loop is being executed sequentially:

for a_number in 1 2 3; do
    echo "In this iteration of the loop, the number is $a_number"
    sleep 1s          # Let the computer sleep for 1 second
    date              # Print the date and time
    echo "--------"
done
In this iteration of the loop, the number is 1
Wed Aug 24 08:59:03 PM CEST 2022
--------
In this iteration of the loop, the number is 2
Wed Aug 24 08:59:04 PM CEST 2022
--------
In this iteration of the loop, the number is 3
Wed Aug 24 08:59:05 PM CEST 2022
--------

On Your Own: A simple loop

Create a loop that will print:

morel is an Ohio mushroom  
destroying_angel is an Ohio mushroom  
eyelash_cup is an Ohio mushroom
  • Just like we looped over 3 numbers above (1, 2, and 3), you want to loop over the three mushroom names, morel, destroying_angel, and eyelash_cup.

  • Notice that when we specify the collection “manually”, like we did above with numbers, the elements are simply separated by a space.

for mushroom in morel destroying_angel eyelash_cup; do
    echo "$mushroom is an Ohio mushroom"
done
morel is an Ohio mushroom
destroying_angel is an Ohio mushroom
eyelash_cup is an Ohio mushroom


4.2 Looping over files with globbing

In practice, we rarely manually list the collection of items we want to loop over. Instead, we commonly loop over files directly using globbing:

# We make sure we only select gzipped FASTQ files using the `*fastq.gz` glob
for fastq_file in data/raw/*fastq.gz; do
    echo "File $fastq_file has $(wc -l < "$fastq_file") lines."
    # More processing...
done

This technique is extremely useful, and I use it all the time. Take a moment to realize that we’re not doing a separate ls and storing the results: as mentioned, we can directly use a globbing pattern to select our files.
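
To try this pattern without needing any FASTQ files, here is a self-contained sketch that creates two small text files in a temporary directory and loops over them. The file names and contents are made up for the example:

```bash
# Create a temporary directory with two small practice files
demo_dir=$(mktemp -d)
printf 'a\nb\n' > "$demo_dir"/sample1.txt
printf 'a\nb\nc\n' > "$demo_dir"/sample2.txt

# Loop over all .txt files in the directory and count their lines
for text_file in "$demo_dir"/*.txt; do
    nlines=$(wc -l < "$text_file")
    echo "File $text_file has $nlines lines"
done
```

The glob is expanded once, before the loop starts, so the loop runs one time for each matching file.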

If needed, you can use your globbing/wildcard skills to narrow down the file selection:

# Perhaps we only want to select R1 files (forward reads): 
for fastq_file in data/raw/*R1*fastq.gz; do
    # Some file processing...
done

# Or only filenames starting with A or B:
for fastq_file in data/raw/[AB]*fastq.gz; do
    # Some file processing...
done

With genomics data, the routine of looping over an entire directory of files, or selections made with simple globbing patterns, should serve you very well.

But in some cases, you may want to iterate only over a specific list of filenames (or partial filenames such as sample IDs) that represent a complex selection.

  • If this is a short list, you could directly specify it in the loop:

    for sample in A1 B6 D3; do
        R1=data/fastq/"$sample"_R1.fastq.gz
        R2=data/fastq/"$sample"_R2.fastq.gz
        # Some file processing...
    done
  • If it is a longer list, you could create a simple text file with one line per sample ID / filename, and use command substitution as follows:

    for fastq_file in $(cat file_of_filenames.txt); do
        # Some file processing...
    done
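
The two approaches above can be combined. The following sketch creates a small file of sample IDs in a temporary directory (the file name sample_ids.txt is hypothetical) and builds a file path for each ID:

```bash
# Create a practice file with one sample ID per line
demo_dir=$(mktemp -d)
printf 'A1\nB6\nD3\n' > "$demo_dir"/sample_ids.txt

# Loop over the IDs in the file and build the matching file path
for sample in $(cat "$demo_dir"/sample_ids.txt); do
    R1=data/fastq/"$sample"_R1.fastq.gz
    echo "Sample $sample: forward reads expected at $R1"
done
```

Note that this approach relies on the shell splitting the output of cat at spaces and line breaks, so it only works as intended when the IDs or file names themselves contain no spaces.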

In cases like this, Bash arrays (basically, variables that consist of multiple values, like a vector in R) or while loops may provide more elegant solutions, but those are outside the scope of this introduction.