The main exercises will work with some FASTQ files. If you don’t care much for DNA sequence files, and perhaps start to get lost in the technicalities, make sure to carry on to the three bonus exercises.
These exercises will work with 6 FASTQ files with sequences from the V4 region of 16S rRNA, generated in a metabarcoding experiment.
The FASTQ files come in pairs: for every sample, there is a FASTQ file with forward reads (or “read 1” reads) that contains _R1_
in its filename, and a FASTQ file with corresponding reverse reads (or “read 2” reads) that contains _R2_
in its filename. So, our 6 FASTQ files consist of 3 pairs of files with forward and reverse reads for 3 different biological samples.
The sequences were generated by first amplifying environmental samples with a pair of universal 16S primers, and these primer sequences are expected to be present in the FASTQ sequences. You will search for these primer sequences below, and there are two things to be aware of:
First, the primers contain so-called ambiguity codes: an R means that the site can be either an A or a G, and an N means that the site can be any of the four bases. See here for a complete overview of these ambiguity codes. Second, because reads can come from either strand, you will need each primer as well as its reverse complement. Here are the primer sequences and their reverse complements1:
GAGTGYCAGCMGCCGCGGTAA / TTACCGCGGCKGCTGRCACTC
ACGGACTACNVGGGTWTCTAAT / ATTAGAWACCCBNGTAGTCCGT
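To see how an ambiguity code translates into a search pattern, here is a minimal sketch using a made-up four-base fragment (not one of the primers above):

```bash
# R = A or G, so the ambiguous fragment ATRG becomes AT[AG]G as a regex.
# Of these three test sequences, the first two match, so grep -c prints 2:
printf 'ATAG\nATGG\nATCG\n' | grep -c 'AT[AG]G'
```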
Create a directory for this week’s exercises and move into it.
In these exercises, you will be working with and modifying one of the scripts in Buffalo’s Chapter 12, which is printed below. Save this script as fastq_stat_loop.sh
.
#!/bin/bash
set -e -u -o pipefail
# Specify the input samples file (3rd column = path to FASTQ file):
sample_info=samples.txt
# Create a Bash array from the third column of "$sample_info":
sample_files=($(cut -f 3 "$sample_info"))
# Loop through the array:
for fastq_file in ${sample_files[@]}; do
# Strip .fastq from each FASTQ file, and add suffix:
results_file="$(basename $fastq_file .fastq)-stats.txt"
# Run "fastq_stat" on a file:
fastq_stat "$fastq_file" > stats/$results_file
done
The FASTQ files you’ll work with are inside the directory /fs/ess/PAS1855/data/week05/fastq
. Copy these files into an appropriate directory inside your own directory for these exercises, like data/
.
Unfortunately, the fastq_stat
program referenced in Buffalo’s script is an imaginary program… So, let’s create a script fastq_stat.sh
that actually produces a few descriptive statistics for a FASTQ file.
Set up a script skeleton with the header lines we’ve been discussing: a shebang line and the various set
settings for robust scripts.
The script should process one command-line argument, the input file. Assign the automatic placeholder variable for the first argument to a variable with a descriptive name, like fastq_file
.
The automatic placeholder variable for the first argument is $1
.
In the sections below, you can print all your output to standard out, i.e. simply use echo
with no redirection.
Also, just for the purpose of testing the commands while developing them below, it will be convenient to assign one of the FASTQ files to a variable fastq_file
.
Recall that the name of the script is automatically stored in $0
.
The number of sequences can be assumed to be the total number of lines divided by 4 (or alternatively, the number of lines consisting only of a +
). Recall that FASTQ files have 4 lines per sequence: a header, the sequence, a +
divider, and the quality scores.
To make the division, you can use syntax like expr 12 / 4
– and you can use command substitution, $(wc -l ...)
, to insert the number of lines in the division.
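As a quick sketch of this arithmetic – using a tiny made-up FASTQ file rather than one of the real ones:

```bash
# A FASTQ file has 4 lines per sequence; make a 2-sequence example file:
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > tiny.fastq

# Input redirection keeps the filename out of wc's output:
n_lines=$(wc -l < tiny.fastq)
expr "$n_lines" / 4     # 8 lines / 4 = 2 sequences
```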
Primers are usually located and removed with specialized programs like cutadapt, but we can use our grep skills to quickly search for them.
Write an if statement that tests whether the file name contains forward (_R1_) or reverse (_R2_) reads.
Replace each ambiguity code, like R, with character classes that enumerate the possible alternative bases.
The test in the if
statement should be a grep
command that examines whether the filename contains _R1_
(pipe echo
output into grep
). The grep
standard output should be redirected to /dev/null
, since we’re only interested in the exit status: if grep
finds a match, the commands following then
will be executed; if it doesn’t, the commands following else
will be executed.
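Put together, the skeleton could look something like this – the filename here is made up for the demo:

```bash
fastq_file="sampleA_R1_001.fastq"   # hypothetical filename for testing

# grep's exit status (not its output) decides which branch runs:
if echo "$fastq_file" | grep "_R1_" >/dev/null; then
    echo "forward reads"
else
    echo "reverse reads"
fi
```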
You can assume that matches will only be made in the sequences themselves (and not the headers or quality scores), so you can grep
the file as is: you don’t need to first select the lines with sequences.
To replace ambiguity codes with character classes in the grep
regular expression: an R
in the primers becomes [AG]
in the grep
expression, e.g. the partial sequence ATRG
would become AT[AG]G
in your regular expression.
You’ll need both the primer and its reverse complement in a single grep
regular expression: separate them with a logical “or” (|
) and to enable this, use grep -E
for extended regular expressions.
You’ll have to select only the lines with the actual DNA sequences, and the best way of doing that is using awk
like so (see the Solution for an explanation of why this works):
awk '{ if(NR%4 == 2) print }'
Next, you need to count the number of characters in each line, which is best done in the same awk
command using print length($0)
, which will print the number of characters in the line.
After that, it’s the familiar sort | uniq -c
idiom to create a count table.
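On a minimal made-up pair of FASTQ records, the full pipeline would look like this:

```bash
# Two records with read lengths 4 and 5; only lines 2, 6, 10, ... hold sequences.
# The output is a count table: one line per distinct sequence length.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCCA\n+\nIIIII\n' |
    awk '{ if(NR%4 == 2) print length($0) }' | sort | uniq -c
```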
Use the chmod
command.
Now, let’s modify Buffalo’s script fastq_stat_loop.sh
.
Currently, the metadata file that contains the list of files to process is hard-coded as samples.txt
. Also, the file names have to be extracted from a specific column from the metadata file. This is not a very flexible setup, and is sensitive to minor changes in a file like samples.txt
.
Change the script to let it accept a list of FASTQ file names as an argument.
Recall that the placeholder variable for the first argument that is passed to a script on the command line is $1
.
Inside the command substitution ($()
) that populates the array, you can simply use cat
instead of cut
on the file that contains the list of file names, since this file will no longer have multiple columns.
Currently, the output directory is hard-coded as stats
– let’s instead add the output directory as a second argument to the script.
You can write an explicit test to see if the output dir exists first, but simply using mkdir -p
will also work: with the -p
option, mkdir
doesn’t complain when a dir already exists (and can also make multiple levels of directories at once).
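A quick demonstration (the directory name here is arbitrary):

```bash
mkdir -p results/fastq_stats   # creates both levels in one go
mkdir -p results/fastq_stats   # repeating it is harmless: no error
ls -d results/fastq_stats
```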
Change the line that runs the imaginary program fastq_stat
to let it run your fastq_stat.sh
script instead. Make sure the path to your script and the path to the output file is correct.
In each iteration of the loop, let the script report which FASTQ file will be analyzed.
Now that the script takes arguments, we need another file or script to create the list of filenames and to submit the script with the appropriate arguments. Specifically, in this file, we need:
A line that creates a new file just containing the FASTQ file names (akin to column 3 from samples.txt
– but with the paths to our actual FASTQ files).
A line that runs the fastq_stat_loop.sh
script. The file that contains the list of FASTQ files should be passed to the script as the first argument, and the output directory as the second argument.
As for the actual path to the output dir, you can use whatever (relative!) path makes sense to you.
Create the file containing the code outlined above. You can save this file either as a .sh
script (e.g. fastqc_runner.sh
), or put these lines in a Markdown file inside a code block. Either option is reasonable because these lines would likely be run interactively, as opposed to the fastq_stat_loop.sh
script which will be run non-interactively. It’s still important to save these interactive commands in a file, so you know what you did and can easily reproduce it.
Make fastq_stat_loop.sh
executable and run it.
The loop script should run fastq_stat.sh
on all your FASTQ files. Check the output, which should be in one file per FASTQ file in the output dir you designated. You should be seeing that the vast majority of reads contain the primer sequences. Be proud – with a quick script that only runs for a couple of seconds, and no specialized software, you have queried hundreds of thousands of sequences!
In this exercise, you will touch up your fastq_stat.sh
script to include tests for robustness, and to report what the script is doing.
For the tests, check whether they actually work! For instance, to check the test for the number of arguments, try running the script with no arguments, and also with two arguments, and see if the script produces the error messages you wrote.
Add an if
statement to check whether the FASTQ file exists / is a regular file and whether it can be read. If not, give an error and exit the script with exit code 1.
Go back to this week’s slides for an example of testing whether a file is a regular file (-f
) and whether it can be read (-r
).
To exit with exit code 1, simply use: exit 1
. This should be done after printing any error messages, or those won’t actually be printed.
Go back to this week’s slides for an example of a very similar test.
The number of arguments that were passed to a script is automatically available in $#
.
You can test for equality using <integer> -eq <integer>
(e.g. 10 -eq 10
which will evaluate to true
) and you can negate a test (“number of arguments is not equal to 1”) using a !
before the comparison expression.
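Such a test could be sketched as follows – here, set -- fakes the positional arguments so the snippet can run outside a script:

```bash
set -- onearg                     # simulate passing exactly one argument
if [ ! "$#" -eq 1 ]; then
    echo "Error: expected 1 argument, got $#" >&2
    exit 1
fi
echo "OK: got $# argument"
```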
Add date
commands at the start and the end of the script, so you’ll be able to tell how long it took the script to complete.
Write a shell script called longest.sh
that takes two arguments: the name of a directory and a file extension (like txt
). The script should print the name of the file that has the most lines among all files with that extension in that directory.
Make sure the script has the shebang and set
headers, and make the script executable.
Then, run your script to learn which FASTQ file has the most lines (and sequences):
$ ./longest.sh data/fastq fastq
… should print the name of the .fastq
file in data/fastq
with the highest number of lines and therefore sequences.
You can count lines for many files at once using wc -l
: simply provide it with a globbing pattern.
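For example, on two throwaway files (names made up here), wc -l plus a reverse numeric sort puts the longest file near the top – note that wc’s own “total” line sorts first:

```bash
mkdir -p demo
printf 'a\n' > demo/x.txt
printf 'a\nb\nc\n' > demo/y.txt

# "total" sorts first, the longest single file comes second:
wc -l demo/*.txt | sort -rn | head -n 2 | tail -n 1
```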
This exercise is slightly modified after 1.10.3 from the CSB book. The Saavedra2013
directory can be found inside the CSB repository at CSB/unix/data/Saavedra2013
, and the following code assumes you are in the directory CSB/unix/sandbox
. (If you no longer have the repository, download it again using git clone https://github.com/CSB-book/CSB.git
.)
Saavedra and Stouffer (2013) studied several plant–pollinator networks. These can be represented as rectangular matrices where the rows are pollinators, the columns plants, a 0 indicates the absence and 1 the presence of an interaction between the plant and the pollinator.
The data of Saavedra and Stouffer (2013) can be found in the directory CSB/unix/data/Saavedra2013
.
Write a script that takes one of these files as an argument, and determines the number of rows (pollinators) and columns (plants). Note that columns are separated by spaces. Don’t forget to make your script executable and to add the standard header lines. Your script should return:
$ ./netsize.sh ../data/Saavedra2013/n1.txt
#> Filename: ../data/Saavedra2013/n1.txt
#> Number of rows: 97
#> Number of columns: 80
Write a script that prints the numbers of rows and columns for each network, taking the directory containing all the files as an argument:
$ ./netsize_all.sh ../data/Saavedra2013
#> ../data/Saavedra2013/n10.txt 14 20
#> ../data/Saavedra2013/n11.txt 270 91
#> ../data/Saavedra2013/n12.txt 7 72
#> ../data/Saavedra2013/n13.txt 61 17
#> …
To find the number of columns, use awk
and recall awk
’s NF
(number of fields => number of columns) keyword.
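For instance, on a made-up 2×4 matrix piped in on standard input:

```bash
# NF is the number of whitespace-separated fields; exit after the first line,
# since every row has the same number of columns:
printf '0 1 0 1\n1 0 0 0\n' | awk '{ print NF; exit }'   # prints 4
```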
To combine them in a script, use command substitution to assign the result of a command to a variable. (For example: mytxtfiles=$(ls *.txt)
stores the list of .txt
files in the variable $mytxtfiles
.)
Next, you need to write a for
loop.
You can now use the script you’ve just written in combination with sort
to answer the questions (remember the option -k
to choose a column and -r
to reverse the sorting order).
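A tiny sketch with made-up two-column data:

```bash
# Sort numerically (-n) on column 2 (-k 2), largest first (-r):
printf 'net1 14\nnet2 270\nnet3 7\n' | sort -k 2 -n -r | head -n 1
```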
You can use echo -e
to print tabs using \t
: echo -e "column1 \t column2"
.
This is slightly modified after exercise 1.10.4 from the CSB book. The Buzzard2015_data.csv
file can be found inside the CSB repository at CSB/unix/data/Buzzard2015_data.csv
.
Buzzard et al. (2016) collected data on the growth of a forest in Costa Rica. In the file Buzzard2015_data.csv
you will find a subset of their data, including taxonomic information, abundance, and biomass of trees.
1. Write a script that, for a given CSV file and column number, prints:
Don’t forget to make your script executable and add the standard header lines.
For example, running the script with:
$ ./explore.sh ../data/Buzzard2015_data.csv 7
…should return:
Column name:
biomass
Number of distinct values:
285
Minimum value:
1.048466198
Maximum value:
14897.29471
You can select a given column from a csv
file using the command cut
. Then,
The column name is going to be in the first line (header); access it with head
.
For the next few commands, you’ll need to remove the header line – the tail
trick to do so is tail -n +2
.
The number of distinct values can be found by counting the number of lines when you have sorted them and removed duplicates (using a combination of tail
, sort
and uniq
).
The minimum and maximum values can be found by combining sort
and head
(or tail
).
Rename the placeholders $1
and $2
for the command-line arguments to named variables for the file name and column number, respectively.
mkdir /fs/ess/PAS1855/users/$USER/week05/exercises/
mkdir -p data/fastq/
cp /fs/ess/PAS1855/data/week05/fastq/* data/fastq/
The first two lines of the script:
#!/bin/bash
set -u -e -o pipefail
Save the file as fastq_stat.sh
.
Add a line like this to your fastq_stat.sh
script:
fastq_file=$1
For testing the code below, we first assign one of the FASTQ filenames to the variable $fastq_file
:
fastq_file=201-S4-V4-V5_S53_L001_R1_001.fastq
Add lines like these to your fastq_stat.sh
script:
echo "$0: A script to compute basic summary stats for a FASTQ file."
echo "FASTQ file to be analyzed: $fastq_file"
Add lines like these to your fastq_stat.sh
script:
# We save the output of our commands using command substitution, $().
# In the wc -l command, use input redirection so the filename is not in the output.
n_lines=$(wc -l < "$fastq_file")
n_seqs=$(expr "$n_lines" / 4) # Use expr for arithmetic
echo "Number of sequences: $n_seqs"
Alternatively, you can use the $(( ))
syntax for arithmetic, which doesn’t need a separate command like expr – inside $(( )), you can refer to variables without the $ prefix:
n_seqs=$(( n_lines / 4 ))
To get the number of sequences, you can also count the number of lines that only have a +
symbol:
n_seqs=$(grep -c "^+$" "$fastq_file")
Recall that +
is also a regular expression symbol, but only so in the extended regex set. Therefore, without the -E
flag to grep
, we are matching a literal +
when we use one in our expression.
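A quick check of this behavior:

```bash
# Without -E, + is literal; ^+$ matches only a line that is exactly "+",
# so of these three lines only the first counts:
printf '+\nAB+\nC\n' | grep -c '^+$'   # prints 1
```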
Add lines like these to your fastq_stat.sh
script:
if echo "$fastq_file" | grep "_R1_" >/dev/null; then
echo "FASTQ file contains forward (R1) reads, checking for primer 1..."
n_primerF=$(grep -Ec "GAGTG[CT]CAGC[AC]GCCGCGGTAA|TTACCGCGGC[GT]GCTG[AG]CACTC" "$fastq_file")
echo "Number of forward primer sequences found: $n_primerF"
else
echo "FASTQ file contains reverse (R2) reads, checking for primer 2..."
n_primerR=$(grep -Ec "ACGGACTAC[ACTG][ACG]GGGT[AT]TCTAAT|ATTAGA[AT]ACCC[CGT][ACTG]GTAGTCCGT" "$fastq_file")
echo "Number of reverse primer sequences found: $n_primerR"
fi
To initiate the if
statement:
We redirect grep’s output to /dev/null, since we’re only interested in the exit status. If grep finds a match, this evaluates to “true”, and the next code block will be executed. If no match is found, the block after else will be executed.
Inside the if
statement:
We use grep’s -E option to search for either of the two primer sequences at once with |. We use character classes like [CT] in place of each ambiguity code. We use grep’s -c option to count the matches.
Or, to explicitly check that the file contains _R2_
in its name, rather than assuming this must be the case if it doesn’t contain _R1_
, you can use elif
(short for “else-if”) to add another test:
if echo "$fastq_file" | grep "_R1_" >/dev/null; then
echo "FASTQ file contains forward (R1) reads, checking for primer 1..."
n_primerF=$(grep -Ec "GAGTG[CT]CAGC[AC]GCCGCGGTAA|TTACCGCGGC[GT]GCTG[AG]CACTC" "$fastq_file")
echo "Number of forward primer sequences found: $n_primerF"
elif echo "$fastq_file" | grep "_R2_" >/dev/null; then
echo "FASTQ file contains reverse (R2) reads, checking for primer 2..."
n_primerR=$(grep -Ec "ACGGACTAC[ACTG][ACG]GGGT[AT]TCTAAT|ATTAGA[AT]ACCC[CGT][ACTG]GTAGTCCGT" "$fastq_file")
echo "Number of reverse primer sequences found: $n_primerR"
fi
When we test this code by itself, we get:
#> FASTQ file contains forward (R1) reads, checking for primer 1...
#> Number of forward primer sequences found: 44687
This looks good!
Add lines like these to your fastq_stat.sh
script:
echo "Count table of sequence lengths:"
awk '{ if(NR%4 == 2) print length($0) }' "$fastq_file" | sort | uniq -c
We select only the lines (line number NR
in awk
) for which, after dividing the line number by 4, we have 2 left: NR%4 == 2
. We print the number of characters in those lines using awk
’s length
function: print length($0)
. Finally, we pipe those lengths into the familiar sort | uniq -c
idiom, which will give us a count table.
Assuming it is in the working dir:
$ chmod u+x fastq_stat.sh
# Or for all scripts at once:
$ chmod u+x *sh
# Assuming you have assigned a FASTQ file to $fastq_file for testing:
./fastq_stat.sh $fastq_file
Add this line to the script:
file_list="$1"
Now, replace the following line:
# Old line:
# sample_files=($(cut -f 3 "$sample_info"))
# New line:
sample_files=($(cat "$file_list"))
output_dir="$2"
# Create the output dir, if necessary:
mkdir -p "$output_dir"
mkdir -p
will not complain if the directory already exists, and it can make multiple levels of directories at once.
Change the line that runs the imaginary program fastq_stat
to call your script.
# Old line:
# fastq_stat "$fastq_file" > stats/$results_file
# New line:
scripts/fastq_stat.sh "$fastq_file" > "$output_dir"/"$results_file"
The results_file
line can remain the same:
results_file="$(basename $fastq_file .fastq)-stats.txt"
Add the following line inside the loop:
echo "Running fastq_stat for FASTQ file $fastq_file"
file_list=fastq_file_list.txt
ls data/*fastq >"$file_list"
output_dir=results/fastq_stats
./fastq_stat_loop.sh "$file_list" "$output_dir"
Make fastq_stat_loop.sh
executable and run it.
Run these lines:
$ chmod u+x ./fastq_stat_loop.sh
$ ./fastq_stat_loop.sh "$file_list" "$output_dir"
The full fastq_stat_loop.sh
script:
#!/bin/bash
set -e -u -o pipefail
file_list="$1"
output_dir="$2"
# Create the output dir, if necessary:
mkdir -p "$output_dir"
# Create an array with FASTQ files
fastq_files=($(cat "$file_list"))
# Report:
echo "Number of fastq files: ${#fastq_files[@]}"
# Loop through the array:
for fastq_file in "${fastq_files[@]}"; do
echo "Running fastq_stat for FASTQ file $fastq_file"
# Strip .fastq from each FASTQ file, and add suffix:
results_file="$(basename "$fastq_file" .fastq)-stats.txt"
# Run "fastq_stat" on a file:
scripts/fastq_stat.sh "$fastq_file" >"$output_dir"/"$results_file"
done
file_list=fastq_file_list.txt
output_dir=results/fastq_stats
ls data/*fastq >"$file_list"
./fastq_stat_loop.sh "$file_list" "$output_dir"
Add these lines to your script:
! -f
will be true if the file is not a regular/existing file
! -r
will be true if the file is not readable
We use ||
to separate the two conditions with a logical or.
With exit 1
, we terminate the script with an exit code that indicates failure.
if [ ! -f "$fastq_file" ] || [ ! -r "$fastq_file" ]; then
echo "Error: can't open file"
echo "Second argument should be a readable file"
echo "You provided: $fastq_file"
exit 1
fi
$ ./fastq_stat.sh blabla
Add these lines to your script:
$#
is the number of command-line arguments passed to the script
-eq
will test whether the numbers to the left and right of it are the same, and will return true
if they are.
We negate this with !
: if the number of arguments is NOT 1, then error out.
exit 1
will exit with exit code 1, which signifies an error / failure.
if [ ! "$#" -eq 1 ]; then # If the number of args does NOT equal 1, then
echo "Error: wrong number of arguments"
echo "You provided $# arguments, while 1 is required."
echo "Usage: fastq_stat.sh <file-name>"
exit 1
fi
$ ./fastq_stat.sh # No args
$ ./fastq_stat.sh $fastq_file blabla # Two args
Add the date
commands.
Simply include two lines with:
date
… in the script, one before file processing, and one after.
The full fastq_stat.sh
script:
#!/bin/bash
set -u -e -o pipefail
echo "$0: A script to compute basic summary stats for a FASTQ file."
date
echo
# Test number of args -------------------------------------------------
if [ ! "$#" -eq 1 ]; then
echo "Error: wrong number of arguments"
echo "You provided $# arguments, while 1 is required."
echo "Usage: fastq_stat.sh <file-name>"
exit 1
fi
# Process command-line args and report ------------------------------------
fastq_file="$1"
echo "FASTQ file to be analyzed: $fastq_file"
echo
if [ ! -f "$fastq_file" ] || [ ! -r "$fastq_file" ]; then
echo "Error: can't open file"
echo "Second argument should be a readable file"
echo "You provided: $fastq_file"
exit 1
fi
# Count the number of sequences -----------------------------------------
n_lines=$(wc -l <"$fastq_file")
n_seqs=$(expr "$n_lines" / 4)
echo "Number of sequences: $n_seqs"
# Count primer sequences ------------------------------------------------
if echo "$fastq_file" | grep "_R1_" >/dev/null; then
echo "FASTQ file contains forward (R1) reads, checking for primer 1..."
n_primerF=$(grep -Ec "GAGTG[CT]CAGC[AC]GCCGCGGTAA|TTACCGCGGC[GT]GCTG[AG]CACTC" "$fastq_file")
echo "Number of forward primer sequences found: $n_primerF"
elif echo "$fastq_file" | grep "_R2_" >/dev/null; then
echo "FASTQ file contains reverse (R2) reads, checking for primer 2..."
n_primerR=$(grep -Ec "ACGGACTAC[ACTG][ACG]GGGT[AT]TCTAAT|ATTAGA[AT]ACCC[CGT][ACTG]GTAGTCCGT" "$fastq_file")
echo "Number of reverse primer sequences found: $n_primerR"
fi
# Bonus: Count table of sequence lengths ---------------------------------
echo "Count table of sequence lengths:"
awk '{ if(NR%4 == 2) print length($0) }' "$fastq_file" | sort | uniq -c
# Report ----------------------------------------------------------------
echo
echo "Done with $0 for $fastq_file."
date
There are several ways to do this, here’s one example:
#!/bin/bash
set -u -e -o pipefail
dir="$1"
extension="$2"
wc -l "$dir"/*."$extension" | sort -rn | sed -n '2p'
You have to print the second rather than the first line: after the reverse numeric sort, the first line will be the total line count across all files, which wc
automatically computes.
Instead of using sed
, you could also use head -n 2 | tail -n 1
to print the second line:
wc -l "$dir"/*."$extension" | sort -rn | head -n 2 | tail -n 1
(Note also that there are two files with the same number of lines: R1
and R2
files for the same sample always have the same number of reads.)
#!/bin/bash
set -u -e -o pipefail
file="$1"
echo "Filename:"
echo "$file"
# We can get the number of rows simply by counting the number of lines:
# To avoid printing the filename we redirect the input like below
# (Or we could have done: "cat ../data/Saavedra2013/n10.txt | wc -l")
echo "Number of rows:"
wc -l < "$file"
# To count the number of columns, we use awk - recall that NF is the number
# of fields (columns), and to print that number only for a single line, we exit:
echo "Number of columns:"
awk '{ print NF; exit }' "$file"
# head -n 1 "$file" | awk '{ print NF }' # Also works
We can save this script as netsize.sh
and make it executable using chmod u+x netsize.sh
.
#!/bin/bash
set -u -e -o pipefail
dir="$1"
# We can loop over the files using globbing:
for file in "$dir"/*.txt; do
# Next, we can save the number of rows and columns in variables:
n_row=$(wc -l < "$file")
n_col=$(awk '{ print NF; exit }' "$file")
# And print them all on one line:
echo -e "$file \t $n_row \t $n_col"
done
We can save this script as netsize_all.sh
and run it as follows:
./netsize_all.sh ../data/Saavedra2013
# Having written the script netsize_all.sh,
# you can take its output and order it according to rows or columns.
# Sorting by column 2 gives you the file with the largest number of rows:
$ ./netsize_all.sh ../data/Saavedra2013 | sort -n -r -k 2 | head -n 1
#> ../data/Saavedra2013/n58.txt 678 90
# Sorting by column 3 gives you the file with the largest number of columns:
$ ./netsize_all.sh ../data/Saavedra2013 | sort -n -r -k 3 | head -n 1
#> ../data/Saavedra2013/n56.txt 110 207
#!/bin/bash
set -u -e -o pipefail
file=$1 # $1 is the file name
column=$2 # $2 is the column of interest
echo "Column name:"
cut -d ',' -f "$column" "$file" | head -n 1
# In the next lines, we need to skip the header, which we can do using
# tail -n +2
echo "Number of distinct values:"
cut -d ',' -f "$column" "$file" | tail -n +2 | sort | uniq | wc -l
echo "Minimum value:"
cut -d ',' -f "$column" "$file" | tail -n +2 | sort -n | head -n 1
echo "Maximum value:"
cut -d ',' -f "$column" "$file" | tail -n +2 | sort -n | tail -n 1
If we save this script as explore.sh
and make it executable, we can run it using:
./explore.sh ../data/Buzzard2015_data.csv 6
# Column name:
# Abund.n
# Number of distinct values:
# 46
# Minimum value:
# 1
# Maximum value:
# 157
# This works well also for alphabetical order:
./explore.sh ../data/Buzzard2015_data.csv 3
# Column name:
# genus
# Number of distinct values:
# 85
# Minimum value:
# Acacia
# Maximum value:
# Zanthoxylum
There initially was an error in these primer sequences (one sequence repeated twice), which has been corrected on Friday, Feb 12.↩︎
This exercise was slightly modified from Software Carpentry’s Shell Novice tutorial.↩︎
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".