The main exercises will work with some FASTQ files. If you don’t care much for DNA sequence files, and perhaps start to get lost in the technicalities, make sure to carry on to the three bonus exercises.
These exercises will work with 6 FASTQ files with sequences from the V4 region of 16S rRNA, generated in a metabarcoding experiment.
The FASTQ files come in pairs: for every sample, there is a FASTQ file with forward reads (or “read 1” reads) that contains _R1_
in its filename, and a FASTQ file with corresponding reverse reads (or “read 2” reads) that contains _R2_
in its filename. So, our 6 FASTQ files consist of 3 pairs of files with forward and reverse reads for 3 different biological samples.
The sequences were generated by first amplifying environmental samples with a pair of universal 16S primers, and these primer sequences are expected to be present in the FASTQ sequences. You will search for these primer sequences below, and there are two things to be aware of:
First, the primers contain so-called ambiguity codes: an R means that the site can be either an A or a G, and an N means that the site can be any of the four bases. See here for a complete overview of these ambiguity codes. Second, because reads can come from either strand, you will need each primer as well as its reverse complement. Here are the primer sequences and their reverse complements1:
GAGTGYCAGCMGCCGCGGTAA / TTACCGCGGCKGCTGRCACTC
ACGGACTACNVGGGTWTCTAAT / ATTAGAWACCCBNGTAGTCCGT
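To see how an ambiguity code translates into a search pattern, here is a minimal sketch using a made-up four-base fragment (not one of the primers above):

```bash
# R = A or G, so the ambiguous fragment ATRG becomes AT[AG]G as a regex.
# Of these three test sequences, the first two match, so grep -c prints 2:
printf 'ATAG\nATGG\nATCG\n' | grep -c 'AT[AG]G'
```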
Create a directory for this week’s exercises and move into it.
In these exercises, you will be working with and modifying one of the scripts in Buffalo’s Chapter 12, which is printed below. Save this script as fastq_stat_loop.sh
.
#!/bin/bash
set -e -u -o pipefail
# Specify the input samples file (3rd column = path to FASTQ file):
sample_info=samples.txt
# Create a Bash array from the third column of "$sample_info":
sample_files=($(cut -f 3 "$sample_info"))
# Loop through the array:
for fastq_file in ${sample_files[@]}; do
# Strip .fastq from each FASTQ file, and add suffix:
results_file="$(basename $fastq_file .fastq)-stats.txt"
# Run "fastq_stat" on a file:
fastq_stat "$fastq_file" > stats/$results_file
done
The FASTQ files you’ll work with are inside the directory /fs/ess/PAS1855/data/week05/fastq
. Copy these files into an appropriate directory inside your own directory for these exercises, like data/
.
Unfortunately, the fastq_stat
program referenced in Buffalo’s script is an imaginary program… So, let’s create a script fastq_stat.sh
that actually produces a few descriptive statistics for a FASTQ file.
Set up a script skeleton with the header lines we’ve been discussing: a shebang line and the various set
settings for robust scripts.
The script should process one command-line argument, the input file. Assign the automatic placeholder variable for the first argument to a variable with a descriptive name, like fastq_file
.
The automatic placeholder variable for the first argument is $1
.
In the sections below, you can print all your output to standard out, i.e. simply use echo
with no redirection.
Also, just for the purpose of testing the commands while developing them below, it will be convenient to assign one of the FASTQ files to a variable fastq_file
.
Recall that the name of the script is automatically stored in $0
.
The number of sequences can be assumed to be the total number of lines divided by 4 (or alternatively, the number of lines consisting only of a +
). Recall that FASTQ files have 4 lines per sequence: a header, the sequence, a +
divider, and the quality scores.
To make the division, you can use syntax like expr 12 / 4
– and you can use command substitution, $(wc -l ...)
, to insert the number of lines in the division.
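As a quick sketch of this arithmetic – using a tiny made-up FASTQ file rather than one of the real ones:

```bash
# A FASTQ file has 4 lines per sequence; make a 2-sequence example file:
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n' > tiny.fastq

# Input redirection keeps the filename out of wc's output:
n_lines=$(wc -l < tiny.fastq)
expr "$n_lines" / 4     # 8 lines / 4 = 2 sequences
```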
Primers are usually located and removed with specialized programs like cutadapt, but we can use our grep skills to quickly search for them.
Write an if statement that tests whether the file name contains forward (_R1_) or reverse (_R2_) reads.
Replace each ambiguity code, like R, with character classes that enumerate the possible alternative bases.
The test in the if
statement should be a grep
command that examines whether the filename contains _R1_
(pipe echo
output into grep
). The grep
standard output should be redirected to /dev/null
, since we’re only interested in the exit status: if grep
finds a match, the commands following then
will be executed; if it doesn’t, the commands following else
will be executed.
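Put together, the skeleton could look something like this – the filename here is made up for the demo:

```bash
fastq_file="sampleA_R1_001.fastq"   # hypothetical filename for testing

# grep's exit status (not its output) decides which branch runs:
if echo "$fastq_file" | grep "_R1_" >/dev/null; then
    echo "forward reads"
else
    echo "reverse reads"
fi
```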
You can assume that matches will only be made in the sequences themselves (and not the headers or quality scores), so you can grep
the file as is: you don’t need to first select the lines with sequences.
To replace ambiguity codes with character classes in the grep
regular expression: an R
in the primers becomes [AG]
in the grep
expression, e.g. the partial sequence ATRG
would become AT[AG]G
in your regular expression.
You’ll need both the primer and its reverse complement in a single grep
regular expression: separate them with a logical “or” (|
) and to enable this, use grep -E
for extended regular expressions.
You’ll have to select only the lines with the actual DNA sequences, and the best way of doing that is using awk
like so (see the Solution for an explanation of why this works):
awk '{ if(NR%4 == 2) print }'
Next, you need to count the number of characters in each line, which is best done in the same awk
command using print length($0)
, which will print the number of characters in the line.
After that, it’s the familiar sort | uniq -c
idiom to create a count table.
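On a minimal made-up pair of FASTQ records, the full pipeline would look like this:

```bash
# Two records with read lengths 4 and 5; only lines 2, 6, 10, ... hold sequences.
# The output is a count table: one line per distinct sequence length.
printf '@r1\nACGT\n+\nIIII\n@r2\nGGCCA\n+\nIIIII\n' |
    awk '{ if(NR%4 == 2) print length($0) }' | sort | uniq -c
```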
Use the chmod
command.
Now, let’s modify Buffalo’s script fastq_stat_loop.sh
.
Currently, the metadata file that contains the list of files to process is hard-coded as samples.txt
. Also, the file names have to be extracted from a specific column from the metadata file. This is not a very flexible setup, and is sensitive to minor changes in a file like samples.txt
.
Change the script to let it accept a list of FASTQ file names as an argument.
Recall that the placeholder variable for the first argument that is passed to a script on the command line is $1
.
Inside the command substitution ($()
) that populates the array, you can simply use cat
instead of cut
on the file that contains the list of file names, since this file will no longer have multiple columns.
Currently, the output directory is hard-coded as stats
– let’s instead add the output directory as a second argument to the script.
You can write an explicit test to see if the output dir exists first, but simply using mkdir -p
will also work: with the -p
option, mkdir
doesn’t complain when a dir already exists (and can also make multiple levels of directories at once).
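A quick demonstration (the directory name here is arbitrary):

```bash
mkdir -p results/fastq_stats   # creates both levels in one go
mkdir -p results/fastq_stats   # repeating it is harmless: no error
ls -d results/fastq_stats
```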
Change the line that runs the imaginary program fastq_stat
to let it run your fastq_stat.sh
script instead. Make sure the path to your script and the path to the output file is correct.
In each iteration of the loop, let the script report which FASTQ file will be analyzed.
Now that the script takes arguments, we need another file or script to create the list of filenames and to submit the script with the appropriate arguments. Specifically, in this file, we need:
A line that creates a new file just containing the FASTQ file names (akin to column 3 from samples.txt
– but with the paths to our actual FASTQ files).
A line that runs the fastq_stat_loop.sh
script. The file that contains the list of FASTQ files should be passed to the script as the first argument, and the output directory as the second argument.
As for the actual path to the output dir, you can use whatever (relative!) path makes sense to you.
Create the file containing the code outlined above. You can save this file either as a .sh
script (e.g. fastqc_runner.sh
), or put these lines in a Markdown file inside a code block. Either option is reasonable because these lines would likely be run interactively, as opposed to the fastq_stat_loop.sh
script which will be run non-interactively. It’s still important to save these interactive commands in a file, so you know what you did and can easily reproduce it.
Make fastq_stat_loop.sh
executable and run it.
The loop script should run fastq_stat.sh
on all your FASTQ files. Check the output, which should be in one file per FASTQ file in the output dir you designated. You should be seeing that the vast majority of reads contain the primer sequences. Be proud – with a quick script that only runs for a couple of seconds, and no specialized software, you have queried hundreds of thousands of sequences!
In this exercise, you will touch up your fastq_stat.sh
script to include tests for robustness, and to report what the script is doing.
For the tests, check whether they actually work! For instance, to check the test for the number of arguments, try running the script with no arguments, and also with two arguments, and see if the script produces the error messages you wrote.
Add an if
statement to check whether the FASTQ file exists / is a regular file and whether it can be read. If not, give an error and exit the script with exit code 1.
Go back to this week’s slides for an example of testing whether a file is a regular file (-f
) and whether it can be read (-r
).
To exit with exit code 1, simply use: exit 1
. This should be done after printing any error messages, or those won’t actually be printed.
Go back to this week’s slides for an example of a very similar test.
The number of arguments that were passed to a script is automatically available in $#
.
You can test for equality using <integer> -eq <integer>
(e.g. 10 -eq 10
which will evaluate to true
) and you can negate a test (“number of arguments is not equal to 1”) using a !
before the comparison expression.
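Such a test could be sketched as follows – here, set -- fakes the positional arguments so the snippet can run outside a script:

```bash
set -- onearg                     # simulate passing exactly one argument
if [ ! "$#" -eq 1 ]; then
    echo "Error: expected 1 argument, got $#" >&2
    exit 1
fi
echo "OK: got $# argument"
```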
Add date
commands at the start and the end of the script, so you’ll be able to tell how long it took the script to complete.
Write a shell script called longest.sh
that takes two arguments: the name of a directory and a file extension (like txt
). The script should print the name of the file that has the most lines among all files with that extension in that directory.
Make sure the script has the shebang and set
headers, and make the script executable.
Then, run your script to learn which FASTQ file has the most lines (and sequences):
$ ./longest.sh data/fastq fastq
… should print the name of the .fastq
file in data/fastq
with the highest number of lines and therefore sequences.
You can count lines for many files at once using wc -l
: simply provide it with a globbing pattern.
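For example, on two throwaway files (names made up here), wc -l plus a reverse numeric sort puts the longest file near the top – note that wc’s own “total” line sorts first:

```bash
mkdir -p demo
printf 'a\n' > demo/x.txt
printf 'a\nb\nc\n' > demo/y.txt

# "total" sorts first, the longest single file comes second:
wc -l demo/*.txt | sort -rn | head -n 2 | tail -n 1
```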
This exercise is slightly modified after 1.10.3 from the CSB book. The Saavedra2013
directory can be found inside the CSB repository at CSB/unix/data/Saavedra2013
, and the following code assumes you are in the directory CSB/unix/sandbox
. (If you no longer have the repository, download it again using git clone https://github.com/CSB-book/CSB.git
.)
Saavedra and Stouffer (2013) studied several plant–pollinator networks. These can be represented as rectangular matrices where the rows are pollinators, the columns plants, a 0 indicates the absence and 1 the presence of an interaction between the plant and the pollinator.
The data of Saavedra and Stouffer (2013) can be found in the directory CSB/unix/data/Saavedra2013
.
Write a script that takes one of these files as an argument, and determines the number of rows (pollinators) and columns (plants). Note that columns are separated by spaces. Don’t forget to make your script executable and to add the standard header lines. Your script should return:
$ ./netsize.sh ../data/Saavedra2013/n1.txt
#> Filename: ../data/Saavedra2013/n1.txt
#> Number of rows: 97
#> Number of columns: 80
Write a script that prints the numbers of rows and columns for each network, taking the directory containing all the files as an argument:
$ ./netsize_all.sh ../data/Saavedra2013
#> ../data/Saavedra2013/n10.txt 14 20
#> ../data/Saavedra2013/n11.txt 270 91
#> ../data/Saavedra2013/n12.txt 7 72
#> ../data/Saavedra2013/n13.txt 61 17
#> …
To find the number of columns, use awk
and recall awk
’s NF
(number of fields => number of columns) keyword.
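For instance, on a made-up 2×4 matrix piped in on standard input:

```bash
# NF is the number of whitespace-separated fields; exit after the first line,
# since every row has the same number of columns:
printf '0 1 0 1\n1 0 0 0\n' | awk '{ print NF; exit }'   # prints 4
```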
To combine them in a script, use command substitution to assign the result of a command to a variable. (For example: mytxtfiles=$(ls *.txt)
stores the list of .txt
files in the variable $mytxtfiles
.)
Next, you need to write a for
loop.
You can now use the script you’ve just written in combination with sort
to answer the questions (remember the option -k
to choose a column and -r
to reverse the sorting order).
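A tiny sketch with made-up two-column data:

```bash
# Sort numerically (-n) on column 2 (-k 2), largest first (-r):
printf 'net1 14\nnet2 270\nnet3 7\n' | sort -k 2 -n -r | head -n 1
```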
You can use echo -e
to print tabs using \t
: echo -e "column1 \t column2"
.
This is slightly modified after exercise 1.10.4 from the CSB book. The Buzzard2015_data.csv
file can be found inside the CSB repository at CSB/unix/data/Buzzard2015_data.csv
.
Buzzard et al. (2016) collected data on the growth of a forest in Costa Rica. In the file Buzzard2015_data.csv
you will find a subset of their data, including taxonomic information, abundance, and biomass of trees.
1. Write a script that, for a given CSV file and column number, prints:
Don’t forget to make your script executable and add the standard header lines.
For example, running the script with:
$ ./explore.sh ../data/Buzzard2015_data.csv 7
…should return:
Column name:
biomass
Number of distinct values:
285
Minimum value:
1.048466198
Maximum value:
14897.29471
You can select a given column from a csv
file using the command cut
. Then,
The column name is going to be in the first line (header); access it with head
.
For the next few commands, you’ll need to remove the header line – the tail
trick to do so is tail -n +2
.
The number of distinct values can be found by counting the number of lines when you have sorted them and removed duplicates (using a combination of tail
, sort
and uniq
).
The minimum and maximum values can be found by combining sort
and head
(or tail
).
Rename the placeholders $1
and $2
for the command-line arguments to named variables for the file name and column number, respectively.
mkdir /fs/ess/PAS1855/users/$USER/week05/exercises/
mkdir -p data/fastq/
cp /fs/ess/PAS1855/data/week05/fastq/* data/fastq/
The first two lines of the script:
#!/bin/bash
set -u -e -o pipefail
Save the file as fastq_stat.sh
.
Add a line like this to your fastq_stat.sh
script:
fastq_file=$1
For testing the code below, we first assign one of the FASTQ filenames to the variable $fastq_file
:
fastq_file=201-S4-V4-V5_S53_L001_R1_001.fastq
Add lines like these to your fastq_stat.sh
script:
echo "$0: A script to compute basic summary stats for a FASTQ file."
echo "FASTQ file to be analyzed: $fastq_file"
Add lines like these to your fastq_stat.sh
script:
# We save the output of our commands using command substitution, $().
# In the wc -l command, use input redirection so the filename is not in the output.
n_lines=$(wc -l < "$fastq_file")
n_seqs=$(expr "$n_lines" / 4) # Use expr for arithmetic
echo "Number of sequences: $n_seqs"
Alternatively, you can use the $(( ))
syntax for arithmetic, which doesn’t need a separate command like expr – inside $(( )), you can refer to variables without the $ prefix:
n_seqs=$(( n_lines / 4 ))
To get the number of sequences, you can also count the number of lines that only have a +
symbol:
n_seqs=$(grep -c "^+$" "$fastq_file")
Recall that +
is also a regular expression symbol, but only so in the extended regex set. Therefore, without the -E
flag to grep
, we are matching a literal +
when we use one in our expression.
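A quick check of this behavior:

```bash
# Without -E, + is literal; ^+$ matches only a line that is exactly "+",
# so of these three lines only the first counts:
printf '+\nAB+\nC\n' | grep -c '^+$'   # prints 1
```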
Add lines like these to your fastq_stat.sh
script:
if echo "$fastq_file" | grep "_R1_" >/dev/null; then
echo "FASTQ file contains forward (R1) reads, checking for primer 1..."
n_primerF=$(grep -Ec "GAGTG[CT]CAGC[AC]GCCGCGGTAA|TTACCGCGGC[GT]GCTG[AG]CACTC" "$fastq_file")
echo "Number of forward primer sequences found: $n_primerF"
else
echo "FASTQ file contains reverse (R2) reads, checking for primer 2..."
n_primerR=$(grep -Ec "ACGGACTAC[ACTG][ACG]GGGT[AT]TCTAAT|ATTAGA[AT]ACCC[CGT][ACTG]GTAGTCCGT" "$fastq_file")
echo "Number of reverse primer sequences found: $n_primerR"
fi
To initiate the if
statement:
We redirect grep’s output to /dev/null, since we’re only interested in the exit status. If grep finds a match, this evaluates to “true”, and the next code block will be executed. If no match is found, the block after else will be executed.
Inside the if
statement:
We use grep’s -E option to search for either of the two primer sequences at once with |. We use character classes like [CT] in place of each ambiguity code. We use grep’s -c option to count the matches.
Or, to explicitly check that the file contains _R2_
in its name, rather than assuming this must be the case if it doesn’t contain _R1_
, you can use elif
(short for “else-if”) to add another test:
if echo "$fastq_file" | grep "_R1_" >/dev/null; then
echo "FASTQ file contains forward (R1) reads, checking for primer 1..."
n_primerF=$(grep -Ec "GAGTG[CT]CAGC[AC]GCCGCGGTAA|TTACCGCGGC[GT]GCTG[AG]CACTC" "$fastq_file")
echo "Number of forward primer sequences found: $n_primerF"
elif echo "$fastq_file" | grep "_R2_" >/dev/null; then
echo "FASTQ file contains reverse (R2) reads, checking for primer 2..."
n_primerR=$(grep -Ec "ACGGACTAC[ACTG][ACG]GGGT[AT]TCTAAT|ATTAGA[AT]ACCC[CGT][ACTG]GTAGTCCGT" "$fastq_file")
echo "Number of reverse primer sequences found: $n_primerR"
fi
When we test this code by itself, we get:
#> FASTQ file contains forward (R1) reads, checking for primer 1...
#> Number of forward primer sequences found: 44687
This looks good!
Add lines like these to your fastq_stat.sh
script:
echo "Count table of sequence lengths:"
awk '{ if(NR%4 == 2) print length($0) }' "$fastq_file" | sort | uniq -c
We select only the lines (line number NR
in awk
) for which, after dividing the line number by 4, we have 2 left: NR%4 == 2
. We print the number of characters in those lines using awk
’s length
function: print length($0)
. Finally, we pipe those lengths into the familiar sort | uniq -c
idiom, which will give us a count table.
Assuming it is in the working dir:
$ chmod u+x fastq_stat.sh
# Or for all scripts at once:
$ chmod u+x *sh
# Assuming you have assigned a FASTQ file to $fastq_file for testing:
./fastq_stat.sh $fastq_file
Add this line to the script:
file_list="$1"
Now, replace the following line:
# Old line:
# sample_files=($(cut -f 3 "$sample_info"))
# New line:
sample_files=($(cat "$file_list"))
output_dir="$2"
# Create the output dir, if necessary:
mkdir -p "$output_dir"
mkdir -p
will not complain if the directory already exists, and it can make multiple levels of directories at once.
Change the line that runs the imaginary program fastq_stat
to call your script.
# Old line:
# fastq_stat "$fastq_file" > stats/$results_file
# New line:
scripts/fastq_stat.sh "$fastq_file" > "$output_dir"/"$results_file"
The results_file
line can remain the same:
results_file="$(basename $fastq_file .fastq)-stats.txt"
Add the following line inside the loop:
echo "Running fastq_stat for FASTQ file $fastq_file"
file_list=fastq_file_list.txt
ls data/*fastq >"$file_list"
output_dir=results/fastq_stats
./fastq_stat_loop.sh "$file_list" "$output_dir"
Make fastq_stat_loop.sh
executable and run it.
Run these lines:
$ chmod u+x ./fastq_stat_loop.sh
$ ./fastq_stat_loop.sh "$file_list" "$output_dir"
The full fastq_stat_loop.sh
script:
#!/bin/bash
set -e -u -o pipefail
file_list="$1"
output_dir="$2"
# Create the output dir, if necessary:
mkdir -p "$output_dir"
# Create an array with FASTQ files
fastq_files=($(cat "$file_list"))
# Report:
echo "Number of fastq files: ${#fastq_files[@]}"
# Loop through the array:
for fastq_file in "${fastq_files[@]}"; do
echo "Running fastq_stat for FASTQ file $fastq_file"
# Strip .fastq from each FASTQ file, and add suffix:
results_file="$(basename "$fastq_file" .fastq)-stats.txt"
# Run "fastq_stat" on a file:
scripts/fastq_stat.sh "$fastq_file" >"$output_dir"/"$results_file"
done
file_list=fastq_file_list.txt
output_dir=results/fastq_stats
ls data/*fastq >"$file_list"
./fastq_stat_loop.sh "$file_list" "$output_dir"
Add these lines to your script:
! -f
will be true if the file is not a regular/existing file
! -r
will be true if the file is not readable
We use ||
to separate the two conditions with a logical or.
With exit 1
, we terminate the script with an exit code that indicates failure.
if [ ! -f "$fastq_file" ] || [ ! -r "$fastq_file" ]; then
echo "Error: can't open file"
echo "Second argument should be a readable file"
echo "You provided: $fastq_file"
exit 1
fi
$ ./fastq_stat.sh blabla
Add these lines to your script:
$#
is the number of command-line arguments passed to the script
-eq
will test whether the numbers to the left and right of it are the same, and will return true
if they are.
We negate this with !
: if the number of arguments is NOT 1, then error out.
exit 1
will exit with exit code 1, which signifies an error / failure.
if [ ! "$#" -eq 1 ]; then # If the number of args does NOT equal 1, then
echo "Error: wrong number of arguments"
echo "You provided $# arguments, while 1 is required."
echo "Usage: fastq_stat.sh <file-name>"
exit 1
fi
$ ./fastq_stat.sh # No args
$ ./fastq_stat.sh $fastq_file blabla # Two args
Add the date
commands.
Simply include two lines with:
date
… in the script, one before file processing, and one after.
The full fastq_stat.sh
script:
#!/bin/bash
set -u -e -o pipefail
echo "$0: A script to compute basic summary stats for a FASTQ file."
date
echo
# Test number of args -------------------------------------------------
if [ ! "$#" -eq 1 ]; then
echo "Error: wrong number of arguments"
echo "You provided $# arguments, while 1 is required."
echo "Usage: fastq_stat.sh <file-name>"
exit 1
fi
# Process command-line args and report ------------------------------------
fastq_file="$1"
echo "FASTQ file to be analyzed: $fastq_file"
echo
if [ ! -f "$fastq_file" ] || [ ! -r "$fastq_file" ]; then
echo "Error: can't open file"
echo "Second argument should be a readable file"
echo "You provided: $fastq_file"
exit 1
fi
# Count the number of sequences -----------------------------------------
n_lines=$(wc -l <"$fastq_file")
n_seqs=$(expr "$n_lines" / 4)
echo "Number of sequences: $n_seqs"
# Count primer sequences ------------------------------------------------
if echo "$fastq_file" | grep "_R1_" >/dev/null; then
echo "FASTQ file contains forward (R1) reads, checking for primer 1..."
n_primerF=$(grep -Ec "GAGTG[CT]CAGC[AC]GCCGCGGTAA|TTACCGCGGC[GT]GCTG[AG]CACTC" "$fastq_file")
echo "Number of forward primer sequences found: $n_primerF"
elif echo "$fastq_file" | grep "_R2_" >/dev/null; then
echo "FASTQ file contains reverse (R2) reads, checking for primer 2..."
n_primerR=$(grep -Ec "ACGGACTAC[ACTG][ACG]GGGT[AT]TCTAAT|ATTAGA[AT]ACCC[CGT][ACTG]GTAGTCCGT" "$fastq_file")
echo "Number of reverse primer sequences found: $n_primerR"
fi
# Bonus: Count table of sequence lengths ---------------------------------
echo "Count table of sequence lengths:"
awk '{ if(NR%4 == 2) print length($0) }' "$fastq_file" | sort | uniq -c
# Report ----------------------------------------------------------------
echo
echo "Done with $0 for $fastq_file."
date
There are several ways to do this, here’s one example:
#!/bin/bash
set -u -e -o pipefail
dir="$1"
extension="$2"
wc -l "$dir"/*."$extension" | sort -rn | sed -n '2p'
You have to print the second rather than the first line: after the reverse numeric sort, the first line will be the total line count across all files, which wc
automatically computes.
Instead of using sed
, you could also use head -n 2 | tail -n 1
to print the second line:
wc -l "$dir"/*."$extension" | sort -rn | head -n 2 | tail -n 1
(Note also that there are two files with the same number of lines: R1
and R2
files for the same sample always have the same number of reads.)
#!/bin/bash
set -u -e -o pipefail
file="$1"
echo "Filename:"
echo "$file"
# We can get the number of rows simply by counting the number of lines:
# To avoid printing the filename we redirect the input like below
# (Or we could have done: "cat ../data/Saavedra2013/n10.txt | wc -l")
echo "Number of rows:"
wc -l < "$file"
# To count the number of columns, we use awk - recall that NF is the number
# of fields (columns), and to print that number only for a single line, we exit:
echo "Number of columns:"
awk '{ print NF; exit }' "$file"
# head -n 1 "$file" | awk '{ print NF }' # Also works
We can save this script as netsize.sh
and make it executable using chmod u+x netsize.sh
.
#!/bin/bash
set -u -e -o pipefail
dir="$1"
# We can loop over the files using globbing:
for file in "$dir"/*.txt; do
# Next, we can save the number of rows and columns in variables:
n_row=$(wc -l < "$file")
n_col=$(awk '{ print NF; exit }' "$file")
# And print them all on one line:
echo -e "$file \t $n_row \t $n_col"
done
We can save this script as netsize_all.sh
and run it as follows:
./netsize_all.sh ../data/Saavedra2013
# Having written the script netsize_all.sh,
# you can take its output and order it according to rows or columns.
# Sorting by column 2 gives you the file with the largest number of rows:
$ ./netsize_all.sh ../data/Saavedra2013 | sort -n -r -k 2 | head -n 1
#> ../data/Saavedra2013/n58.txt 678 90
# Sorting by column 3 gives you the file with the largest number of columns:
$ ./netsize_all.sh ../data/Saavedra2013 | sort -n -r -k 3 | head -n 1
#> ../data/Saavedra2013/n56.txt 110 207
#!/bin/bash
set -u -e -o pipefail
file=$1 # $1 is the file name
column=$2 # $2 is the column of interest
echo "Column name:"
cut -d ',' -f "$column" "$file" | head -n 1
# In the next lines, we need to skip the header, which we can do using
# tail -n +2
echo "Number of distinct values:"
cut -d ',' -f "$column" "$file" | tail -n +2 | sort | uniq | wc -l
echo "Minimum value:"
cut -d ',' -f "$column" "$file" | tail -n +2 | sort -n | head -n 1
echo "Maximum value:"
cut -d ',' -f "$column" "$file" | tail -n +2 | sort -n | tail -n 1
If we save this script as explore.sh
and make it executable, we can run it using:
./explore.sh ../data/Buzzard2015_data.csv 6
# Column name:
# Abund.n
# Number of distinct values:
# 46
# Minimum value:
# 1
# Maximum value:
# 157
# This works well also for alphabetical order:
./explore.sh ../data/Buzzard2015_data.csv 3
# Column name:
# genus
# Number of distinct values:
# 85
# Minimum value:
# Acacia
# Maximum value:
# Zanthoxylum
There initially was an error in these primer sequences (one sequence repeated twice), which has been corrected on Friday, Feb 12.↩︎
This exercise was slightly modified from Software Carpentry’s Shell Novice tutorial.↩︎
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".