The following exercises were copied from Chapter 1 of the CSB book, with hints and solutions modified from those provided in the book's GitHub repo.
(The exercises marked as "Advanced" are omitted, since they require topics not covered in the chapter itself; we will cover those in later modules.)
You should already have the book's GitHub repository with the exercise files. If not, go to /fs/ess/PAS1855/users/$USER and run git clone https://github.com/CSB-book/CSB.git. You should then have the CSB directory referred to in the first step of the first exercise.
(a) Go to your home directory. (For this course, go to /fs/ess/PAS1855/users/$USER instead.)
(b) Navigate to the sandbox directory within the CSB/unix directory.
(c) Use a relative path to go to the data directory within the python directory.
Recall that .. is one level up in the directory tree, and that you can combine multiple .. in a single path.
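For example, assuming you are two levels deep in some directory tree, the following would move up twice and then descend into a sibling directory (a generic illustration, not the answer to this exercise):
cd ../../other/dir # up two levels, then down into other/dir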
(d) Use an absolute path to go to the sandbox directory within python.
(e) Return to the data directory within the python directory.
Recall that there is a shortcut to return to your most recent working directory.
To familiarize yourself with these basic Unix commands, try the following:
(a) Go to the data directory within CSB/unix.
(b) How many lines are in the file Marra2014_data.fasta?
(c) Create the empty file toremove.txt in the CSB/unix/sandbox directory without leaving the current directory.
The touch command creates an empty file.
(d) List the contents of the directory unix/sandbox.
(e) Remove the file toremove.txt.
(a) If we order all species names (fifth column) of Pacifici2013_data.csv in alphabetical order, which is the first species? Which is the last?
You can either first select the fifth column using cut and then use sort, or directly tell the sort command which column to sort by. In either case, you'll also need to specify the column delimiter. To view just the first or the last line, pipe to head or tail.
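As a generic illustration (on a hypothetical comma-delimited file.csv, sorting its second column), the two approaches would look like:
cut -d ',' -f 2 file.csv | sort | head -n 1
sort -t ',' -k 2 file.csv | cut -d ',' -f 2 | head -n 1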
(b) How many families are represented in the database?
Use the tail trick shown in the chapter to exclude the header line, then remove duplicate values with uniq before counting.
In this exercise, we work with next-generation sequencing (NGS) data. Unix is excellent at manipulating the huge FASTA files that are generated in NGS experiments.
FASTA files contain sequence data in text format. Each sequence segment is preceded by a single-line description. The first character of the description line is a "greater than" sign (>).
The NGS data set we will be working with was published by Marra and DeWoody (2014), who investigated the immunogenetic repertoire of rodents. You will find the sequence file Marra2014_data.fasta in the directory CSB/unix/data. The file contains sequence segments (contigs) of variable size. The description of each contig provides its length, the number of reads that contributed to the contig, its isogroup (representing the collection of alternative splice products of a possible gene), and the isotig status.
1. Change directory to CSB/unix/sandbox.
2. What is the size of the file Marra2014_data.fasta?
To show the size of a file you can use the -l option of the command ls, and to display human-readable file sizes, the -h option.
3. Create a copy of Marra2014_data.fasta in the sandbox, and name it my_file.fasta.
Recall that with the cp command, it is also possible to give the copy a new name at once.
4. How many contigs are classified as isogroup00036?
Is there a grep option that counts the number of occurrences? Alternatively, you can pass the output of grep to wc -l.
5. Replace the original "two-spaces" delimiter with a comma.
In the file, the information on each contig is separated by two spaces:
>contig00001 length=527 numreads=2 ...
We would like to obtain:
>contig00001,length=527,numreads=2,...
Use cat to print the file, and substitute the spaces using the command tr. Note that you'll first have to reduce the two spaces to one; can you remember an option to do that?
In Linux, we can't write output to a file that also serves as input, so the following is not possible:
cat myfile.txt | tr "a" "b" > myfile.txt # Don't do this!
Instead, save the output in a temporary file and, on a separate line, overwrite the original file using mv.
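A safe generic pattern (with a hypothetical myfile.txt) therefore looks like this:
tr "a" "b" < myfile.txt > myfile.tmp # write to a temporary file first
mv myfile.tmp myfile.txt # then overwrite the original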
6. How many unique isogroups are in the file?
You can use grep to match any line containing the word isogroup. Then, use cut to isolate the part detailing the isogroup. Finally, you want to remove the duplicates and count.
7. Which contig has the highest number of reads (numreads)? How many reads does it have?
Use a combination of grep and cut to extract the contig names and read counts. The command sort allows you to choose the delimiter and to order numerically.
Gesquiere et al. (2011) studied hormone levels in the blood of baboons. Every individual was sampled several times.
1. How many times were individuals 3 and 27 sampled?
Use cut to extract just the maleID column from the file. Then use grep with the -w option to match whole "words" only: this will prevent an individual ID like "13" from matching when you search for "3".
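A quick demonstration of what -w does:
printf '3\n13\n' | grep 3 # matches both lines, "3" and "13"
printf '3\n13\n' | grep -w 3 # matches only the line with the whole word "3"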
2. You want to turn the solution of part 1 into a script; to do so, open a new file and copy the code you've written. In the script, you can use the first two so-called positional parameters, $1 and $2, to represent the file name and the maleID, respectively.
$1 and $2 represent the first and the second argument passed to a script, like so: bash myscript.sh arg1 arg2.
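For instance, a minimal sketch (a hypothetical myscript.sh that just echoes its arguments, not the solution) could be:
#!/bin/bash
# Print the first two positional parameters
echo "First argument (file name): $1"
echo "Second argument (maleID): $2"
Running bash myscript.sh somefile.csv 3 would then print both values.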
Saavedra and Stouffer (2013) studied several plant–pollinator networks. These can be represented as rectangular matrices where the rows are pollinators and the columns plants; a 0 indicates the absence, and a 1 the presence, of an interaction between the plant and the pollinator.
The data of Saavedra and Stouffer (2013) can be found in the directory CSB/unix/data/Saavedra2013.
Write a script that takes one of these files and determines the number of rows (pollinators) and columns (plants). Note that columns are separated by spaces and that there is a space at the end of each line. Your script should return:
$ bash netsize.sh ../data/Saavedra2013/n1.txt
# Filename: ../data/Saavedra2013/n1.txt
# Number of rows: 97
# Number of columns: 80
To build the script, you need to combine several commands:
To find the number of rows, you can use wc.
To find the number of columns, take the first line, remove the spaces, remove the line terminator \n, and count the characters (see the small demonstration after these hints).
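To see the character-counting trick in isolation, here is a small demonstration on a literal string instead of a data file:
printf '0 1 0 1 \n' | tr -d ' ' | tr -d '\n' | wc -c
# 4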
Buzzard et al. (2016) collected data on the growth of a forest in Costa Rica. In the file Buzzard2015_data.csv you will find a subset of their data, including taxonomic information, abundance, and biomass of trees.
Write a script that, for a given CSV file and column number, prints the column name, the number of distinct values in that column, and the minimum and maximum values.
For example, running the script with:
$ bash explore.sh ../data/Buzzard2015_data.csv 7
should return:
Column name:
biomass
Number of distinct values:
285
Minimum value:
1.048466198
Maximum value:
14897.29471
You can select a given column from the csv file using the command cut. Then:
To print the column name, take the first line using head.
To count the distinct values, exclude the header (using the tail trick), then use sort and uniq.
To find the minimum and maximum values, sort numerically and take the first or last line using head (or tail).
In the script, use the positional parameters $1 and $2 for the file name and column number, respectively.
(a) Go to your home directory. (For this course, go to /fs/ess/PAS1855/users/$USER instead.)
$ cd /fs/ess/PAS1855/users/$USER # To home would have been: "cd ~"
(b) Navigate to the sandbox directory within the CSB/unix directory.
cd CSB/unix/sandbox
(c) Use a relative path to go to the data directory within the python directory.
cd ../../python/data
(d) Use an absolute path to go to the sandbox directory within python.
cd /fs/ess/PAS1855/users/$USER/CSB/python/sandbox
(e) Return to the data directory within the python directory.
cd -
(a) Go to the data directory within CSB/unix.
$ cd /fs/ess/PAS1855/users/$USER/CSB/unix/data
(b) How many lines are in the file Marra2014_data.fasta?
wc -l Marra2014_data.fasta
(c) Create the empty file toremove.txt in the CSB/unix/sandbox directory without leaving the current directory.
touch ../sandbox/toremove.txt
(d) List the contents of the directory unix/sandbox.
ls ../sandbox
(e) Remove the file toremove.txt.
rm ../sandbox/toremove.txt
(a) If we order all species names (fifth column) of Pacifici2013_data.csv in alphabetical order, which is the first species? And which is the last?
cd /fs/ess/PAS1855/users/$USER/CSB/unix/data/
# First species:
cut -d ";" -f 5 Pacifici2013_data.csv | sort | head -n 1
# Last species:
cut -d ";" -f 5 Pacifici2013_data.csv | sort | tail -n 1
# Or, using sort directly, but then you get all columns unless you pipe to cut:
sort -t";" -k 5 Pacifici2013_data.csv | head -n 1
In addition to the output you wanted, you may have gotten errors like this:
sort: write failed: 'standard output': Broken pipe
sort: write error
This may seem disconcerting, but it is nothing to worry about, and it has to do with the way data streams through a pipe: after head/tail exits because it has done what was asked (print one line), sort or cut may still try to pass on data.
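You can reproduce this early-exit behavior with a command that produces endless output:
yes | head -n 3
# y
# y
# y
Here, head exits after printing three lines, and yes is then terminated because it tries to write to a closed pipe.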
(b) How many families are represented in the database?
cut -d ";" -f 3 Pacifici2013_data.csv | tail -n +2 | sort | uniq | wc -l
1. Change directory to CSB/unix/sandbox.
cd /fs/ess/PAS1855/users/$USER/CSB/unix/sandbox
2. What is the size of the file Marra2014_data.fasta?
ls -lh ../data/Marra2014_data.fasta
# Among other output, this should show a file size of 553K
Alternatively, the command du (disk usage) can be used for more compact output:
du -h ../data/Marra2014_data.fasta
3. Create a copy of Marra2014_data.fasta in the sandbox, and name it my_file.fasta.
cp ../data/Marra2014_data.fasta my_file.fasta
ls # Check that the copy is there
4. How many contigs are classified as isogroup00036?
To count the occurrences of a given string, use grep with the option -c:
grep -c isogroup00036 my_file.fasta
# 16
Alternatively, pipe the output of grep to wc -l to count:
grep isogroup00036 my_file.fasta | wc -l
5. Replace the original “two-spaces” delimiter with a comma.
We use tr to replace each space with a comma, with its option -s (squeeze) collapsing the resulting repeated commas into one:
cat my_file.fasta | tr -s ' ' ',' > my_file.tmp
mv my_file.tmp my_file.fasta
6. How many unique isogroups are in the file?
Searching for > with grep will extract all lines with contig information:
grep '>' my_file.fasta | head -n 2
# >contig00001,length=527,numreads=2,gene=isogroup00001,status=it_thresh
# >contig00002,length=551,numreads=8,gene=isogroup00001,status=it_thresh
Then, use cut to extract the 4th column:
grep '>' my_file.fasta | cut -d ',' -f 4 | head -n 2
# gene=isogroup00001
# gene=isogroup00001
Finally, pipe through sort, uniq, and wc -l to count the number of unique occurrences:
grep '>' my_file.fasta | cut -d ',' -f 4 | sort | uniq | wc -l
# 43
7. Which contig has the highest number of reads (numreads)? How many reads does it have?
We need to isolate the number of reads as well as the contig names. We can use a combination of grep and cut:
grep '>' my_file.fasta | cut -d ',' -f 1,3 | head -n 3
# >contig00001,numreads=2
# >contig00002,numreads=8
# >contig00003,numreads=2
Now we want to sort according to the number of reads. However, the number of reads is part of a more complex string. We can use -t '=' to split according to the = sign, and then take the second column (-k 2) to sort numerically (-n):
grep '>' my_file.fasta | cut -d ',' -f 1,3 | sort -t '=' -k 2 -n | head -n 5
# >contig00089,numreads=1
# >contig00176,numreads=1
# >contig00210,numreads=1
# >contig00001,numreads=2
# >contig00003,numreads=2
Using the flag -r we can sort in reverse order:
grep '>' my_file.fasta | cut -d ',' -f 1,3 | sort -t '=' -k 2 -n -r | head -n 1
# >contig00302,numreads=3330
First, let's have a look at the file:
head -n 3 ../data/Gesquiere2011_data.csv
# maleID GC T
# 1 66.9 64.57
# 1 51.09 35.57
Extract the maleID column using cut:
cut -f 1 ../data/Gesquiere2011_data.csv | head -n 3
# maleID
# 1
# 1
# For maleID 3
cut -f 1 ../data/Gesquiere2011_data.csv | grep -c -w 3
# 61
# For maleID 27
cut -f 1 ../data/Gesquiere2011_data.csv | grep -c -w 27
# 5
In its simplest form, the script consists of a single line:
cut -f 1 $1 | grep -c -w $2
A more elaborate version assigns the arguments to named variables first:
#!/bin/bash
filename=$1
ind_ID=$2
echo "Counting the nr of occurrences of individual ${ind_ID} in file ${filename}:"
cut -f 1 ${filename} | grep -c -w ${ind_ID}
Variables are assigned using name=value (no dollar sign!), and recalled using $name or ${name}. It is good practice to put curly braces around the variable name. We will talk more about bash variables in the next few weeks.
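For example:
name="some value" # assignment: no spaces around the =, no dollar sign
echo "$name" # recall the value with $name
echo "${name}" # equivalent, with curly braces
echo "${name}_ext" # here the braces are needed: $name_ext would be a different variable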
If we save the script as count_baboons.sh, we can run it as follows:
bash count_baboons.sh ../data/Gesquiere2011_data.csv 27
# 5
Counting rows: use wc -l. For example:
wc -l ../data/Saavedra2013/n10.txt
# 14 ../data/Saavedra2013/n10.txt
To report just the number, without the file name, let wc read from standard input instead:
cat ../data/Saavedra2013/n10.txt | wc -l
wc -l < ../data/Saavedra2013/n10.txt
# 14
Counting columns: first, isolate the first line of the file:
head -n 1 ../data/Saavedra2013/n10.txt
# 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
Next, remove the spaces and the line terminator:
head -n 1 ../data/Saavedra2013/n10.txt | tr -d ' ' | tr -d '\n'
# 01000001000000000100
Finally, use wc -c to count the number of characters in the string:
head -n 1 ../data/Saavedra2013/n10.txt | tr -d ' ' | tr -d '\n' | wc -c
# 20
Final script:
#!/bin/bash
filename=$1
echo "Filename:"
echo ${filename}
echo "Number of rows:"
cat ${filename} | wc -l
echo "Number of columns:"
head -n 1 ${filename} | tr -d ' ' | tr -d '\n' | wc -c
If we save the script as counter.sh, we can run it as follows:
bash counter.sh ../data/Saavedra2013/n10.txt
# Filename:
# ../data/Saavedra2013/n10.txt
# Number of rows:
# 14
# Number of columns:
# 20
We’ll learn about a quicker and more general way to count columns in a few weeks.
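(For the curious: one such approach uses awk, which we have not covered yet; it counts the fields on a line automatically:)
head -n 1 ../data/Saavedra2013/n10.txt | awk '{print NF}'
# 20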
To get the column name, select the column with cut and print its first line:
cut -d ',' -f 7 ../data/Buzzard2015_data.csv | head -n 1
# biomass
To count the distinct values, exclude the header, then sort and remove duplicates with uniq before counting:
cut -d ',' -f 7 ../data/Buzzard2015_data.csv | tail -n +2 | sort | uniq | wc -l
# 285
To find the extremes, sort numerically using -n, and take either head (for the minimum) or tail (for the maximum) of the result:
# Minimum:
cut -d ',' -f 7 ../data/Buzzard2015_data.csv | tail -n +2 | sort -n | head -n 1
# 1.048466198
# Maximum:
cut -d ',' -f 7 ../data/Buzzard2015_data.csv | tail -n +2 | sort -n | tail -n 1
# 14897.29471
#!/bin/bash
filename=$1
column_nr=$2
echo "Column name"
cut -d ',' -f ${column_nr} ${filename} | head -n 1
echo "Number of distinct values:"
cut -d ',' -f ${column_nr} ${filename} | tail -n +2 | sort | uniq | wc -l
echo "Minimum value:"
cut -d ',' -f ${column_nr} ${filename} | tail -n +2 | sort -n | head -n 1
echo "Maximum value:"
cut -d ',' -f ${column_nr} ${filename} | tail -n +2 | sort -n | tail -n 1
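Saving the script as explore.sh, we can run it as shown in the exercise statement:
bash explore.sh ../data/Buzzard2015_data.csv 7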