Exercises: Week 4

Main exercises

These exercises will use the same files that we’ve been working with in class: those in chapter-07-unix-data-tools in the bds-files repository for the Buffalo book.

You can either work directly in that directory, or create a “sandbox” directory with the Buffalo files copied.

For many of the exercises, there are multiple ways of solving them. The hints and solutions provide one or a few ways, but you may well end up using a different approach. If you struggle, try to follow the approach suggested by the hints, so you can later cross-check with solutions that use this same approach.

For most questions, the code required is quite similar to the examples in Buffalo chapter 7 (and the slides). If the hints are too cryptic for you, I recommend you first go back to the section on the relevant command in the chapter.

Exercise 1: Pseudogenes with grep, uniq -c, and others

In this exercise, we will start using the file Mus_musculus.GRCm38.75_chr1.gtf that we’ve seen in class.

Let’s say that you need to extract annotation information for all pseudogenes on mouse chromosome 1 from the GTF, and write the results to a new GTF file. We’ll do that in this exercise with a detour along the way.

  1. Count the number of lines with the string “pseudogene”.
Hints

Use grep -c to count the matches.

Does this mean there are over a 1,000 pseudogenes on this chromosome? No, because there are “nested” genomic features in the file: for each gene, several “subfeatures” are listed, such as individual exons. This genomic feature type information is listed in column 3.

  1. For lines that match “pseudogene” as above, which different feature types are listed in column 3, and what are their counts (i.e., create a count table for column 3)?

    Are there any coding sequences (“CDS”), start codons (“start_codon”) and stop codons (“stop_codon”)?

Hints

To answer this question, compute count tables using sort | uniq -c after grepping for “pseudogene”.

  1. If you are more restrictive in your grep matching to only match “pseudogene” as a full “word”, do you still see CDS or start/stop codons?
Hints

Use grep -w to only match full “words”: consecutive alphanumeric characters and the underscore.

This would avoid matching longer words that contain “pseudogene”, such as “transcribed_pseudogene”.

  1. We have “exon”, “transcript” and “gene” features in column 3, but we want just the gene-level coordinates of the pseudogenes: after grepping for “pseudogene” like in the previous question, select only rows for which the feature type in column 3 is gene.

    Pipe your output into head or less; we don’t want to write to a new file yet.

Hints

The “gene” you want to match is in its own column. Matching restrictively using the -w flag to only match whole “words” will work in this case.

Even better would be to explicitly match the tabs surrounding “gene”. This can be done using the -P flag to grep (grep -P) and the \t symbol for tabs.

Yet another approach that is even more explicit uses awk to make the match specifically for the 3rd column (recall that $3 is the third column in awk, so you can use $3 == "gene"). If you try with awk, you’ll have to make sure to ignore the header lines, which are not tabular.

  1. Comprehensively check that your output only contains “gene” in column 3.
Hints

Pipe your previous command into cut and then sort | uniq -c to quickly check which values are present in column 3.

  1. Now, you are almost ready to write the results to a new GTF file, called pseudogenes.gtf, with the features you selected above: gene-level pseudogene features.

    One challenge is that you also want the header, which you won’t get when redirecting the output from the previous command.

    A straightforward solution would be to first write the header to a new file, and then append the lines with the selected features.

    Alternatively, you can use subshells to do this in a single go. If you want to try this, make sure you have read the Buffalo section on subshells (p. 169-170) first.

Hints

Structure the subshell solution as follows:
(<select-header> ; <select-from-rest-of-file>) > pseudogenes.gtf


Exercise 2: Calculations with awk

In this exercise, you will continue working with the GTF file Mus_musculus.GRCm38.75_chr1.gtf.

  1. How many elements are at least 1,000 base pairs long?
Hints

  1. How many genes (“gene” in column 3) on the + strand (“+” in column 7) are at least 1,000 base pairs long?
Hints

Adjust your command for the previous question as follows:

  1. What’s the combined size of all genes?
    (On this chromosome, that is: this GTF only contains a single chromosome.)
Hints

  1. What is the mean UTR length?
Hints

The code is very similar to that for the previous question; just grep for “UTR” instead, and calculate the mean by dividing the final value of awk variable by the number of rows (NR) in the END statement.


Exercise 3: Replacements with sed

The genotypes.txt contains diploid genotypes for 4 loci for 9 individuals.

  1. Remove all underscores from genotypes.txt and write the output to a new file.
Hints

  1. Replace the / delimiter for genotypes (G/G) in genotypes.txt by a colon (G:G) and write the output to a new file.
Hints


Bonus exercises

Exercise 4: FASTA

  1. First, convert the FASTQ file contaminated.fastq to a FASTA file using the following awk command, and try to understand what it does, by and large:
awk '{ if(NR%4 == 1) { printf(">%s\n", substr($0,2)); } else if(NR%4 == 2) print; }' \
    contaminated.fastq > contaminated.fasta
  1. Select FASTA entries (sequence and header) from contaminated.fasta with the sequence AGATCGGAAGA and write them to a new file, contaminated_sel.fasta. (Your new file should have 6 sequences, so 12 lines.)
Hints

Match the sequence with grep and use the -B option to also retrieve the preceding line, which is the sequence header/identifier.

In addition, use --no-group-separator to prevent grep from printing -- between matches (alternatively, you could use another call to grep with -v to filter out -- lines).

  1. Check how many sequences, if any, have one ore more undetermined bases (N).
Hints

Make sure you first exclude the header lines, since you only want to match N in the sequences.

  1. Apparently, there may have been a problem with cycle 50 in the Illumina run.
    Get the frequency of each base for position 50 in the FASTA file.
Hints

Use cut -c <n> to extract the 50th position.


Exercise 5: Exons

In this exercise, you will once again work with the GTF file Mus_musculus.GRCm38.75_chr1.gtf.

Now, let’s explore some statistics on exons in the GTF file.

  1. Create a tab-separated, two-column file called exons.txt that lists each unique “exon_id” along with its “gene_id”, sorted by gene ID. (The “exon_id” and “gene_id” key-value pairs are in the huge final column.)

    The first couple of lines of the file should look as follows:

      ENSMUSG00000000544      ENSMUSE00000160546
      ENSMUSG00000000544      ENSMUSE00000160549
      ENSMUSG00000000544      ENSMUSE00000226264
      ENSMUSG00000000544      ENSMUSE00000395516
Hints

Use the exons.txt file you created to answer the following questions.

  1. On average, how many exons does each gene have?
Hints

  1. What is the highest number of exons for one gene?
Hints

Use uniq -c followed by a reverse numeric sort.

  1. Create a count table for the number of exons per gene. It should tell you how many genes have one exon, how many have two exons, etc.
Hints


Exercise 6: Miscellaneous

In this exercise, you will once again work with the GTF file Mus_musculus.GRCm38.75_chr1.gtf.

  1. Count the total number of features on each strand (+ or -, column 7).

  2. Count the number of features from each of the three values for “gene_source” (“ensembl”, “havana”, and “ensembl_havana”).

Hints

You can either use sed with a backreference to capture the “gene_source” type, or grep -o to capture a larger match followed by cut and sed to extract just the “gene_source” type.

  1. How many genes have a UTR (untranslated region)? Create a list of “gene_id”s for genes with a UTR.
Hints

First grep for UTR, then for the “gene_id” while also extracting the match with grep -o. Once again, the extraction of a match can alternatively be done with a backreference in sed.


Solutions

Exercise 1: Pseudogenes with grep, uniq -c, and others

To start with, I will assign the name of the GTF file to the variable $gtf, for quick reference in the solutions for the and the next exercises:

gtf=Mus_musculus.GRCm38.75_chr1.gtf
(1.) Count the number of lines with the string “pseudogene”.

$ grep -c "pseudogene" $gtf

#> 1041
The -c option to grep will count the number of matching lines.

(2.) Count feature types.

$ grep -v "^#" $gtf | grep "pseudogene" | cut -f 3 | sort | uniq -c

#>     15 CDS
#>    524 exon
#>    223 gene
#>      4 start_codon
#>      4 stop_codon
#>    264 transcript
#>      7 UTR

The first line computes a count table for pseudogenes, and the second line computes an equivalent count table for all genes.

(3.) Count feature types after more restrictive matching.

$ grep -v "^#" $gtf | grep -w "pseudogene" | cut -f 3 | sort | uniq -c

#>    499 exon
#>    221 gene
#>    258 transcript

This is very similar to the solution for the previous exercise, but now we use the -w option to grep to only select full words (consecutive alphanumeric characters and the underscore).

(4.) Select “pseudogene” matches for which the feature type in column 3 is “gene”.

$ grep -v "^#" $gtf | grep -w "pseudogene" | grep -w "gene" | head

# Alternative 1: More explicit match by matching tabs:
$ grep -v "^#" $gtf | grep -w "pseudogene" | grep -P "\tgene\t" | head

# Alternative 2: More explicit match by matching column 3 only:
$ grep -v "^#" $gtf | grep -w "pseudogene" | awk '$3 == "gene"' | head

To match tabs like in the second solution, we need to use greps -P option, for "Perl-like regular expressions. (Yes, it confusing and unfortunate that there are three types of regular expressions in grep!)

In the third solution we explicitly and exactly match column 3 with awk – this is the most robust solution.

(5.) Check that your output only contains “gene” in column 3.

$ grep -v "^#" $gtf | grep "pseudogene" | grep -w "gene" | \
      cut -f 3 | sort | uniq -c

#> 223 gene

We select the third column and use the sort | uniq -c combination to create a count table, which will quickly tell us whether there are multiple values in the column.

(6.) Write the pseudogene selection to a new GTF file.

Solution when writing the header and the rest of the file separately (don’t forget to use >> for appending the output of the second line, if you use this method!):

$ grep "^#" $gtf > pseudogenes.gtf
$ grep -v "^#" $gtf | \
      awk '$2 == "pseudogene" && $3 == "gene"' \
      >> pseudogenes.gtf

Solution with subshells:

$ (grep "^#" $gtf; grep -v "^#" $gtf | \
      awk '$2 == "pseudogene" && $3 == "gene"') \
      > pseudogenes.gtf

Note: rather than with grep -v "^#", we can also exclude the header using !/^#/ in awk:

$ (grep "^#" $gtf; \
      awk '!/^#/ && $2 == "pseudogene" && $3 == "gene"' $gtf) \
      > pseudogenes.gtf

Exercise 2: Calculations with awk

(1.) How many features are at least 1,000 bp long?

$ grep -v "^#" $gtf | \
      awk '$5 - $4 > 999' | \
      wc -l

#> 8773
# Alternative that excludes the header in awk:
$ awk '!/^#/ && $5 - $4 > 999' $gtf | \
      wc -l

#> 8773

(2.) How many genes on the + strand are at least 1,000 bp long?

$ grep -v "^#" $gtf | \
      awk '$3 == "gene" && $7 == "+" && $5 - $4 > 999' | \
      wc -l

#> 725

(3.) What’s the combined size of all genes?

$ grep -v "^#" $gtf | \
      grep -w "gene" | \
      awk 'BEGIN { s=0 }; { s += $5 - $4 }; END { print s }'
  
#> 78935729    # 78,935,729 => ~79 Mbp

(4.) What is the mean UTR length?

$ grep -v "^#" $gtf | \
      grep -w "UTR" | \
      awk 'BEGIN { s=0 }; { s += $5 - $4 }; END { print s/NR }'
      
#> 472.665      

This code is very similar to that for the previous question (see that solution for more details), but now we divide the sum (variable s) by NR, an automatic variable in awk that holds the current record (line) number. Since the END statement is executed after all lines have been processed, the record number will be the total number of lines in the file.

Exercise 3: Replacements with sed

(1.) In genotypes.txt, remove all underscores.

$ sed 's/_//g' genotypes.txt > genotypes2.txt

$ cat genotypes2.txt
#> id      indA    indB    indC    indD
#> S000    G/G     A/G     A/A     A/G
#> S001    A/T     A/T     T/T     T/T
#> S002    C/T     T/T     C/C     C/T
#> S003    C/T     C/T     C/T     C/C
#> S004    C/G     G/G     C/C     C/G
#> S005    A/T     A/T     A/T     T/T
#> S006    C/G     C/C     C/G     C/G
#> S007    A/G     G/G     A/A     G/G
#> S008    G/T     G/T     T/T     G/T
#> S009    C/C     C/C     A/A     A/C
Since there are multiple underscores on line 1, you need the g modifier.

(2.) In genotypes.txt, replace the / delimiter by a colon (:).

$ sed 's#/#:#g' genotypes.txt > genotypes3.txt

$ cat genotypes3.txt
#> id      ind_A   ind_B   ind_C   ind_D
#> S_000   G:G     A:G     A:A     A:G
#> S_001   A:T     A:T     T:T     T:T
#> S_002   C:T     T:T     C:C     C:T
#> S_003   C:T     C:T     C:T     C:C
#> S_004   C:G     G:G     C:C     C:G
#> S_005   A:T     A:T     A:T     T:T
#> S_006   C:G     C:C     C:G     C:G
#> S_007   A:G     G:G     A:A     G:G
#> S_008   G:T     G:T     T:T     G:T
#> S_009   C:C     C:C     A:A     A:C

Because the pattern we want to match is a /, we use a different symbol in the sed syntax: here, I chose a #.

Exercise 4: FASTA

(2.) Write sequences with AGATCGGAAGA into a new file.

$ grep -B 1 --no-group-separator "AGATCGGAAGA" contaminated.fasta \
      > contaminated_sel.fasta

$ wc -l < contaminated_sel.fasta

#> 12

# Alternative to `--no-group-separator`;
# remove the lines with `--` afterwards:
$ grep -B 1 "AGATCGGAAGA" contaminated.fasta | \
      grep -v "--" > contaminated_sel.fasta

(3.) Check how many sequences have one or more undetermined bases.

Make sure you first exclude the header lines, since you only want to match “N” in the sequences themselves:

$ grep -v "^>" contaminated.fasta | grep -c "N"

#> 0

There are no Ns in the sequence.

(4.) Get the frequency of each base for position 50 in the fasta file.

$ grep -v "^>" contaminated.fasta | cut -c 50 | sort | uniq -c 

#> 2 C
#> 3 G
#> 3 T

Exercise 5: Exons

(1.) Create a tab-delimited file with “exon_id” and “gene_id”

$ grep "gene_id.*exon_id" $gtf | \
      sed -n -E 's/.*gene_id "(\w+)".*exon_id "(\w+).*/\1\t\2/p' | \
      sort | uniq > exons.txt

(2.) On average, how many exons does each gene have?

$ n_genes=$(wc -l < exons.txt)
$ n_exons=$(cut -f 1 exons.txt | uniq | wc -l)

# Divide using `expr`:
$ expr $(wc -l < exons.txt) / $(cut -f 1 exons.txt | uniq | wc -l)

#> 11

# Alternatively, do the division with Python:
$ python -c "print($n_genes / $n_exons)"

#> 11.627

(3.) What is the highest number of exons for one gene?

$ cut -f 1 exons.txt | uniq -c | sort -rn | head

#>    134 ENSMUSG00000026131
#>    116 ENSMUSG00000066842
#>    112 ENSMUSG00000026207
#>    108 ENSMUSG00000026609
#>    102 ENSMUSG00000006005
#>     97 ENSMUSG00000037470
#>     93 ENSMUSG00000026141
#>     92 ENSMUSG00000026490
#>     91 ENSMUSG00000048000
#>     82 ENSMUSG00000073664

“ENSMUSG00000026131” has as many as 134 exons!!

(4.) Create a count table for the number of exons per gene.

$ cut -f 1 exons.txt | uniq -c | \
      sed 's/^ *//' | cut -d " " -f 1 | \
      sort -rn | uniq -c | \
      sort -rn > exon_count_table.txt

$ head exon_count_table.txt     
#>    646 1   # 646 genes with 1 exon
#>    186 2   # 186 genes with 2 exons
#>     94 3   # etc...
#>     71 5
#>     71 4
#>     53 8
#>     51 7
#>     51 6
#>     44 11
#>     41 10

Exercise 6: Miscellaneous

(1.) Count the number of elements on each strand.

$ grep -v "^#" $gtf | cut -f 7 | sort | uniq -c

#> 40574 +
#> 40652 -

(2.) Count the number of elements for each of the three values for “gene_source”.

# Capture the gene_source with sed:
$ sed -n -E 's/.*gene_source "(\w+)".*/\1/p' $gtf | \
      sort | uniq -c

#> 18209 ensembl
#> 61089 ensembl_havana
#> 1928 havana

# Alternatively, capture the gene_source with `grep -o`:
$ grep -E -o 'gene_source "\w+"' $gtf | \
      cut -f2 -d" " | sed 's/"//g' | \
      sort | uniq -c
      
#> 18209 ensembl
#> 61089 ensembl_havana
#> 1928 havana

(3.) How many genes have a UTR? Create a list of gene_ids for genes with a UTR.

$ grep -v "^#" $gtf | grep -w "UTR" | \
   grep -E -o 'gene_id "\w+"' | sort | uniq | wc -l

#> 1179

# Create the list of genes:
$ grep -v "^#" $gtf | grep -w "UTR" | \
      grep -E -o 'gene_id "\w+"' | \
      cut -f2 -d" " | sed 's/"//g' | \
      sort | uniq > genes_with_UTR.txt

# Alternative with sed:
$ grep -v "^#" $gtf | grep -w "UTR" | \
      sed -n -E 's/.*gene_id "(\w+)".*/\1/p' | \
      sort | uniq | wc -l > genes_with_UTR.txt

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".