In week 1 (and 2), we learned the basics of the Unix shell.
This week, we'll focus on commands to inspect and process data:
We'll revisit some commands from CSB Chapter 1
(sort
, uniq
, grep
, ...) and learn about using regular expressions
in the shell.
We will also learn several new commands including the two very powerful
commands sed
and awk
(Thursday).
Start VS Code in OSC OnDemand => open your workspace =>
open a terminal => type bash
to break out of the Singularity shell.
You can either enter today's commands directly in the terminal,
or run them from a new .sh
file or a .md
file with code blocks.
If you want to do the latter: create and open this file.
Then:
# You should be in your dir already, otherwise:# $ cd /fs/ess/PAS1855/users/$USER# Download the repository for the Buffalo book:$ git clone https://github.com/vsbuffalo/bds-files.git# Move into the dir for this chapter:$ cd bds-files/chapter-07-unix-data-tools# Take a look at what's there:$ ls
FASTA (.fasta
or .fa
)
FASTQ (.fastq
or .fq
)
BED (.bed
)
GTF (.gtf
)
If you want to learn more about these and other formats,
see Buffalo ch. 10 & 11.
.fasta
)Can have one or more sequences of any length.
For every sequence:
The first line has a free form sequence identifier starting with >
.
The second line has the sequence (sometimes spread across multiple lines...).
>unique_sequence_ID Optional description (free form!)ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA>unique_sequence_ID2ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG
.fastq
).fastq
)Basically a FASTA file with quality scores, but also more formalized. Represents individual reads from a sequencer.
Each sequence covers exactly 4 lines:
Line | Description |
---|---|
1 | Sequence header: begins with @ , then information about the read |
2 | DNA sequence |
3 | + separator |
4 | Quality scores: a single character for each base, 1-on-1 correspondence with sequences on line 2. |
@DJB775P1:248:D0MDGACXX:7:1202:12362:49613CACTGCTTGCTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTACTCTGCGTT+CCCFFFFFHHHHHGHGFHIJJ9HHIIJJJJJIJJJJIIJJJIIJIJJJJJJJJJIIJJJJJJHI@DJB775P1:248:D0MDGACXX:7:1202:12782:49716TTACTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACC+;@@FDFFFH0@DDGHIIGHIIIHIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIII
.bed
)Often just contains genomic coordinates in 3 columns:
Chromosome / scaffold
Start position
End position
chr1 26 39chr1 32 47chr3 11 28chr1 40 49
.gtf
)While genome sequences are stored in FASTA files, annotation information is stored in files like GTF.
Like a BED file, these contain coordinates for genomic features genes, transcripts, exons, etc., one per row.
But they also contain lots of additional information for each feature:
# chrom biotype* feature start end score strand frame attribute1 snRNA gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA";1 snRNA transcript 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcript_id "ENSMUST00000082908"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl";1 snRNA exon 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcript_id "ENSMUST00000082908"; exon_number "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl"; exon_id "ENSMUSE00000522066";
.gtf
)While genome sequences are stored in FASTA files, annotation information is stored in files like GTF.
Like a BED file, these contain coordinates for genomic features genes, transcripts, exons, etc., one per row.
But they also contain lots of additional information for each feature:
# chrom biotype* feature start end score strand frame attribute1 snRNA gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA";1 snRNA transcript 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcript_id "ENSMUST00000082908"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl";1 snRNA exon 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcript_id "ENSMUST00000082908"; exon_number "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl"; exon_id "ENSMUSE00000522066";
Note that GFF (.gff
) files are more common nowadays and very similar.
head
and tail
again$ mus=Mus_musculus.GRCm38.75_chr1 # Save for quick recall
head
/ tail
will simply show the first / last few lines of a file.$ head "$mus".bed # First/last 10 lines$ tail -n 7 "$mus".gtf # Specify with -n
head
and tail
again$ mus=Mus_musculus.GRCm38.75_chr1 # Save for quick recall
head
/ tail
will simply show the first / last few lines of a file.$ head "$mus".bed # First/last 10 lines$ tail -n 7 "$mus".gtf # Specify with -n
$ head -n 1866 "$mus".bed | tail -n 1 # Prints line nr. 1866
head
and tail
again$ mus=Mus_musculus.GRCm38.75_chr1 # Save for quick recall
head
/ tail
will simply show the first / last few lines of a file.$ head "$mus".bed # First/last 10 lines$ tail -n 7 "$mus".gtf # Specify with -n
$ head -n 1866 "$mus".bed | tail -n 1 # Prints line nr. 1866
Also useful to see if a pipeline is doing what it should:
$ grep "string" huge_file.txt | program1 | program2 | head -n 5
less
pagerless
doesn't load entire files into memory: easy to look at large files.$ less "$mus".gtf
less
pagerless
doesn't load entire files into memory: easy to look at large files.$ less "$mus".gtf
Key | Function |
---|---|
q | Exit less |
space / b | Next / previous page. ( pgup / pgdn usually also work.) |
d / u | Go down / up half a page. |
g / G | Go to the first / last line (home / end also work) |
/<pattern> ? <pattern> |
Search forward/backward: next type keyword to search for |
n / N | Go to next/previous search match |
less
Open a FASTQ file with less
:
$ less contaminated.fastq
Jump back to the last line and then back to the first line.
Now search for the following pattern: AGATCGG
.
Move to the next match and the next.
Exit.
Regular expressions are character sequences defining a search pattern, with symbols with a special, non-literal meaning (More in Week 11!).
Some symbols mentioned in this chapter:
Symbol | Matches | Example |
---|---|---|
. |
Any character | grep -o "Olfr.*" |
* |
Preceding character any # times |
grep -o "Olfr.*" |
Regular expressions are character sequences defining a search pattern, with symbols with a special, non-literal meaning (More in Week 11!).
Some symbols mentioned in this chapter:
Symbol | Matches | Example |
---|---|---|
. |
Any character | grep -o "Olfr.*" |
* |
Preceding character any # times |
grep -o "Olfr.*" |
\t |
tab | echo -e "column1 \t column2" |
\n |
newline | echo -e "Line1 \n Line2" |
Regular expressions are character sequences defining a search pattern, with symbols with a special, non-literal meaning (More in Week 11!).
Some symbols mentioned in this chapter:
Symbol | Matches | Example |
---|---|---|
. |
Any character | grep -o "Olfr.*" |
* |
Preceding character any # times |
grep -o "Olfr.*" |
\t |
tab | echo -e "column1 \t column2" |
\n |
newline | echo -e "Line1 \n Line2" |
^ / $ |
beginning/end of line |
grep -v "^$" |
\w |
any alphanumeric character and "_" |
grep -E -o 'gene_id "\w+"' |
| |
logical or | grep -E "intron|exon" grep "intron\|exon" |
grep
and sed
use Basic Regular Expressions (BRE) —grep -E
or sed -E
.
awk
uses ERE by defaults.Where this gets particularly confusing when googling around, is that there are both "POSIX" and "GNU" versions of each. We are using GNU tools, so the GNU regex are relevant to us.
grep
and sed
use Basic Regular Expressions (BRE) —grep -E
or sed -E
.
awk
uses ERE by defaults.Where this gets particularly confusing when googling around, is that there are both "POSIX" and "GNU" versions of each. We are using GNU tools, so the GNU regex are relevant to us.
ERE | BRE | Meaning |
---|---|---|
? |
\? |
Matches preceding character at most once |
+ |
\+ |
Matches preceding character at least once |
{m,n} |
\{m,n\} |
Matches preceding character m to n times |
| |
\| |
Logical or |
(ab)\1 |
\(ab \)\1 |
Backreference capture with () and recall with \1 ; matches "abab" |
Wildcard | Regex symbol(s) | Meaning |
---|---|---|
? |
. |
Any single character |
* |
.* |
Any number of any character |
[] and [^] |
same! | Match/negate match of character class |
Furthermore, recall that:
Wildcards match file names; matches are expanded directly by the shell.
Regular expressions match any input for the command.
grep
partial & word matchinggrep
always allows partial matching (cf. file globbing):$ grep "Olfr" "$mus"_genes.txt | head#> ENSMUSG00000067064 Olfr1416#> ENSMUSG00000057464 Olfr1415#> ...
grep
partial & word matchinggrep
always allows partial matching (cf. file globbing):$ grep "Olfr" "$mus"_genes.txt | head#> ENSMUSG00000067064 Olfr1416#> ENSMUSG00000057464 Olfr1415#> ...
-w
to match only entire "words" –
consecutive alphanumeric characters and "_":
$ cat example.txt$ grep "bioinfo" example.txt#> bioinfo#> bioinformatics$ grep -w "bioinfo" example.txt#> bioinfo
grep
: Search for multiple patternsMatch two different records using a character class:
$ grep "Olfr141[13]" "$mus"_genes.txt#> ENSMUSG00000058904 Olfr1413#> ENSMUSG00000062497 Olfr1411
More flexible – alternation with |
:
$ grep -E "(Olfr1411|Olfr1413)" "$mus"_genes.txt#> ENSMUSG00000058904 Olfr1413#> ENSMUSG00000062497 Olfr1411
(Note that we need -E
for extended regex for the |
to work.)
grep
: Inverting matches-v
to invert matches:
# All strings (gene names) with Olfr except Olfr1413:$ grep "Olfr" "$mus"_genes.txt | grep -v "Olfr1413"# Don't print lines beginning with a "#":$ grep -v "^#" "$mus".gtf# Don't print empty lines:$ grep -v "^$" "$mus".gtf
grep
: Counting matchesCount matching lines using -c
:
# Count snRNAs:$ grep -c 'gene_biotype "snRNA"' "$mus".gtf# Count gene names starting with "Olfr":$ grep -Pc "\tOlfr" "$mus"_genes.txt
In the second example, we use the -P
flag to turn on "Perl-like regular
expressions – yes, yet another type of regex...!
This is needed to successfully match a tab character with \t
.
grep
: Counting matchesCount matching lines using -c
:
# Count snRNAs:$ grep -c 'gene_biotype "snRNA"' "$mus".gtf# Count gene names starting with "Olfr":$ grep -Pc "\tOlfr" "$mus"_genes.txt
In the second example, we use the -P
flag to turn on "Perl-like regular
expressions – yes, yet another type of regex...!
This is needed to successfully match a tab character with \t
.
The example in Buffalo (p. 144) does not use -P
but this fails!
grep
: Print lines surrounding matchesPrint lines surrounding matches using:
-A n
– print n
lines after the match-B n
– print n
lines before the match-C n
– print n
lines before and after the match$ grep -B 1 -A 2 "AGATCGG" contam.fastq | head -n 4
grep
: Print lines surrounding matchesPrint lines surrounding matches using:
-A n
– print n
lines after the match-B n
– print n
lines before the match-C n
– print n
lines before and after the match$ grep -B 1 -A 2 "AGATCGG" contam.fastq | head -n 4
When only using -A
or -B
, records are separated by --
lines.
To avoid this:
$ grep -B 1 --no-group-separator "string" file.txt
grep
: Only print the match itselfPrint only the match itself and not full lines with -o
:
Get all gene names starting with "Olfr":
$ grep "Olfr.*" "$mus"_genes.txt | head -n 2#> ENSMUSG00000067064 Olfr1416#> ENSMUSG00000057464 Olfr1415$ grep -o "Olfr.*" "$mus"_genes.txt | head -n 2#> Olfr1416#> Olfr1415
grep
: Only print the match itself (cont.)Capture the quoted word following the "gene_id" column:
$ grep -E 'gene_id "\w+"' "$mus".gtf | head -n 2#> 1 pseudogene gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";#> 1 unprocessed_pseudogene transcript 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; transcript_id "ENSMUST00000160944"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; transcript_name "Gm16088-001"; transcript_source "havana"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";$ grep -E -o 'gene_id "\w+"' "$mus".gtf | head -n 2#> gene_id "ENSMUSG00000090025"#> gene_id "ENSMUSG00000090025"
grep
: Only print the match itself (cont.)Capture the quoted word following the "gene_id" column:
$ grep -E 'gene_id "\w+"' "$mus".gtf | head -n 2#> 1 pseudogene gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";#> 1 unprocessed_pseudogene transcript 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; transcript_id "ENSMUST00000160944"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; transcript_name "Gm16088-001"; transcript_source "havana"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";$ grep -E -o 'gene_id "\w+"' "$mus".gtf | head -n 2#> gene_id "ENSMUSG00000090025"#> gene_id "ENSMUSG00000090025"
Use a nice little pipeline to get a cleaned list of gene names,
building on the previous command:
$ grep -E -o 'gene_id "\w+"' "$mus".gtf | \ cut -f2 -d" " | \ sed 's/"//g' | \ sort | \ uniq > mm_gene_id.txt#> ENSMUSG00000000544#> ENSMUSG00000000817#> ...
grep
Count the number of genomic features in Mus_musculus.GRCm38.75_chr1.gtf
with a single grep
command.
(Hint: this is the number of lines excluding the header.)
Bonus: Count the number of "exon" features belonging to a "lincRNA"
by grep
ping for both of these strings.
(You can either pipe together two grep commands,
or use a single grep
command –
"lincRNA" will occur before "exon" on a line.)
grep
Count the number of genomic features in Mus_musculus.GRCm38.75_chr1.gtf
with a single grep
command.
$ grep -cv "^#" $mus.gtf#> 81226
Bonus: Count the number of "exon" features belonging to a "lincRNA"
by grep
ping for both of these strings.
$ grep "lincRNA" $mus.gtf | grep -c "exon"#> 408$ grep -c "lincRNA.*exon" $mus.gtf#> 408
sort | uniq -c
Using uniq -c
on sorted data, we get a count for each unique occurrence:
$ cat letters.txt#> A#> A#> B#> C#> B#> C#> C#> C$ sort letters.txt | uniq -c#> 2 A#> 2 B#> 4 C
sort | uniq -c
(cont.)The sort | uniq -c
combination is very useful –
e.g. the following will print counts for each type of annotated element
("feature") in a genome:
$ grep -v "^#" "$mus".gtf | cut -f3 | sort | uniq -c#> 25901 CDS#> 36128 exon#> 2027 gene#> ...
Next, we can sort these counts in order from most frequent to least:
$ grep -v "^#" "$mus".gtf | cut -f3 | sort | uniq -c | sort -rn#> 36128 exon#> 25901 CDS#> 7588 UTR
sort | uniq -c
(cont.)$ grep -v "^#" "$mus".gtf | cut -f3,7 | sort | uniq -c#> 12891 CDS +#> 13010 CDS -#> 18134 exon +#> 17994 exon -
sort | uniq -c
(cont.)$ grep -v "^#" "$mus".gtf | cut -f3,7 | sort | uniq -c#> 12891 CDS +#> 13010 CDS -#> 18134 exon +#> 17994 exon -
$ grep "ENSMUSG00000033793" "$mus".gtf | \ cut -f3 | sort | uniq -c#> 13 CDS#> 14 exon#> 1 gene
sort | uniq -c
(cont.)$ grep -v "^#" "$mus".gtf | cut -f3,7 | sort | uniq -c#> 12891 CDS +#> 13010 CDS -#> 18134 exon +#> 17994 exon -
$ grep "ENSMUSG00000033793" "$mus".gtf | \ cut -f3 | sort | uniq -c#> 13 CDS#> 14 exon#> 1 gene
Another useful option is -d
to check for duplicate lines:
$ uniq -d mm_gene_names.txt | wc -l # 0 if no duplicates
sort
-k
to select fields, n
to turn on numeric sorting,
and -r
for reverse sorting:
# Sort a BED file by chromosome and (reversed) start position:$ sort -k1,1 -k2,2nr example.bed
-V
to recognize numbers within strings, and sort accordingly:
$ sort -k1,1 -k2,2n example2.bed#> chr1#> chr10#> chr2$ sort -k1,1V -k2,2n example2.bed > example_sorted.bed#> chr1#> chr2
sort
-k
to select fields, n
to turn on numeric sorting,
and -r
for reverse sorting:
# Sort a BED file by chromosome and (reversed) start position:$ sort -k1,1 -k2,2nr example.bed
-V
to recognize numbers within strings, and sort accordingly:
$ sort -k1,1 -k2,2n example2.bed#> chr1#> chr10#> chr2$ sort -k1,1V -k2,2n example2.bed > example_sorted.bed#> chr1#> chr2
The -g
flag will properly sort scientific number notation like 10e-2.
column
for tabular file viewing (and cut
)$ grep -v "^#" "$mus".gtf | cut -f 1-8 | head -n3
column
for tabular file viewing (and cut
)$ grep -v "^#" "$mus".gtf | cut -f 1-8 | head -n3
column
, we can make this look better:$ grep -v "^#" "$mus".gtf | cut -f 1-8 | column -t | head -n 3
column
for tabular file viewing (and cut
)$ grep -v "^#" "$mus".gtf | cut -f 1-8 | head -n3
column
, we can make this look better:$ grep -v "^#" "$mus".gtf | cut -f 1-8 | column -t | head -n 3
column
can make even more of a difference for CSV files:$ column -s "," -t "$mus"_bed.csv | head -n 3
column
for tabular file viewing (and cut
)$ grep -v "^#" "$mus".gtf | cut -f 1-8 | head -n3
column
, we can make this look better:$ grep -v "^#" "$mus".gtf | cut -f 1-8 | column -t | head -n 3
column
can make even more of a difference for CSV files:$ column -s "," -t "$mus"_bed.csv | head -n 3
From Buffalo Chapter 7:
column
illustrates an important point about how we should treat data: there’s no reason to make data formats attractive at the expense of readable by programs.
join
join
can merge files that share a certain column.
Say we need to add a column with chromosome lengths to a BED file:
$ cat example.bed$ cat example_lengths.txt
join
will only work for sorted data, so we sort / check sortedness:
$ sort -k1,1 example.bed > example_sorted.bed$ sort -c -k1,1 example_lengths.txt # verifies is already sorted
Now, we can join the files:
$ join -1 1 -2 1 example_sorted.bed example_lengths.txt
join
join
can merge files that share a certain column.
Say we need to add a column with chromosome lengths to a BED file:
$ cat example.bed$ cat example_lengths.txt
join
will only work for sorted data, so we sort / check sortedness:
$ sort -k1,1 example.bed > example_sorted.bed$ sort -c -k1,1 example_lengths.txt # verifies is already sorted
Now, we can join the files:
$ join -1 1 -2 1 example_sorted.bed example_lengths.txt
This type of join is an "inner join": only rows found in both files are
returned. To return rows found in only in one of the two files,
use the -a
option.
-c
to check if a file is sorted — we get a message when the file is
not sorted, but no output when the file is sorted:
$ sort -k1,1 -k2,2n -c example.bed#> sort: example.bed:4: disorder: chr1 40 49$ sort -k1,1 -k2,2n -c example_sorted.bed$
If we need to check whether a file is sorted in a script before sorting it,
we can make use of the exit status of the command:
0
= success1
= fail (As well as other non-zero numbers)$ sort -k1,1V -k2,2n -c example_sorted.bed$ echo $?
You can make use of this as follows:
$ if ! sort -k1,1V -k2,2n -c example.bed; then> echo "File is unsorted - sorting now..."> sort -k1,1V -k2,2n example.bed > example_sorted2.bed> fi
This construct can also be used with grep
,
which will have exit status 0
(success) if it found a match:
$ if grep "AGATCGG" contimated.fasta > /dev/null; then> echo "OH NO! File is contaminated!"> exit 1> fi
An additional trick we used here is to redirect the standard
output to /dev/null
,
which won't write anything and will simply avoid the output being printed.
In week 1 (and 2), we learned the basics of the Unix shell.
This week, we'll focus on commands to inspect and process data:
We'll revisit some commands from CSB Chapter 1
(sort
, uniq
, grep
, ...) and learn about using regular expressions
in the shell.
We will also learn several new commands including the two very powerful
commands sed
and awk
(Thursday).
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |