+ - 0:00:00
Notes for current slide
Notes for next slide

Week 4: Unix Data Tools


Part I:
Miscellaneous data tools






Jelmer Poelstra

2021/02/02

1 / 41

Context for this week

In week 1 (and 2), we learned the basics of the Unix shell.


This week, we'll focus on commands to inspect and process data:

  • We'll revisit some commands from CSB Chapter 1 (sort, uniq, grep, ...) and learn about using regular expressions in the shell.

  • We will also learn several new commands including the two very powerful commands sed and awk (Thursday).

2 / 41

Get the files for Buffalo Chapter 7

  • Start VS Code in OSC OnDemand => open your workspace => open a terminal => type bash to break out of the Singularity shell.

  • You can either enter today's commands directly in the terminal, or run them from a new .sh file or a .md file with code blocks.
    If you want to do the latter: create and open this file.

  • Then:

    # You should be in your dir already, otherwise:
    # $ cd /fs/ess/PAS1855/users/$USER
    # Download the repository for the Buffalo book:
    $ git clone https://github.com/vsbuffalo/bds-files.git
    # Move into the dir for this chapter:
    $ cd bds-files/chapter-07-unix-data-tools
    # Take a look at what's there:
    $ ls
3 / 41

Very quick intro to the sequence file formats
used in this chapter

  • FASTA (.fasta or .fa)

  • FASTQ (.fastq or .fq)

  • BED (.bed)

  • GTF (.gtf)


If you want to learn more about these and other formats,
see Buffalo ch. 10 & 11.

5 / 41

FASTA files (.fasta)

  • Can have one or more sequences of any length.

  • For every sequence:

    • The first line has a free form sequence identifier starting with >.

    • The second line has the sequence (sometimes spread across multiple lines...).

>unique_sequence_ID Optional description (free form!)
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
>unique_sequence_ID2
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG
6 / 41

FASTQ files (.fastq)

  • Basically a FASTA file with quality scores, but also more formalized. Represents individual reads from a sequencer.
7 / 41

FASTQ files (.fastq)

  • Basically a FASTA file with quality scores, but also more formalized. Represents individual reads from a sequencer.

  • Each sequence covers exactly 4 lines:

Line Description
1 Sequence header: begins with @, then information about the read
2 DNA sequence
3 + separator
4 Quality scores: a single character for each base,
1-on-1 correspondence with sequences on line 2.
@DJB775P1:248:D0MDGACXX:7:1202:12362:49613
CACTGCTTGCTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTACTCTGCGTT
+
CCCFFFFFHHHHHGHGFHIJJ9HHIIJJJJJIJJJJIIJJJIIJIJJJJJJJJJIIJJJJJJHI
@DJB775P1:248:D0MDGACXX:7:1202:12782:49716
TTACTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACCACTGCTTACTCTGCGTTGATACC
+
;@@FDFFFH0@DDGHIIGHIIIHIIIIIIIIIIIIIIIGIIIIIIIIIIIIIIIIIIIIIIIII
7 / 41

BED files (.bed)

Often just contains genomic coordinates in 3 columns:

  1. Chromosome / scaffold

  2. Start position

  3. End position

chr1 26 39
chr1 32 47
chr3 11 28
chr1 40 49
8 / 41

GTF files (.gtf)

While genome sequences are stored in FASTA files, annotation information is stored in files like GTF.

  • Like a BED file, these contain coordinates for genomic features genes, transcripts, exons, etc., one per row.

  • But they also contain lots of additional information for each feature:

# chrom biotype* feature start end score strand frame attribute
1 snRNA gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA";
1 snRNA transcript 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcript_id "ENSMUST00000082908"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl";
1 snRNA exon 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcript_id "ENSMUST00000082908"; exon_number "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl"; exon_id "ENSMUSE00000522066";


9 / 41

GTF files (.gtf)

While genome sequences are stored in FASTA files, annotation information is stored in files like GTF.

  • Like a BED file, these contain coordinates for genomic features genes, transcripts, exons, etc., one per row.

  • But they also contain lots of additional information for each feature:

# chrom biotype* feature start end score strand frame attribute
1 snRNA gene 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA";
1 snRNA transcript 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcript_id "ENSMUST00000082908"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl";
1 snRNA exon 3102016 3102125 . + . gene_id "ENSMUSG00000064842"; transcript_id "ENSMUST00000082908"; exon_number "1"; gene_name "Gm26206"; gene_source "ensembl"; gene_biotype "snRNA"; transcript_name "Gm26206-201"; transcript_source "ensembl"; exon_id "ENSMUSE00000522066";


Note that GFF (.gff) files are more common nowadays and very similar.

9 / 41

Viewing data files







10 / 41

head and tail again

$ mus=Mus_musculus.GRCm38.75_chr1 # Save for quick recall
  • head / tail will simply show the first / last few lines of a file.
    $ head "$mus".bed # First/last 10 lines
    $ tail -n 7 "$mus".gtf # Specify with -n
11 / 41

head and tail again

$ mus=Mus_musculus.GRCm38.75_chr1 # Save for quick recall
  • head / tail will simply show the first / last few lines of a file.
    $ head "$mus".bed # First/last 10 lines
    $ tail -n 7 "$mus".gtf # Specify with -n
  • Print a specific line number:
    $ head -n 1866 "$mus".bed | tail -n 1 # Prints line nr. 1866
11 / 41

head and tail again

$ mus=Mus_musculus.GRCm38.75_chr1 # Save for quick recall
  • head / tail will simply show the first / last few lines of a file.
    $ head "$mus".bed # First/last 10 lines
    $ tail -n 7 "$mus".gtf # Specify with -n
  • Print a specific line number:
    $ head -n 1866 "$mus".bed | tail -n 1 # Prints line nr. 1866
  • Also useful to see if a pipeline is doing what it should:

    $ grep "string" huge_file.txt | program1 | program2 | head -n 5
11 / 41

The less pager

  • less doesn't load entire files into memory: easy to look at large files.
    $ less "$mus".gtf
12 / 41

The less pager

  • less doesn't load entire files into memory: easy to look at large files.
    $ less "$mus".gtf
  • You'll be inside the pager, and your prompt is gone. Keyboard shortcuts:
Key Function
q Exit less
space / b Next / previous page.
(pgup / pgdn usually also work.)
d / u Go down / up half a page.
g / G Go to the first / last line (home / end also work)
/<pattern>
?<pattern>
Search forward/backward: next type keyword to search for
n / N Go to next/previous search match
12 / 41

Your turn: less

  1. Open a FASTQ file with less:

    $ less contaminated.fastq
  2. Jump back to the last line and then back to the first line.

  3. Now search for the following pattern: AGATCGG.
    Move to the next match and the next.

  4. Exit.

13 / 41

Regular expressions







14 / 41

Regular expressions ("regex")

  • Regular expressions are character sequences defining a search pattern, with symbols with a special, non-literal meaning (More in Week 11!).

  • Some symbols mentioned in this chapter:

Symbol Matches Example
. Any character grep -o "Olfr.*"                            
* Preceding character
any # times
grep -o "Olfr.*"
15 / 41

Regular expressions ("regex")

  • Regular expressions are character sequences defining a search pattern, with symbols with a special, non-literal meaning (More in Week 11!).

  • Some symbols mentioned in this chapter:

Symbol Matches Example
. Any character grep -o "Olfr.*"
* Preceding character
any # times
grep -o "Olfr.*"
\t tab echo -e "column1 \t column2"
\n newline echo -e "Line1 \n Line2"
16 / 41

Regular expressions ("regex")

  • Regular expressions are character sequences defining a search pattern, with symbols with a special, non-literal meaning (More in Week 11!).

  • Some symbols mentioned in this chapter:

Symbol Matches Example
. Any character grep -o "Olfr.*"
* Preceding character
any # times
grep -o "Olfr.*"
\t tab echo -e "column1 \t column2"
\n newline echo -e "Line1 \n Line2"
^ / $ beginning/end
of line
grep -v "^$"
\w any alphanumeric
character and "_"
grep -E -o 'gene_id "\w+"'
| logical or grep -E "intron|exon"
grep "intron\|exon"
17 / 41

"Basic" versus "Extended" Regular Expressions

  • By default, grep and sed use Basic Regular Expressions (BRE) —
    to turn on Extended Regular Expressions (ERE), use grep -E or sed -E. awk uses ERE by defaults.

Where this gets particularly confusing when googling around, is that there are both "POSIX" and "GNU" versions of each. We are using GNU tools, so the GNU regex are relevant to us.

18 / 41

"Basic" versus "Extended" Regular Expressions

  • By default, grep and sed use Basic Regular Expressions (BRE) —
    to turn on Extended Regular Expressions (ERE), use grep -E or sed -E. awk uses ERE by defaults.

Where this gets particularly confusing when googling around, is that there are both "POSIX" and "GNU" versions of each. We are using GNU tools, so the GNU regex are relevant to us.

  • Differences between BRE and ERE for GNU tools:
ERE BRE Meaning
? \? Matches preceding character at most once
+ \+ Matches preceding character at least once
{m,n} \{m,n\} Matches preceding character m to n times
| \| Logical or
(ab)\1 \(ab \)\1 Backreference capture with () and
recall with \1; matches "abab"
18 / 41

Regular expressions versus shell wildcards

Wildcard Regex symbol(s) Meaning
? . Any single character
* .* Any number of any character
[] and [^] same! Match/negate match of character class


Furthermore, recall that:

  • Wildcards match file names; matches are expanded directly by the shell.

  • Regular expressions match any input for the command.

19 / 41

grep







20 / 41

grep partial & word matching

  • grep always allows partial matching (cf. file globbing):
    $ grep "Olfr" "$mus"_genes.txt | head
    #> ENSMUSG00000067064 Olfr1416
    #> ENSMUSG00000057464 Olfr1415
    #> ...
21 / 41

grep partial & word matching

  • grep always allows partial matching (cf. file globbing):
    $ grep "Olfr" "$mus"_genes.txt | head
    #> ENSMUSG00000067064 Olfr1416
    #> ENSMUSG00000057464 Olfr1415
    #> ...


  • -w to match only entire "words" –
    consecutive alphanumeric characters and "_":

    $ cat example.txt
    $ grep "bioinfo" example.txt
    #> bioinfo
    #> bioinformatics
    $ grep -w "bioinfo" example.txt
    #> bioinfo
21 / 41

grep: Search for multiple patterns

  • Match two different records using a character class:

    $ grep "Olfr141[13]" "$mus"_genes.txt
    #> ENSMUSG00000058904 Olfr1413
    #> ENSMUSG00000062497 Olfr1411
  • More flexible – alternation with |:

    $ grep -E "(Olfr1411|Olfr1413)" "$mus"_genes.txt
    #> ENSMUSG00000058904 Olfr1413
    #> ENSMUSG00000062497 Olfr1411

    (Note that we need -E for extended regex for the | to work.)

22 / 41

grep: Inverting matches

  • -v to invert matches:

    # All strings (gene names) with Olfr except Olfr1413:
    $ grep "Olfr" "$mus"_genes.txt | grep -v "Olfr1413"
    # Don't print lines beginning with a "#":
    $ grep -v "^#" "$mus".gtf
    # Don't print empty lines:
    $ grep -v "^$" "$mus".gtf
23 / 41

grep: Counting matches

  • Count matching lines using -c:

    # Count snRNAs:
    $ grep -c 'gene_biotype "snRNA"' "$mus".gtf
    # Count gene names starting with "Olfr":
    $ grep -Pc "\tOlfr" "$mus"_genes.txt


In the second example, we use the -P flag to turn on "Perl-like regular expressions – yes, yet another type of regex...!
This is needed to successfully match a tab character with \t.

24 / 41

grep: Counting matches

  • Count matching lines using -c:

    # Count snRNAs:
    $ grep -c 'gene_biotype "snRNA"' "$mus".gtf
    # Count gene names starting with "Olfr":
    $ grep -Pc "\tOlfr" "$mus"_genes.txt


In the second example, we use the -P flag to turn on "Perl-like regular expressions – yes, yet another type of regex...!
This is needed to successfully match a tab character with \t.

The example in Buffalo (p. 144) does not use -P but this fails!

24 / 41

grep: Print lines surrounding matches

  • Print lines surrounding matches using:

    • -A n – print n lines after the match
    • -B n – print n lines before the match
    • -C n – print n lines before and after the match
  • Use this to print the full 4-line FASTQ entry when the sequence matches:
    $ grep -B 1 -A 2 "AGATCGG" contam.fastq | head -n 4
25 / 41

grep: Print lines surrounding matches

  • Print lines surrounding matches using:

    • -A n – print n lines after the match
    • -B n – print n lines before the match
    • -C n – print n lines before and after the match
  • Use this to print the full 4-line FASTQ entry when the sequence matches:
    $ grep -B 1 -A 2 "AGATCGG" contam.fastq | head -n 4


When only using -A or -B, records are separated by -- lines.
To avoid this:

$ grep -B 1 --no-group-separator "string" file.txt
25 / 41

grep: Only print the match itself

Print only the match itself and not full lines with -o:

  • Get all gene names starting with "Olfr":

    $ grep "Olfr.*" "$mus"_genes.txt | head -n 2
    #> ENSMUSG00000067064 Olfr1416
    #> ENSMUSG00000057464 Olfr1415
    $ grep -o "Olfr.*" "$mus"_genes.txt | head -n 2
    #> Olfr1416
    #> Olfr1415
26 / 41

grep: Only print the match itself (cont.)

  • Capture the quoted word following the "gene_id" column:

    $ grep -E 'gene_id "\w+"' "$mus".gtf | head -n 2
    #> 1 pseudogene gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";
    #> 1 unprocessed_pseudogene transcript 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; transcript_id "ENSMUST00000160944"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; transcript_name "Gm16088-001"; transcript_source "havana"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
    $ grep -E -o 'gene_id "\w+"' "$mus".gtf | head -n 2
    #> gene_id "ENSMUSG00000090025"
    #> gene_id "ENSMUSG00000090025"
27 / 41

grep: Only print the match itself (cont.)

  • Capture the quoted word following the "gene_id" column:

    $ grep -E 'gene_id "\w+"' "$mus".gtf | head -n 2
    #> 1 pseudogene gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene";
    #> 1 unprocessed_pseudogene transcript 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; transcript_id "ENSMUST00000160944"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; transcript_name "Gm16088-001"; transcript_source "havana"; tag "cds_end_NF"; tag "cds_start_NF"; tag "mRNA_end_NF"; tag "mRNA_start_NF";
    $ grep -E -o 'gene_id "\w+"' "$mus".gtf | head -n 2
    #> gene_id "ENSMUSG00000090025"
    #> gene_id "ENSMUSG00000090025"
  • Use a nice little pipeline to get a cleaned list of gene names,
    building on the previous command:

    $ grep -E -o 'gene_id "\w+"' "$mus".gtf | \
    cut -f2 -d" " | \
    sed 's/"//g' | \
    sort | \
    uniq > mm_gene_id.txt
    #> ENSMUSG00000000544
    #> ENSMUSG00000000817
    #> ...
27 / 41

Your turn: grep

  1. Count the number of genomic features in Mus_musculus.GRCm38.75_chr1.gtf with a single grep command.

    (Hint: this is the number of lines excluding the header.)

  2. Bonus: Count the number of "exon" features belonging to a "lincRNA" by grepping for both of these strings.

    (You can either pipe together two grep commands,
    or use a single grep command – "lincRNA" will occur before "exon" on a line.)

28 / 41

Solutions: grep

  1. Count the number of genomic features in Mus_musculus.GRCm38.75_chr1.gtf with a single grep command.

    $ grep -cv "^#" $mus.gtf
    #> 81226
  2. Bonus: Count the number of "exon" features belonging to a "lincRNA" by grepping for both of these strings.

    $ grep "lincRNA" $mus.gtf | grep -c "exon"
    #> 408
    $ grep -c "lincRNA.*exon" $mus.gtf
    #> 408
29 / 41

sort and sort | uniq -c







30 / 41

Count tables with sort | uniq -c

  • Using uniq -c on sorted data, we get a count for each unique occurrence:

    $ cat letters.txt
    #> A
    #> A
    #> B
    #> C
    #> B
    #> C
    #> C
    #> C
    $ sort letters.txt | uniq -c
    #> 2 A
    #> 2 B
    #> 4 C
31 / 41

Count tables with sort | uniq -c (cont.)

  • The sort | uniq -c combination is very useful – e.g. the following will print counts for each type of annotated element ("feature") in a genome:

    $ grep -v "^#" "$mus".gtf | cut -f3 | sort | uniq -c
    #> 25901 CDS
    #> 36128 exon
    #> 2027 gene
    #> ...
  • Next, we can sort these counts in order from most frequent to least:

    $ grep -v "^#" "$mus".gtf | cut -f3 | sort | uniq -c | sort -rn
    #> 36128 exon
    #> 25901 CDS
    #> 7588 UTR
32 / 41

Count tables with sort | uniq -c (cont.)

  • Or count combinations across columns – here, "features" by "strand":
    $ grep -v "^#" "$mus".gtf | cut -f3,7 | sort | uniq -c
    #> 12891 CDS +
    #> 13010 CDS -
    #> 18134 exon +
    #> 17994 exon -
33 / 41

Count tables with sort | uniq -c (cont.)

  • Or count combinations across columns – here, "features" by "strand":
    $ grep -v "^#" "$mus".gtf | cut -f3,7 | sort | uniq -c
    #> 12891 CDS +
    #> 13010 CDS -
    #> 18134 exon +
    #> 17994 exon -
  • Final example: count numbers of different features for a particular gene:
    $ grep "ENSMUSG00000033793" "$mus".gtf | \
    cut -f3 | sort | uniq -c
    #> 13 CDS
    #> 14 exon
    #> 1 gene
33 / 41

Count tables with sort | uniq -c (cont.)

  • Or count combinations across columns – here, "features" by "strand":
    $ grep -v "^#" "$mus".gtf | cut -f3,7 | sort | uniq -c
    #> 12891 CDS +
    #> 13010 CDS -
    #> 18134 exon +
    #> 17994 exon -
  • Final example: count numbers of different features for a particular gene:
    $ grep "ENSMUSG00000033793" "$mus".gtf | \
    cut -f3 | sort | uniq -c
    #> 13 CDS
    #> 14 exon
    #> 1 gene

Another useful option is -d to check for duplicate lines:

$ uniq -d mm_gene_names.txt | wc -l # 0 if no duplicates
33 / 41

Miscellaneous sort

  • -k to select fields, n to turn on numeric sorting, and -r for reverse sorting:

    # Sort a BED file by chromosome and (reversed) start position:
    $ sort -k1,1 -k2,2nr example.bed
  • -V to recognize numbers within strings, and sort accordingly:

    $ sort -k1,1 -k2,2n example2.bed
    #> chr1
    #> chr10
    #> chr2
    $ sort -k1,1V -k2,2n example2.bed > example_sorted.bed
    #> chr1
    #> chr2
34 / 41

Miscellaneous sort

  • -k to select fields, n to turn on numeric sorting, and -r for reverse sorting:

    # Sort a BED file by chromosome and (reversed) start position:
    $ sort -k1,1 -k2,2nr example.bed
  • -V to recognize numbers within strings, and sort accordingly:

    $ sort -k1,1 -k2,2n example2.bed
    #> chr1
    #> chr10
    #> chr2
    $ sort -k1,1V -k2,2n example2.bed > example_sorted.bed
    #> chr1
    #> chr2

The -g flag will properly sort scientific number notation like 10e-2.

34 / 41

Questions?






35 / 41

column for tabular file viewing (and cut)

  • Tab-delimited files can look messy in the terminal or text editors:
    $ grep -v "^#" "$mus".gtf | cut -f 1-8 | head -n3
37 / 41

column for tabular file viewing (and cut)

  • Tab-delimited files can look messy in the terminal or text editors:
    $ grep -v "^#" "$mus".gtf | cut -f 1-8 | head -n3
  • With column, we can make this look better:
    $ grep -v "^#" "$mus".gtf | cut -f 1-8 | column -t | head -n 3
37 / 41

column for tabular file viewing (and cut)

  • Tab-delimited files can look messy in the terminal or text editors:
    $ grep -v "^#" "$mus".gtf | cut -f 1-8 | head -n3
  • With column, we can make this look better:
    $ grep -v "^#" "$mus".gtf | cut -f 1-8 | column -t | head -n 3
  • column can make even more of a difference for CSV files:
    $ column -s "," -t "$mus"_bed.csv | head -n 3
37 / 41

column for tabular file viewing (and cut)

  • Tab-delimited files can look messy in the terminal or text editors:
    $ grep -v "^#" "$mus".gtf | cut -f 1-8 | head -n3
  • With column, we can make this look better:
    $ grep -v "^#" "$mus".gtf | cut -f 1-8 | column -t | head -n 3
  • column can make even more of a difference for CSV files:
    $ column -s "," -t "$mus"_bed.csv | head -n 3


From Buffalo Chapter 7:

column illustrates an important point about how we should treat data: there’s no reason to make data formats attractive at the expense of readable by programs.

37 / 41

join

  • join can merge files that share a certain column.
    Say we need to add a column with chromosome lengths to a BED file:

    $ cat example.bed
    $ cat example_lengths.txt
  • join will only work for sorted data, so we sort / check sortedness:

    $ sort -k1,1 example.bed > example_sorted.bed
    $ sort -c -k1,1 example_lengths.txt # verifies is already sorted
  • Now, we can join the files:

    $ join -1 1 -2 1 example_sorted.bed example_lengths.txt
38 / 41

join

  • join can merge files that share a certain column.
    Say we need to add a column with chromosome lengths to a BED file:

    $ cat example.bed
    $ cat example_lengths.txt
  • join will only work for sorted data, so we sort / check sortedness:

    $ sort -k1,1 example.bed > example_sorted.bed
    $ sort -c -k1,1 example_lengths.txt # verifies is already sorted
  • Now, we can join the files:

    $ join -1 1 -2 1 example_sorted.bed example_lengths.txt

This type of join is an "inner join": only rows found in both files are returned. To return rows found in only in one of the two files, use the -a option.

38 / 41

Checking whether a file is sorted

  • -c to check if a file is sorted — we get a message when the file is not sorted, but no output when the file is sorted:

    $ sort -k1,1 -k2,2n -c example.bed
    #> sort: example.bed:4: disorder: chr1 40 49
    $ sort -k1,1 -k2,2n -c example_sorted.bed
    $
39 / 41

Exit status of a command

  • If we need to check whether a file is sorted in a script before sorting it,
    we can make use of the exit status of the command:

    • 0 = success
    • 1 = fail (As well as other non-zero numbers)
    $ sort -k1,1V -k2,2n -c example_sorted.bed
    $ echo $?
  • You can make use of this as follows:

    $ if ! sort -k1,1V -k2,2n -c example.bed; then
    > echo "File is unsorted - sorting now..."
    > sort -k1,1V -k2,2n example.bed > example_sorted2.bed
    > fi
40 / 41

Exit status of a command

This construct can also be used with grep, which will have exit status 0 (success) if it found a match:

$ if grep "AGATCGG" contimated.fasta > /dev/null; then
> echo "OH NO! File is contaminated!"
> exit 1
> fi


An additional trick we used here is to redirect the standard output to /dev/null, which won't write anything and will simply avoid the output being printed.

41 / 41

Context for this week

In week 1 (and 2), we learned the basics of the Unix shell.


This week, we'll focus on commands to inspect and process data:

  • We'll revisit some commands from CSB Chapter 1 (sort, uniq, grep, ...) and learn about using regular expressions in the shell.

  • We will also learn several new commands including the two very powerful commands sed and awk (Thursday).

2 / 41
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow