class:inverse middle center ## *Week 4: Unix Data Tools* ---- # Part II: <br> awk and sed <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/02/04 --- ## Setting up - Start VS Code in OSC OnDemand => open your workspace => open a terminal => type `bash` to break out of the Singularity shell. - Move into the Chapter 7 directory in your copy of the Buffalo Git repository, likely at: ```sh $ cd /fs/ess/PAS1855/users/$USER/bds-files/chapter-07-unix-data-tools # Lost the repo? Go to your directory and run: # git clone https://github.com/vsbuffalo/bds-files.git ``` --- class: inverse middle center name: awk # awk ---- <br> <br> <br> <br> <br> --- ## `awk` - `awk` is a small programming language in itself! - `awk` is great for *quick file processing*, especially on tabular data. It is not (necessarily) the best choice for complicated tasks – perhaps best used as a very powerful tool rather than a stand-alone programming language. <br> -- To understand `awk`, we should cover its two core ideas / concepts: - **Record processing** - **Pattern-action pairs** --- ## `awk` record processing: Records and fields - **Records**: - `awk` processes records one at a time, like a `for` loop going line-by-line. - By default, **each line is one record**. - The entire record is assigned to `$0`. ```sh $ awk '{ print $0 }' my.txt # $0: full line ``` <br> -- - **Fields**: - Within a line, **each column is a _field_**. - Columns are automatically assigned to `$1` (first column), `$2` (second column), `$3` (third column), etc. ```sh $ awk '{ print $3 "\t" $2 }' my.txt # $3 & $2: column 3 and 2 ``` --- ## `awk` record processing: Pattern-action pairs - **Pattern**: an expression or regex pattern – a *condition* to be tested. - **Action**: If the pattern evaluates to true, the action is performed. .content-box-blue[ General syntax: **`awk 'pattern { action }' file`** (Note also that the entire `awk` command is placed *between single quotes*.) ] -- <br> - Only one of the two (pattern and action) is *required* in any `awk` command: - Omit the pattern: *every record* is subjected to the action. ```sh $ awk '{ print $1 }' my.txt # Pattern omitted ``` - Omit the action: *print full records* that match the pattern. ```sh $ awk '$3 < 10' my.txt # Action omitted ``` --- ## Simple `awk` examples: action only **In these first examples, we omit a pattern – therefore, every record (lines) is subjected to the action we specify.** - If our action, then, is to print the entire record, we mimic `cat` — each line will be printed in full: ```sh $ awk '{ print $0 }' example.bed ``` --- ## Simple `awk` examples: action only (cont.) - We can also mimic `cut` – and here we go beyond `cut` to *reorder* columns: ```sh $ awk '{ print $3 "\t" $2 }' example.bed #> 39 26 #> 47 32 #> 28 11 ``` -- - Above, we explicitly printed a tab (`\t`) between the columns. What happens when we omit the `\t`? ```sh $ awk '{ print $3 $2 }' example.bed # Columns concatenated... #> 3926 #> 4732 #> 2811 ``` - We can use a comma `,` to invoke the default separator (a space): ```sh $ awk '{ print $3, $2 }' example.bed # Default sep: " " #> 39 26 #> 47 32 #> 28 11 ``` --- ## Simple `awk` examples: pattern only **In the next examples, we use patterns and omit the action: lines matching the pattern (condition) are printed in full.** - Print only lines where column 2 (start of BED feature) is larger than 30: ```sh $ awk '$2 > 30' example.bed ``` - Calculate the length of each BED feature (column 3 - column 2), and only print lines with a feature of at least 18 bp: ```sh $ awk '$3 - $2 > 18' example.bed ``` --- ## Simple `awk` examples: pattern only (cont.) - Print only lines from "chr1" – do an exact match for the first column: ```sh $ awk '$1 == "chr1"' example.bed # Quote the string ``` - Regular expression patterns are placed between forward slashes: ```sh $ awk '/chr1/' example.bed # Looks for match anywhere on line $ awk '$1 ~ /chr1/' example.bed # Looks for a match in column 1 ``` --- ## `awk` comparison and logical operators | Comparison | Description |-------------|-------------| | `a == b` | `a` is equal to `b` | `a != b` | `a` is not equal to `b` | `a < b` | `a` is less than `b` | `a > b` | `a` is greater than `b` | `a <= b` | `a` is less than or equal to `b` | `a >= b` | `a` is greater than or equal to `b` --- ## `awk` comparison and logical operators | Comparison | Description |-------------|-------------| | `a == b` | `a` is equal to `b` | `a != b` | `a` is not equal to `b` | `a < b` | `a` is less than `b` | `a > b` | `a` is greater than `b` | `a <= b` | `a` is less than or equal to `b` | `a >= b` | `a` is greater than or equal to `b` | `a ~ /b/` | `a` matches regular expression pattern `b` | `a !~ /b/` | `a` does not match regular expression pattern `b` | `a && b` | logical and: `a` **and** `b` | `a` <code>|</code><code>|</code> `b` | logical or: `a` **or** `b` *[note typo in Buffalo]* | `!a` | not a (logical negation) --- ## <i class="fa fa-user-edit"></i> Your turn: `awk` The following exercises use `example.bed`. 1. Use a pattern to only print features (= rows) which end at position 28. 2. Use an action to print, for each row, the chromosome and the length of the feature (column 3 minus column 2). 3. Combine the action from 1. with the pattern (condition) from 2. --- ## <i class="fa fa-user-edit"></i> Solutions: `awk` 1. Use a pattern to only print features (= rows) which end at position 28. ```sh $ awk '$3 == 28' example.bed ``` 2. Use an action to print, for each row, the chromosome and the length of the feature (column 3 minus column 2). ```sh $ awk '{ print $1, $3 - $2 }' example.bed $ awk '{ print $1 "\t" $3 - $2 }' example.bed # Alt: tab-delimited ``` 3. Combine the action from 1. with the pattern (condition) from 2. ```sh $ awk '$3 == 28 { print $1, $3 - $2 }' example.bed ``` --- ## `awk`: Filtering and combining expressions - Combining patterns with `&&` – print *chr1* features longer than 10 bp: ```sh $ awk '$1 ~ /chr1/ && $3 - $2 > 10' example.bed ``` - Select lines with "chr2" *or* "chr3" using **`|`**: ```sh $ awk '$1 ~ /chr2|chr3/ { print $0 "\t" $3 - $2 }' example.bed ``` -- .content-box-info[ **'|'** can be used directly in `awk`, as it uses *extended regex* (ERE). ] .content-box-info[ We can use **`|`** *within* a regex, and **`||`** (and **`&&`**) to chain together multiple regex: ```sh $ awk '/chr1|chr2/' example.bed $ awk '/chr1/ || /chr2/' example.bed ``` ] --- ## `awk` so far > *So far, these exercises have illustrated two ways Awk can come in handy:* > - *For filtering data using rules that can combine regular expressions and arithmetic* > - *Reformatting the columns of data using arithmetic* --- ## `awk`: Actions before and after record processing - The `BEGIN` and `END` patterns can be used to specify actions before and after record processing. Here, we: - Before record processing, initialize a variable `sum`; - For each record, add the sum of the feature length (`$3`-`$2`) to `sum`; - After record processing, calculate the mean length by dividing the total length `sum` by `NR`, the number of records. ```sh $ awk 'BEGIN{ sum = 0 }; { sum += ($3 - $2) }; END{ print "mean: " sum/NR };' example.bed #> mean: 14 ``` -- .content-box-info[ The `+=` operator is shorthand for adding to a variable: `x += 1` means `x = x + 1`. It is used in many languages including Python. ] .content-box-info[ When using multiple pattern-action pairs, separate them with "**`;`**". ] --- ## `awk` special variables and keywords | keyword/<br>variable | meaning | |----------|------------| | `BEGIN` | Used as a pattern that matches the start of the file | `END` | Used as a pattern that matches the end of the file | `NR` | Number of Records (running count; in `END`: total nr. of lines) | `NF` | Number of Fields (for each record) | `$0` | Contains entire record (usually a line) | `$1` - `$n` | Contains one column each --- ## `awk` special variables and keywords | keyword/<br>variable | meaning | |----------|------------| | `BEGIN` | Used as a pattern that matches the start of the file | `END` | Used as a pattern that matches the end of the file | `NR` | Number of Records (running count; in `END`: total nr. of lines) | `NF` | Number of Fields (for each record) | `$0` | Contains entire record (usually a line) | `$1` - `$n` | Contains one column each | `FS` | Input Field Separator (default: any whitespace) | `OFS` | Output Field Separator (default: single space) | `RS` | Input Record Separator (default: newline) | `ORS` | Output Record Separator (default: newline) --- ## `awk` functions | Function | Meaning | |------------------|--------------------------------| | `length(<string>)` | Return number of characters | `tolower(<string>)` | Convert to lowercase | `toupper(<string>)` | Convert to uppercase | `substr(<string>, <start>, <end>)` | Return substring | `split(<string>, <array>, <delimiter>)` | Split into chunks in an array | `sub(<from>, <to>, <string>)` | Substitute (replace) regex | `gsub(<from>, <to> <string>)` | >1 substitution per line | **`print`** | Print, e.g. column: `print $1` | **`exit`** | Break out of record-processing loop; <br> e.g. to stop when match is found | `next` | Don't process later fields: to next iteration | --- ## Counting columns with `awk` - `NF` is the number of *fields*. This finally brings us to the column-counting example shown earlier in the Buffalo chapter: ```sh $ mus=Mus_musculus.GRCm38.75_chr1 $ awk -F "\t" '{print NF; exit}' "$mus".bed #> 3 ``` .content-box-q[ Why do we need the `exit` function here? ] -- - For the GTF file, we need to omit the header lines (starting with a `#`) before we can count columns. We can do this either with `grep` or with `awk` itself: ```sh $ grep -v "^#" "$mus".gtf | awk -F "\t" '{print NF; exit}' #> 9 $ awk -F "\t" '!/^#/ {print NF; exit}' "$mus".gtf #> 9 ``` --- class: inverse middle center name: sed # sed — the **s**tream **ed**itor ---- <br> <br> <br> <br> <br> --- ## Replacing strings with `sed` - `sed` is most often used to perform string replacements, using the syntax `'s/pattern/replacement/[modifiers]'`, where `s` stands for **substitute**. We can replace "chrom" by "chr" in every line of a file as follows: ```sh $ head -n 3 chroms.txt # before sed #> chrom1 3214482 3216968 #> ... $ sed 's/chrom/chr/' chroms.txt | head -n 3 #> chr1 3214482 3216968 #> ... ``` -- <br> .content-box-q[ How does this functionality differ from the `tr` command? ] --- ## Replacing strings with `sed` - `sed` is most often used to perform string replacements, using the syntax `'s/pattern/replacement/[modifiers]'`, where `s` stands for **substitute**. We can replace "chrom" by "chr" in every line of a file as follows: ```sh $ head -n 3 chroms.txt # before sed #> chrom1 3214482 3216968 #> ... $ sed 's/chrom/chr/' chroms.txt | head -n 3 #> chr1 3214482 3216968 #> ... ``` - For **global substitution** (>1 substitution per line), we use the **`g`** modifier, and for **case-insensitive** matching, we use the **`i`** modifier: ```sh $ sed 's/chrom/chr/ig' chroms.txt | head -n 3 ``` --- ## `sed` output options - Note that `sed` does not edit the file in place, and prints to standard out. Usually, we'll redirect the output to a new file: ```sh $ sed 's/chrom/chr/ig' chroms.txt > chroms_renamed.txt ``` -- - When we want replacements in place, **don't redirect to the same file!** ```sh $ sed 's/chrom/chr/ig' chroms.txt > chroms.txt # NO!! ``` -- - We can instruct `sed` to perform the replacement **in place** with the **`-i`** flag, so that we edit the file: ```sh $ cp chroms.txt chroms_inplace.txt $ sed -i 's/chrom/chr/' chroms_inplace.txt $ head -n 3 chroms_inplace.txt #> chr1 3214482 3216968 #> chr1 3216025 3216968 #> chr1 3216022 3216024 ``` --- ## Reformatting using `sed` - Let's say we want to change the genomic coordinates format `chr1:431-874` ("chrom:start-end") to one that has a tab (`"\t"`) between each field. ```sh # Use two consecutive sed calls: $ echo "chr1:431-874" | sed 's/:/\t/' | sed 's/-/\t/' #> chr1 431 874 # Use -e for multiple expressions: $ echo "chr1:431-874" | sed -e 's/:/\t/' -e 's/-/\t/' #> chr1 431 874 # Use a character class to match the ":" and "-" at once: $ echo "chr1:431-874" | sed 's/[:-]/\t/g' #> chr1 431 874 ``` -- .content-box-q[ Can you thing of a fourth way? ] .content-box-answer[ ```sh echo "chr1:431-874" | sed -E 's/:|-/\t/g' ``` ] --- ## Reformatting using `sed` - Let's say we want to change the genomic coordinates format `chr1:431-874` ("chrom:start-end") to one that has a tab (`"\t"`) between each field. ```sh # Use two consecutive sed calls: $ echo "chr1:431-874" | sed 's/:/\t/' | sed 's/-/\t/' #> chr1 431 874 # Use -e for multiple expressions: $ echo "chr1:431-874" | sed -e 's/:/\t/' -e 's/-/\t/' #> chr1 431 874 # Use a character class to match the ":" and "-" at once: $ echo "chr1:431-874" | sed 's/[:-]/\t/g' #> chr1 431 874 ``` .content-box-info[ As we're replacing single characters, `tr` would also work here:: ```sh $ echo "chr1:431-874" | tr ":-" "\t" ``` ] --- ## <i class="fa fa-user-edit"></i> Your turn: `sed` Make the following replacements in `Mus_musculus.GRCm38.75_chr1_genes.txt` with `sed`. For 1. and 2., don't write to new files: check your result by piping into `head`. 1. Oops, the geneIDs referenced the wrong organism! Replace `ENSMUS` by `GALGAL`. 2. Replace the tabs by an underscore (`_`). 3. Do both at the same time, and write to a new file `genes_fixed.txt`. --- ## <i class="fa fa-user-edit"></i> Solutions: `sed` 1. Replace `ENSMUS` by `GALGAL`. ```sh $ sed 's/ENSMUS/GALGAL/' "$mus"_genes.txt | head ``` 2. Replace the tabs by an underscore (`_`). ```sh $ sed 's/\t/_/' "$mus"_genes.txt | head ``` 3. Do both at the same time, and write to a new file `genes_fixed.txt`. ```sh # Approach 1 -- Pipe into a second sed call: $ sed 's/ENSMUS/GALGAL/' "$mus"_genes.txt | \ sed 's/\t/_/' > genes_fixed.txt # Approach 2 -- Use two expressions with "-e": $ sed -e 's/ENSMUS/GALGAL/' -e 's/\t/_/' "$mus"_genes.txt \ > genes_fixed.txt ``` --- ## Backreferences in regular expressions Let's say we need to reorder columns or other information we are matching with regular expressions. How can we do this with `sed`? We can use **backreferences**, which allow you to capture a matched string, and *recall it*. (A general regex feature: we will see this in Python.) .content-box-green[ Backreferences are *captured* in parentheses: **`(pattern1)(pattern2)`**, and *recalled* using **`\1`** for the first, **`\2`** for the second, etc. ] -- - For instance, in `sed`, to invert the order of two words: ```sh $ echo "inverted words" | sed -E 's/(\w+) (\w+)/\2 \1/' # words inverted ``` -- .content-box-info[ For `grep` & `sed`, don't forget to turn on extended regex with **`-E`** to allow for backreferences withour escape backslashes! ] --- ## Backreferences in `sed` for reformatting - Similarly, we can invert these two fields: ```sh $ echo "431-874" | \ sed -E 's/([0-9]+)-([0-9]+)/\2-\1/' #> 874-431 ``` - Going back to the format that has `chr` as well: ```sh $ echo "chr1:431-874" | \ sed -E 's/(chr[0-9]+):([0-9]+)-([0-9]+)/\1:\3-\2/' #> chr1:874-431 ``` -- - What if *some* chromosomes had names like `chr9a`? To allow for any character up until the delimiter (**`:`**), we could use **`:`** as a character class and *negate* it: ```sh $ echo "chr9a:425-787" | \ sed -E 's/(chr[^:]+):([0-9]+)-([0-9]+)/\1:\3-\2/' #> chr9a:787-425 ``` --- ## Print only matching or selected lines using `sed` - Let's say we want to **extract a list of transcript IDs** from a GTF file, which are formatted as `"transcript_id "ENSMUST00000160944"`. ```sh $ grep -v "^#" "$mus".gtf | \ sed -E 's/.*transcript_id "([^"]+)".*/\1/' | \ head -n 3 #> 1 pseudogene gene 3054233 3054733 . + . gene_id "ENSMUSG00000090025"; gene_name "Gm16088"; gene_source "havana"; gene_biotype "pseudogene"; #> ENSMUST00000160944 #> ENSMUST00000160944 ``` -- .content-box-q[ What went wrong? ] -- .content-box-answer[ By default, non-matching lines are also printed – and for those lines, there was no replacement to be made, and they were printed in full. ] --- ## Print only matching or selected lines using `sed` - To fix this, we can tell `sed` to only print matching lines by: - Using the **`-n`** option, which turns off the default behavior of printing all lines, *and* - Using the **`p`** modifier for matches, to print just matching lines: ```sh # Print only lines containing the pattern 'abc' (mimics grep!): $ sed -n '/abc/p' ``` -- - So let's apply this technique to our GTF file: ```sh $ grep -v "^#" "$mus".gtf | \ sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p' | \ head -n 3 #> ENSMUST00000160944 #> ENSMUST00000160944 #> ENSMUST00000082908 ``` --- ## Final `sed` - We can also use the `-n` and `p` syntax to simply print specific lines identified by their line numbers: ```sh $ sed -n '20,50p' "$mus".gtf # Print lines 20-50 ``` -- <br> - You can use **other delimiters than `/`** in the substitution command, which can be useful when patterns and/or replacement contain slashes: ```sh $ echo "data/fastq/sampleA.fastq" | sed 's#data/fastq/##' #> sampleA.fastq ``` .content-box-info[ Almost any character works as the delimiter, but the "rarer" the character, the better. ] --- ## Back to McIlroy's oneliner Print the 10 most common words in a file: ```sh $ cat file | \ tr -cs A-Za-z '\n' | \ tr A-Z a-z | \ sort | uniq -c | \ sort -rn | \ sed 10q ``` --- ## Back to McIlroy's oneliner Print the 10 most common words in a file: ```sh $ cat file | \ tr -cs A-Za-z '\n' | \ # Convert non-alphanum. chars to newlines & "squeeze" tr A-Z a-z | \ # Convert all letters to lowercase sort | uniq -c | \ # Get counts for every word sort -rn | \ # Sort in rev. num. order: highest first sed 10q # Print first 10 lines ``` --- class: inverse middle center # Questions? ---- <br> <br> <br> <br> --- class: inverse middle center # Bonus Material ---- <br> .left[ - ### [Greedy and non-greedy matching with regex](#greed) - ### [Another sed example](#more-sed) - ### [Process substitution](#process-sub) - ### [Subshells](#subshells) - ### [Optional installations: GNU tools and bioawk](#install) ] <br> <br> <br> <br> ---- <br> <br> <br> <br> --- background-color: #f2f5eb name: greed ## Greedy and non-greedy matching with regex - When matching with regular expressions, you need to be careful not to match too "greedily". For instance: When using a character to try to *delimit the end of the match*, but this character occurs multiple times in the line, problems can occur: ```sh $ echo 'transcript_id "ENSMUST00000160944"; gene_name "Gm16088"' \ > greedy_example.txt # We try to delimit the match by a quote `"`, but this fails: $ sed -E 's/transcript_id "(.*)".*/\1/' greedy_example.txt #> ENSMUST00000160944"; gene_name "Gm16088 ``` - Instead, you can basically invert the approach: use a character class to match everything except the delimiting character, so that the match is guaranteed to stop when the delimiting character occurs: ```sh $ sed -E 's/transcript_id "([^"]+)".*/\1/' greedy_example.txt # ENSMUST00000160944 ``` --- background-color: #f2f5eb name: more-sed ## Another `sed` example - Here is the reformatting example from Buffalo, which gives tab-separated (**`\t`**) output: ```sh $ echo "chr1:28427874-28425431" | \ sed -E 's/^(chr[^:]+):([0-9]+)-([0-9]+)/\1\t\2\t\3/' ``` <br> .content-box-info[ **Note the two different uses of the caret sign **`^`** above!** - The first caret, in **`s/^`**, matches the beginning of a line. - The second caret, in **`[^:]`**, is used as the first character inside a character class **`[]`**, which means that is functions to *negate*! So, **`[^:]`** will match any character that is *not* a colon **`:`**. ] --- background-color: #f2f5eb name: process-sub ## Process substitution With process substitution, we can treat command output **as if it had been written to a file**. - Capturing input streams: ```sh program --in1 <(makein raw1.txt) --in2 <(makein raw2.txt) \ --out1 out1.txt --out2 out2.txt ``` - Capturing output streams: ```sh program --in1 in1.txt --in2 in2.txt \ --out1 >(gzip > out1.txt.gz) --out2 >(gzip > out2.txt.gz) ``` - Combining both: ```sh program --in1 <(makein raw1.txt) --in2 <(makein raw2.txt) \ --out1 >(gzip > out1.txt.gz) --out2 >(gzip > out2.txt.gz) ``` .content-box-q[ Why is this useful? Why not write intermediate files? ] --- background-color: #f2f5eb name: subshells ## A subshell and a custom function - "Subshell" between `( )`: both `head` and `tail` get the same standard input: ```sh $ (head -n 2; tail -n 2) < file ``` -- - Creating a custom function with this subshell trick: ```sh # Note that there was an error in Buffalo's function, lacking a trailing ";" $ i() { (head -n 2; tail -n 2) < "$1" | column -t; } ``` - Now you can use `i` ("inspect") to print the first and last two lines of a file: ```sh $ i $mus.bed #> 1 3054233 3054733 #> 1 3054233 3054733 #> 1 195240910 195241007 #> 1 195240910 195241007 ``` .content-box-info[ You'll have to add the function to your bash config file `~/.bashrc` to be able to use it any time. ] --- background-color: #f2f5eb ## Another subshell example ```sh $ (zgrep "^#" "$mus".gtf.gz; \ zgrep -v "^#" "$mus".gtf.gz | sort -k1,1 -k4,4n) | \ gzip > "$mus"_sorted.gtf.gz ``` --- background-color: #f2f5eb name: installs ## Optional install for Mac users: <br> `GNU` instead of `BDS` tools Assumes you have already installed Homebrew — if not, go back to the optional installation instructions in Week 1. ```sh brew install coreutils # Basic tools like ls, cat, head, tail etc. brew install grep # To get GNU grep, not included in basic tools brew install gnu-sed # To get GNU sed, also not included in basic ``` - After this, use `gcat` instead of `cat`, `ggrep` instead of `grep` etc. - To check your installation, e.g.: ```sh ggrep --version ``` --- background-color: #f2f5eb ## Optional install: bioawk ```sh $ git clone git://github.com/lh3/bioawk.git && \ cd bioawk && make && sudo cp bioawk /usr/local/bin/ ```