pracs-sp21: Shell tools

Relevant course modules

Basic commands

Command	Description	Examples / options
`pwd`	Print current working directory (dir).	`pwd`
`ls`	List files in working dir (default) or elsewhere.	`ls data/` `-l` long format `-h` human-readable file sizes `-a` show hidden files
`cd`	Change working dir. As with all commands, you can use an absolute path (starting from the root dir `/`) or a relative path (starting from the current working dir).	`cd /fs/ess/PAS1855` (With absolute path) `cd ../..` (Two levels up) `cd -` (To previous dir)
`cp`	Copy files or, with `-r`, dirs and their contents (i.e., recursively). If the target is a dir, the file keeps its name; otherwise, a new name can be provided.	`cp *.fq data/` (All .fq files into dir data) `cp my.fq data/new.fq` (With new name) `cp -r data/ ~` (Copy dir and contents to home dir)
`mv`	Move/rename files or dirs (`-r` not needed). If target is a dir, file will keep same name; otherwise a new name can be provided.	`mv my.fq data/` (Keep same name) `mv my.fq my.fastq` (Simple rename) `mv file1 file2 mydir/` (Last arg is destination)
`rm`	Remove files or, with `-r`, dirs and their contents (recursively). With `-f` (force), any write-protections that you have set will be overridden.	`rm *fq*` (Remove all matching files) `rm -r mydir/` (Remove dir & contents) `-i` Prompt for confirmation `-f` Force remove
`mkdir`	Create a new dir. Use `-p` to create multiple levels at once and to avoid an error if the dir exists.	`mkdir my_new_dir` `mkdir -p new1/new2/new3`
`touch`	If file does not exist: create empty file. If file exists: change last-modified date.	`touch newfile.txt`
`cat`	Print file contents to standard out (screen).	`cat my.txt` `cat *.fa > concat.fa` (Concatenate files)
`head`	Print the first 10 lines of a file, or specify the number with `-n <n>` or the shorthand `-<n>`.	`head -n 40 my.fq` (Print 40 lines) `head -40 my.fq` (Equivalent)
`tail`	Like `head`, but print the last lines.	`tail -n +2 my.csv` (“Trick” to skip first line) `tail -f slurm.out` (“Follow” file)
`less`	View a file in a file pager; type `q` to exit. See below for more details.	`less myfile` `-S` disable line-wrapping
`column -t`	View a tabular file with columns nicely lined up in the shell.	Nice viewing of a CSV file: `column -s "," -t my.csv`
`history`	Print previously issued commands.	`history | grep "cut"` (Find previous `cut` usage)
`chmod`	Change file permissions for file owner (user, `u`), “group” (`g`), others (`o`) or everyone (all; `a`). Permissions can be set for reading (`r`), writing (`w`), and executing (`x`).	`chmod u+x script.sh` (Make script executable) `chmod a=r data/raw/*` (Make data read-only) `-R` recursive
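
As an illustration of how several of the commands above combine, here is a minimal sketch that runs in a throwaway directory. The file and directory names (`sampleA.fq`, `data/raw`, `my.csv`, and so on) are made up for the example and are not part of the course material.

```shell
# Work in a temporary directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir"

mkdir -p data/raw results            # create nested dirs in one go
touch sampleA.fq sampleB.fq          # create two empty files
cp sampleA.fq data/raw/              # copy, keeping the same name
mv sampleB.fq data/raw/renamed.fq    # move and rename in one step
rm -r results                        # remove a dir and its contents

# Small CSV file with a header line, to show the "tail -n +2" trick.
printf 'id,len\nA,10\nB,20\n' > my.csv
body=$(tail -n +2 my.csv)            # everything except the first line

listing=$(ls data/raw)
echo "Files in data/raw:"
echo "$listing"
echo "CSV without header:"
echo "$body"
```

Because `cp` leaves the original in place while `mv` does not, `sampleA.fq` still exists in the working dir afterwards and `sampleB.fq` does not.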

Data tools

Command	Description	Examples and options
`wc -l`	Count the number of lines in a file.	`wc -l my.fq`
`cut`	Select one or more columns from a file.	Select columns 1-4: `cut -f 1-4 my.csv` `-d ","` comma as delimiter
`sort`	Sort lines. The `-V` option sorts numbers embedded in strings naturally, e.g., `chr10` after `chr2`.	Sort column 1 alphabetically, column 2 reverse numerically: `sort -k1,1 -k2,2nr my.bed` `-k 1,1` by column 1 only `-n` numerical sorting `-r` reverse order `-V` natural sort of numbers within strings
`uniq`	Remove consecutive duplicate lines (often from single-column selection): i.e., removes all duplicates if input is sorted.	Unique values for column 2: `cut -f2 my.tsv | sort | uniq`
`uniq -c`	If input is sorted, create a count table for occurrences of each line (often from single-column selection).	Count table for column 3: `cut -f3 my.tsv | sort | uniq -c`
`tr`	Substitute (translate) characters or character classes (like `A-Z` for uppercase letters). Does not take files as argument; piping or redirection needed. To “squeeze” (`-s`) is to remove consecutive duplicates (akin to `uniq`).	TSV to CSV: `cat my.tsv | tr "\t" ","` Uppercase to lowercase: `tr A-Z a-z < in.txt > out.txt` `-d` delete `-s` squeeze
`grep`	Search files for a pattern and print matching lines (or only the matching string with `-o`). Default regex is basic (GNU BRE): use `-E` for extended regex (GNU ERE) and `-P` for Perl-like regex. To print lines surrounding a match, use `-A n` (`n` lines after match) or `-B n` (`n` lines before match) or `-C n` (`n` lines before and after match).	Match AAC or AGC: `grep "A[AG]C" my.fa` Omit comment lines: `grep -v "^#" my.gff` `-c` count `-i` ignore case `-r` recursive `-v` invert `-o` print match only
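
The data tools are most useful when chained with pipes. The sketch below builds a small tab-separated file (its contents are invented for illustration) and then counts lines, builds a count table for column 3, converts tabs to commas, and counts matching lines.

```shell
# Create a small, made-up TSV file to work on.
tsv=$(mktemp)
printf 'chr1\t10\tgene\nchr2\t20\texon\nchr1\t30\texon\nchr2\t40\texon\n' > "$tsv"

# Number of lines (redirect the file in so only the number is printed).
n_lines=$(wc -l < "$tsv")

# Count table for column 3, most frequent value first.
counts=$(cut -f 3 "$tsv" | sort | uniq -c | sort -k1,1nr)

# TSV to CSV.
csv=$(tr '\t' ',' < "$tsv")

# Number of lines containing "exon".
n_exon=$(grep -c "exon" "$tsv")

echo "Lines: $n_lines; lines with exon: $n_exon"
echo "$counts"
echo "$csv"
```

Note that `uniq -c` only gives correct totals here because the input is sorted first; without `sort`, repeated values that are not adjacent would be counted separately.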

Miscellanea

Symbol	Meaning	Example
`/`	Root directory.	`cd /`
`.`	Current working directory.	`cp data/file.txt .` (Copy to working dir) Use `./` to execute script if not in `$PATH`: `./myscript.sh`
`..`	One directory level up.	`cd ../..` (Move 2 levels up)
`~` or `$HOME`	Home directory.	`cp myfile.txt ~` (Copy to home)
`$USER`	User name.	`mkdir $USER`
`>`	Redirect standard out to a file.	`echo "My 1st line" > myfile.txt`
`>>`	Append standard out to a file.	`echo "My 2nd line" >> myfile.txt`
`2>`	Redirect standard error to a file.	Send standard out and standard error for a script to separate files: `myscript.sh > log.txt 2> err.txt`
`&>`	Redirect standard out and standard error to a file.	`myscript.sh &> log.txt`
`|`	Pipe standard out (output) of one command into standard in (input) of a second command.	The output of the `sort` command will be piped into `head` to show the first lines: `sort myfile.txt | head`
`{}`	Brace expansion. Use `..` to indicate numeric or character ranges (`1..4` => `1`, `2`, `3`, `4`) and `,` to separate items.	`mkdir Jan{01..31}` (Jan01, Jan02, …, Jan31) `touch fig1{A..F}` (fig1A, fig1B, …, fig1F) `mkdir fig1{A,D,H}` (fig1A, fig1D, fig1H)
`$()`	Command substitution. Allows for flexible usage of the output of any command: e.g., use command output in an `echo` statement or assign it to a variable.	Report number of FASTQ files: `echo "I see $(ls *fastq | wc -l) files"` Substitute with date in YYYY-MM-DD format: `mkdir results_$(date +%F)` Store a line count in a variable: `nlines=$(wc -l < $infile)`
`$PATH`	Contains colon-separated list of directories with executables: these will be searched when trying to execute a program by name.	Add dir to path: `PATH=$PATH:/new/dir` (But for lasting changes, edit the Bash configuration file `~/.bashrc`.)
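
To make the redirection and substitution symbols more concrete, the following sketch writes to a file, appends to it, separates standard out from standard error, and captures command output in variables. The file names, including the deliberately missing `no_such_file`, are assumptions made for the example.

```shell
workdir=$(mktemp -d)
cd "$workdir"

echo "My 1st line" > myfile.txt      # ">" creates or overwrites the file
echo "My 2nd line" >> myfile.txt     # ">>" appends to it

# Command substitution: store the line count in a variable.
nlines=$(wc -l < myfile.txt)

# "ls" reports the existing file on standard out and complains about the
# missing one on standard error, so the two end up in different files.
# ("|| true" only keeps the non-zero exit status from stopping a script.)
ls myfile.txt no_such_file > out.txt 2> err.txt || true

# Pipe: reverse-sort the file and keep only the first line of the result.
first=$(sort -r myfile.txt | head -n 1)

echo "myfile.txt has $nlines lines; first after reverse sort: $first"
```

After running this, `out.txt` holds the normal listing while `err.txt` holds the error message about the missing file.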

Shell wildcards

Wildcard	Matches	Examples
`*`	Any number of any character, including nothing.	`ls data/*fastq.gz` (Matches any file in `data/` ending in “fastq.gz”) `ls *R1*` (Matches any file containing “R1” somewhere in the name)
`?`	Any single character.	`ls sample1_?.fastq.gz` (Matches `sample1_A.fastq.gz` but not `sample1_AA.fastq.gz`)
`[]` and `[^]`	Any one character that is in (or, with `^`, not in) the “character set” within the brackets.	`ls fig1[A-C]` (Matches `fig1A`, `fig1B`, `fig1C`) `ls fig[0-3]` (Matches `fig0`, `fig1`, `fig2`, `fig3`) `ls fig[^4]*` (Does not match files with a “4” after “fig”)
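
The wildcards can be tried out safely on empty files. This sketch assumes Bash (for brace expansion and the `[^...]` form) and uses invented file names; `echo` is used instead of `ls` simply to show what the shell expands each pattern to.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Create some empty files to match against (brace expansion needs Bash).
touch fig1{A..D} sample1_A.fastq.gz sample1_AA.fastq.gz

# "*" matches any number of characters: both FASTQ files.
n_star=$(ls *fastq.gz | wc -l)

# "?" matches exactly one character: only sample1_A.fastq.gz.
question=$(echo sample1_?.fastq.gz)

# "[A-C]" matches one character from the set: fig1A, fig1B, fig1C.
range=$(echo fig1[A-C])

# "[^D]" matches one character NOT in the set: again fig1A, fig1B, fig1C.
negated=$(echo fig1[^D])

echo "$n_star files end in fastq.gz"
echo "? matched: $question"
echo "[A-C] matched: $range"
echo "[^D] matched: $negated"
```

Expansion is done by the shell before the command runs, which is why the same patterns work with `ls`, `cp`, `rm`, and any other command.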

Regular expressions

Note: ERE = GNU “Extended Regular Expressions”. If “yes” in the ERE column, then the symbol needs ERE to work¹: use the `-E` flag to turn on ERE for `grep` and `sed` (`awk` uses ERE by default).

Symbol	ERE	Matches	Example
`.`		Any single character	Match `Olfr` with none or any characters after it: `grep -o "Olfr.*"`
`*`		Quantifier: matches preceding character any number of times (including zero)	See previous example.
`+`	yes	Quantifier: matches preceding character at least once	One or more consecutive digits: `grep -E "[0-9]+"`
`?`	yes	Quantifier: matches preceding character at most once	Zero or one digit: `grep -E "[0-9]?"`
`{m}` / `{m,}` / `{m,n}`	yes	Quantifier: match preceding character `m` times / at least `m` times / `m` to `n` times	Between 50 and 100 consecutive Gs: `grep -E "G{50,100}"`
`^` / `$`		Anchors: match beginning / end of line	Exclude empty lines: `grep -v "^$"` Exclude lines beginning with a “#”: `grep -v "^#"`
`\t`		Tab (To match in `grep`, needs `-P` flag for Perl-like regex)	`echo -e "column1 \t column2"`
`\n`		Newline (Not straightforward to match since Unix tools are line-based.)	`echo -e "Line1 \n Line2"`
`\w`	(yes)	“Word” character: any alphanumeric character or “_”. In GNU `grep` and `sed`, `\w` itself works without `-E`; the `+` quantifier in the first example is what needs ERE.	Match `gene_id` followed by a space and a quoted “word”: `grep -E -o 'gene_id "\w+"'` Change every word character to X: `sed 's/\w/X/g'`
`|`	yes	Alternation / logical or: match either the string before or after the `|`	Find lines with either `intron` or `exon`: `grep -E "intron|exon"`
`()`	yes	Grouping	Find “AAG” repeated 10 times: `grep -E "(AAG){10}"`
`\1`, `\2`, etc.	yes	Backreferences to groups captured with `()`: first group is `\1`, second group is `\2`, etc.	Invert order of two words: `sed -E 's/(\w+) (\w+)/\2 \1/'`
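
A few of the symbols above can be checked quickly on short test strings. The sketch below assumes GNU `grep` and `sed` (`\w` is a GNU extension), and the input strings are invented for the example.

```shell
# Quantifier "+": extract one or more consecutive digits.
digits=$(echo "sample42" | grep -E -o '[0-9]+')

# "\w+" inside quotes: extract a gene_id entry from a GTF-like string.
gene=$(echo 'gene_id "Olfr56"; note "x"' | grep -E -o 'gene_id "\w+"')

# Alternation: count lines matching either "intron" or "exon".
either=$(printf 'intron\nexon\nCDS\n' | grep -E -c 'intron|exon')

# Grouping plus a quantifier: "AAG" repeated three times.
repeat=$(echo "AAGAAGAAG" | grep -E -c '(AAG){3}')

# Backreferences: swap two captured words.
swapped=$(echo "inverted words" | sed -E 's/(\w+) (\w+)/\2 \1/')

echo "$digits"    # 42
echo "$gene"      # gene_id "Olfr56"
echo "$either"    # 2
echo "$repeat"    # 1
echo "$swapped"   # words inverted
```

Single quotes around each pattern keep the shell from interpreting characters such as `\`, `|`, `(` and `{` before `grep` or `sed` sees them.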

More details for a few commands

`less`

Key	Function
`q`	Exit `less`
`space` / `b`	Go down / up a page. (`pgdn` / `pgup` usually also work.)
`d` / `u`	Go down / up half a page.
`g` / `G`	Go to the first / last line (`home` / `end` also work).
`/<pattern>` or `?<pattern>`	Search for `<pattern>` forwards / backwards: type your search after `/` or `?`.
`n` / `N`	When searching, go to next / previous search match.

`sed`

`sed` flags:

Flag	Meaning
`-E`	Use extended regular expressions
`-e`	When using multiple expressions, precede each with `-e`
`-i`	Edit a file in place
`-n`	Don’t print lines unless specified with `p` modifier

`sed` examples

# Replace "chrom" by "chr" in every line,
# with "i": case insensitive, and "g": global (>1 replacements per line)
sed 's/chrom/chr/ig' chroms.txt

# Only print lines matching "abc":
sed -n '/abc/p' my.txt

# Print lines 20-50:
sed -n '20,50p' my.txt

# Change the genomic coordinates format chr1:431-874 ("chrom:start-end")
# ...to one that has a tab ("\t") between each field:
echo "chr1:431-874" | sed -e 's/:/\t/' -e 's/-/\t/'
#> chr1    431     874

# Invert the order of two words:
echo "inverted words" | sed -E 's/(\w+) (\w+)/\2 \1/'
#> words inverted

# Capture transcript IDs from a GTF file (format 'transcript_id "ID_I_WANT"'):
# (Needs "-n" and "p" so lines with no transcript_id are not printed.) 
grep -v "^#" my.gtf | sed -E -n 's/.*transcript_id "([^"]+)".*/\1/p'

# When a pattern contains a `/`, use a different expression delimiter:
echo "data/fastq/sampleA.fastq" | sed 's#data/fastq/##'
#> sampleA.fastq

`awk`

Records and fields: by default, each line is a record (assigned to $0). Each column is a field (assigned to $1, $2, etc).

Patterns and actions: A pattern is a condition to be tested, and an action is something to do when the pattern evaluates to true.

Omit the pattern: action applies to every record.

awk '{ print $0 }' my.txt     # Print entire file
awk '{ print $3,$2 }' my.txt  # Print columns 3 and 2 for each line

Omit the action: print full records that match the pattern.

# Print all lines for which:
awk '$3 < 10' my.bed          # Column 3 is less than 10
awk '$1 == "chr1"' my.bed     # Column 1 is "chr1"
awk '/chr1/' my.bed           # Regex pattern "chr1" matches
awk '$1 ~ /chr1/' my.bed      # Column 1 _matches_ "chr1"

`awk` examples

# Count columns in a GTF file after excluding the header
# (lines starting with "#"):
awk -F "\t" '!/^#/ {print NF; exit}' my.gtf

# Print all lines for which column 1 matches "chr1" and the difference
# ...between columns 3 and 2 (feature length) is greater than 10:
awk '$1 ~ /chr1/ && $3 - $2 > 10' my.bed

# Select lines with "chr2" or "chr3", print all columns and add a column 
# ...with the difference between column 3 and 2 (feature length):
awk '$1 ~ /chr2|chr3/ { print $0 "\t" $3 - $2 }' my.bed

# Calculate the mean feature length (column 3 minus column 2) across all lines:
awk 'BEGIN{ sum = 0 };            
     { sum += ($3 - $2) };             
     END{ print "mean: " sum/NR };' my.bed

`awk` comparison and logical operators

Comparison	Description
`a == b`	`a` is equal to `b`
`a != b`	`a` is not equal to `b`
`a < b`	`a` is less than `b`
`a > b`	`a` is greater than `b`
`a <= b`	`a` is less than or equal to `b`
`a >= b`	`a` is greater than or equal to `b`
`a ~ /b/`	`a` matches regular expression pattern `b`
`a !~ /b/`	`a` does not match regular expression pattern `b`
`a && b`	logical and: `a` and `b`
`a || b`	logical or: `a` or `b` [note typo in Buffalo]
`!a`	not a (logical negation)

`awk` special variables and keywords

keyword/ variable	meaning
`BEGIN`	Used as a pattern whose action runs once, before any input is read
`END`	Used as a pattern whose action runs once, after all input has been read
`NR`	Number of Records (running count; in `END`: total number of records, i.e., usually lines)
`NF`	Number of Fields (for each record)
`$0`	Contains entire record (usually a line)
`$1` - `$n`	Contains one column each
`FS`	Input Field Separator (default: any whitespace)
`OFS`	Output Field Separator (default: single space)
`RS`	Input Record Separator (default: newline)
`ORS`	Output Record Separator (default: newline)
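
To see the special variables in action, here is a small sketch on a made-up three-line BED-like file: `FS` and `OFS` are set in a `BEGIN` block, `NR` and `NF` are printed for each record, and `NR` in an `END` block gives the total.

```shell
# Create a small, invented tab-separated file.
bed=$(mktemp)
printf 'chr1\t10\t25\nchr2\t5\t9\nchr1\t100\t180\n' > "$bed"

# Read tab-separated input, write comma-separated output.
# For each record: record number, number of fields, first column.
numbered=$(awk 'BEGIN{ FS = "\t"; OFS = "," } { print NR, NF, $1 }' "$bed")

# In the END block, NR holds the total number of records.
total=$(awk 'END{ print NR }' "$bed")

echo "$numbered"
#> 1,3,chr1
#> 2,3,chr2
#> 3,3,chr1
echo "Total records: $total"
```

`OFS` only takes effect between items separated by commas in `print`; items separated by spaces are concatenated without any separator.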

`awk` functions

Function	Meaning
`length(<string>)`	Return number of characters
`tolower(<string>)`	Convert to lowercase
`toupper(<string>)`	Convert to uppercase
`substr(<string>, <start>, <length>)`	Return substring of `<length>` characters beginning at position `<start>`
`split(<string>, <array>, <delimiter>)`	Split into chunks in an array
`sub(<from>, <to>, <string>)`	Substitute (replace) first regex match
`gsub(<from>, <to>, <string>)`	Substitute all regex matches (>1 substitution per line)
`print`	Print, e.g. column: `print $1`
`exit`	Break out of record-processing loop; e.g. to stop when match is found
`next`	Skip the remaining patterns and actions for the current record and move on to the next record
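
Each of the string functions can be tried on a single echoed string. The inputs below are invented for illustration; note in particular that the third argument of `substr` is a length, not an end position.

```shell
# length: number of characters in the record.
len=$(echo "sampleA.fastq" | awk '{ print length($0) }')             # 13

# toupper: convert to uppercase.
up=$(echo "chr1" | awk '{ print toupper($0) }')                      # CHR1

# substr: 4 characters starting at position 3.
piece=$(echo "ACGTACGT" | awk '{ print substr($0, 3, 4) }')          # GTAC

# split: break "chrom:start-end" on ":" or "-"; returns the number of chunks.
parts=$(echo "chr1:431-874" | awk '{ n = split($0, a, /[-:]/); print n, a[2] }')  # 3 431

# sub replaces only the first match, gsub replaces all matches.
first_only=$(echo "a-b-c" | awk '{ sub(/-/, "_"); print }')          # a_b-c
all=$(echo "a-b-c" | awk '{ gsub(/-/, "_"); print }')                # a_b_c

echo "$len $up $piece | $parts | $first_only $all"
```

When the third argument to `sub` or `gsub` is omitted, as here, the substitution is applied to the whole record (`$0`).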

Keyboard shortcuts

Shortcut	Function
`Tab`	Tab completion
`↑` / `↓`	Cycle through previously issued commands
`Ctrl`+`Shift`+`C`	Copy selected text
`Ctrl`+`Shift`+`V`	Paste text from clipboard
`Ctrl`+`A` / `Ctrl`+`E`	Go to beginning/end of line
`Ctrl`+`U` / `Ctrl`+`K`	Cut from cursor to beginning / end of line²
`Ctrl`+`W`	Cut word before cursor³
`Ctrl`+`Y`	Paste (“yank”)
`Alt`+`.`	Last argument of previous command (very useful!)
`Ctrl`+`R`	Search history: press `Ctrl`+`R` again to cycle through matches, `Enter` to put command in prompt.
`Ctrl`+`C`	Kill (stop) currently active command
`Ctrl`+`D`	Exit (a program or the shell depending on the context)
`Ctrl`+`Z`	Suspend (pause) a process: then use `bg` to move to background.

¹ When using the default regular expressions in `grep` and `sed`, Basic Regular Expressions (BRE), the symbol would need to be preceded by a backslash to work.
² `Ctrl`+`K` doesn’t work by default in VS Code, but can be set there.
³ Doesn’t work by default in VS Code, but can be set there.
