class:inverse middle center ## *Week 5: Shell scripting* ---- # Part I: <br> Setting the stage for scripting: <br> Variables, loops, arrays, and conditionals <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/02/09 --- ## Setting up - Start VS Code in OSC OnDemand => open your workspace => open a terminal => type `bash` to break out of the Singularity shell. - Move into the Chapter 12 directory in your copy of the Buffalo Git repository, likely at: ```sh $ cd /fs/ess/PAS1855/users/$USER/bds-files/chapter-12-pipelines # Lost the repo? Go to your directory and run: # git clone https://github.com/vsbuffalo/bds-files.git ``` --- class: inverse middle center # Overview ---- .left[ - ### [Variables (and quoting)](#variables) - ### [For loops](#for) - ### [Bash arrays](#arrays) - ### [Conditionals](#conditionals) ] <br> --- class: inverse middle center name: variables # Variables (and quoting) ---- <br> <br> <br> <br> <br> --- ## Variables > *"Processing pipelines having numerous settings that should be stored in variables (e.g., which directories to store results in, parameter values for commands, input files, etc.).* >*Storing these settings in a variable defined at the top of the file makes adjusting settings and rerunning your pipelines much easier.* >*Rather than having to change numerous hardcoded values in your scripts, using variables to store settings means you only have to change one value—the value you’ve assigned to the variable."* — Buffalo ch. 12 <br> In brief, use variables for things that: - You refer to repeatedly. - Are subject to change. -- Similarly, variables are useful to be able to **pass arguments to a script**, so you can easily rerun a script with a different input file / settings / etc. --- ## Assigning and using shell variables A variable is nothing more than a *pointer to the actual data*. .pull-left[ **Assign** a variable (no **`$`**): ```sh # Any string or number: treatment=low nlines=200 options="-lh" # (No spaces around "="!) ``` ] .pull-right[ **Reference** a variable (with **`$`**): ```sh # Any string or number: $ echo $treatment #> low $ expr $nlines / 4 #> 50 $ ls $options #> total 56K #> -rw-rw-r-- 1 jelmer jelmer 1.2K Nov 28 16:54 egfr_flank.fasta ``` ] -- .pull-left[ ```sh # A file name: infile="egfr_flank.fasta" ``` ] .pull-right[ ```sh $ echo $infile #> egfr_flank.fasta $ ls -lh $infile #> -rw-r--r-- 1 jelmer PAS0471 568 Dec 7 09:58 egfr_flank.fasta ``` ] --- ## Assigning and using shell variables (cont.) .pull-left[ **Assign** a variable (no **`$`**): ```sh # Use "command substitution": $ today=$(date +%F) $ nlines=$(wc -l < $infile) ``` ] .pull-right[ **Reference** a variable (with **`$`**): ```sh $ touch README_$today.txt $ ls README_* # README_2020-12-10.txt $ echo "$infile has $nlines lines" #> egfr_flank.fasta has 25 lines ``` ] -- .pull-left[ Pre-existing variables: ("**Environment variables**") ] .pull-right[ ```sh $ echo $USER #> jelmer $ echo $HOME #> /users/PAS0471/jelmer ``` ] --- ## Quoting variables - What happens if the value of our variable contains spaces? ```sh $ today="Thu, Feb 8" $ echo $today #> Thu, Feb 8 $ touch README_$today.txt $ ls #> 8.txt Feb README_Thu, ``` - Oops! The shell performed **field splitting** to split the value into three separate units – as a result, three files were created. This can be avoided by **quoting the variable**: ```sh $ touch README_"$today".txt $ ls #> 'README_Thu, Feb 8.txt' ``` --- ## Quoting variables (cont.) - Without quoting, we also can't explicitly say where a variable name ends: ```sh $ touch README_$today_final.txt #> README_.txt # Intended was "README_thu, Feb 8_final.txt $ day_partial=Thu $ day_full=$day_partialrsday $ echo $day_full #> # Intended was "Thursday" ``` -- - Quoting solves this, too: ```sh $ touch README_"$today"_final.txt #> 'README_Thu, Feb 8_final.txt' $ day_full="$day_partial"rsday $ echo $day_full #> Thursday ``` .content-box-info[ It is good practice to quote your variables – certainly in scripts. ] --- ## Quoting variables (cont.) .content-box-info[ Putting variable names between curly braces will also make it clear where the variable name begins and ends, but it does not prevent field splitting: ```sh $ touch README_${today}_final.txt #> 8_final.txt .txt Feb README_Thu, ``` But you can combine curly braces and quoting: ```sh $ touch README_"${today}"_final.txt #> 'README_Thu, Feb 8_final.txt' ``` (Realizing that `$today` and `${today}` are equivalent will also make the bash array notation –treated soon– seem a little less mystifying.) ] --- class: inverse middle center name: for # For loops ---- <br> <br> <br> <br> <br> --- ## For loops ```sh $ for mushroom in morel destroying_angel eyelash_cup; do echo "$mushroom is an Ohio mushroom" done #> morel is an Ohio mushroom #> destroying_angel is an Ohio mushroom #> eyelash_cup is an Ohio mushroom ``` <br> <br> <br> <br> <br> <br> | Keyword | Purpose |---------|--------- | `for` | Set the loop variable name | `in` | Specify whatever it is we are looping over | `do` | Specify what we want to do with each item | `done` | Tell the shell we are done with the loop --- ## For loops ```sh $ for mushroom in morel destroying_angel eyelash_cup; do echo "$mushroom is an Ohio mushroom" done #> morel is an Ohio mushroom #> destroying_angel is an Ohio mushroom #> eyelash_cup is an Ohio mushroom ``` ```sh $ for fungus in morel destroying_angel eyelash_cup; do echo "$fungus is an Ohio mushroom" done #> morel is an Ohio mushroom #> destroying_angel is an Ohio mushroom #> eyelash_cup is an Ohio mushroom ``` | Keyword | Purpose |---------|--------- | `for` | Set the loop variable name | `in` | Specify whatever it is we are looping over | `do` | Specify what we want to do with each item | `done` | Tell the shell we are done with the loop --- ## For loops ```sh $ for mushroom in morel destroying_angel eyelash_cup; do echo "$mushroom is an Ohio mushroom" done #> morel is an Ohio mushroom #> destroying_angel is an Ohio mushroom #> eyelash_cup is an Ohio mushroom ``` .content-box-info[ A semicolon `;` separates two commands written on a single line: ```sh $ cd data; ls # Change dir _and then_ list files ``` ] -- .content-box-info[ Alternative formatting that puts `do` on its own line: ```sh $ for fungus in morel destroying_angel eyelash_cup do echo "$fungus is an Ohio mushroom" done ``` ] --- ## For loops (cont.) - In practice, we rarely enumerate by hand the items we want to loop over. Instead, we commonly loop over files using globbing: ```sh $ for fastq_file in data/raw/*fastq.gz; do echo "File $fastq_file has $(wc -l < $fastq_file) lines." # More processing... done ``` - Or perhaps over some predefined sequence like so: ```sh $ seq 1 100 #> 1 #> 2 #> ... $ for i in $(seq 1 100); do echo "Run $i out of 100" # More processing... done ``` --- ## For loops (cont.) - We can also loop over items in an *array*: ```sh $ for sample_name in "${sample_names[@]}"; do echo "Processing sample $sample_name" input_file="$sample_name"_R1.fastq # More processing... done ``` <br> <br> <br> <br> .content-box-info[ For another type of loop, the `while` loop, see [these bonus slides](#while). ] --- class: inverse middle center name: arrays # Bash arrays ---- <br> <br> <br> <br> <br> --- ## First – Our metadata files - The file `samples.txt` is in the Buffalo repo: ```sh $ cat samples.txt #> zmaysA R1 seq/zmaysA_R1.fastq #> zmaysA R2 seq/zmaysA_R2.fastq #> zmaysB R1 seq/zmaysB_R1.fastq #> zmaysB R2 seq/zmaysB_R2.fastq #> zmaysC R1 seq/zmaysC_R1.fastq #> zmaysC R2 seq/zmaysC_R2.fastq ``` - We will also create a file with just the paths to the FASTQ files, by selecting only the third column from `samples.txt`: ```sh $ cut -f 3 samples.txt > fastq_files.txt $ head -n 3 fastq_files.txt #> seq/zmaysA_R1.fastq #> seq/zmaysA_R2.fastq #> seq/zmaysB_R1.fastq ``` --- ## Bash arrays Bash arrays are like lists in Python or vectors in R, and can be created using parentheses `()`. - Create an array manually: ```sh $ sample_names=(zmaysA zmaysB zmaysC) ``` - Create an array using command substitution: ```sh $ sample_files=($(cut -f 3 samples.txt)) $ sample_files=($(cat fastq_files.txt)) ``` --- ## Bash arrays (cont.) - Using `[@]`, we can access all elements in the array: ```sh $ echo ${sample_names[@]} # Print all elements in the array #> zmaysA zmaysB zmaysC ``` - Arrays are mostly useful to loop over, and we can also use the `[@]` notation to loop over the elements: ```sh $ for sample_name in "${sample_names[@]}"; do echo "$sample_name" done #> zmaysA zmaysB zmaysC ``` --- ## Bash arrays (cont.) .content-box-info[ **Other array operations** - Extract specific elements (note: Bash arrays are 0-indexed!): ```sh $ echo ${sample_names[0]} #> zmaysA $ echo ${sample_names[2]} #> zmaysC ``` - Count the number of elements: ```sh echo ${#sample_names[@]} #> 3 ``` - Enumerate the element indices: ```sh $ echo ${!sample_names[@]} #> 0 1 2 ``` ] --- ## Arrays and filenames with spaces ```sh $ cat files.txt #> file_A #> file_B #> file_C #> file D ``` ```sh $ all_files=($(cat files.txt)) $ for file in "${all_files[@]}"; do echo $file done #> file_A #> file_B #> file_C #> file #> D ``` --- ## Arrays and filenames with spaces ```sh $ all_files=($(cat files.txt)) $ for file in "${all_files[@]}"; do echo $file done #> file_A #> file_B #> file_C #> file #> D ``` .content-box-info[ For this reason, it's best not to use arrays to loop over filenames with spaces — though see [these bonus slides](#IFS) for a workaround. *Direct globbing* and `while` loops with the `read` function (`while read ...`) are easier choices for problematic file names – again, see [these bonus slides](#IFS) for details. **Most of all, this once again shows you should not have spaces in your file names!** ] --- ## <i class="fa fa-user-edit"></i> Bash arrays: Your turn 1. Create an array with the first three file names (lines) listed in `samples.txt`. 2. Loop over the contents of the array with a `for` loop. Inside the loop, create (`touch`) the file listed in the current array element. 3. Check whether you created your files. --- ## <i class="fa fa-user-edit"></i> Bash arrays: Solutions 1. Create an array with the first three file names (lines) listed in `samples.txt`. ```bash $ good_files=($(head -n 3 files.txt)) ``` 2. Loop over the contents of the array with a `for` loop. Inside the loop, create (`touch`) the file listed in the current array element. ```bash $ for good_file in ${good_files[@]}; do touch $good_file done ``` 3. Check whether you created your files. ```bash $ ls #> file_A file_B file_C ``` --- class: inverse middle center name: arrays # Conditionals ---- <br> <br> <br> <br> <br> --- ## Conditionals - With conditionals, we can **run one or more commands only if some condition is true**. Also, we can run a different set of commands if the condition is _not_ true. For instance, we may want to process a file differently depending on its file type. - This is the basic syntax of an `if` statement in bash: ```sh $ if <test>; then # Command(s) if condition is true else # Commands(s) if condition is false fi ``` --- ## Conditionals - With conditionals, we can **run one or more commands only if some condition is true**. Also, we can run a different set of commands if the condition is _not_ true. For instance, we may want to process a file differently depending on its file type. - For instance (pseudocode): ```sh $ if <file is a text file>; then <count the number of lines and columns> else <print the file type and file size> fi ``` -- - We can also omit the `else` statement: ```sh $ if <test>; then # Command(s) if condition is true fi ``` --- ## Conditionals Example – an `if` statement that depends on the file type of the input file: ```sh $ if test "$filetype" = "fastq"; then echo "Processing fastq file..." # Commands to process fastq file else echo "Unknown filetype!" fi ``` Above, we use the `test` command to perform the test. --- ## Conditionals An `if` statement that depends on the file type of the input file: ```sh $ if [ "$filetype" = "fastq" ]; then echo "Processing fastq file..." # Commands to process fastq file else echo "Unknown filetype!" fi ``` Above, we use the `[ ]` syntax to perform the test: - The square brackets represent a *test statement*. - The spaces bordering the brackets on the inside are necessary: `["$filetype" = "fastq"]` would fail! --- ## Comparison operators (Buffalo Table 12-1) | String | Description |-------------------|---------------- | `-z str` | String `str` is null (empty) | `str1 = str2` | Strings `str1` and `str2` are identical | `str1 != str2` | Strings `str1` and `str2` are different --- ## Comparison operators (Buffalo Table 12-1) | String | Description |-------------------|---------------- | `-z str` | String `str` is null (empty) | `str1 = str2` | Strings `str1` and `str2` are identical | `str1 != str2` | Strings `str1` and `str2` are different | Integer | Description |-------------------|---------------- | `int1 -eq int2` | Integers `int1` and `int2` are equal | `int1 -ne int2` | Integers `int1` and `int2` are not equal | `int1 -lt int2` | Integer `int1` is less than `int2` | `int1 -gt int2` | Integer `int1` is greater than `int2` | `int1 -le int2` | Integer `int1` is less than or equal to `int2` | `int1 -ge int2` | Integer `int1` is greater than or equal to `int2` -- .content-box-q[ Why can we not just use `int1 > int2`, and so on? ] --- ## Another example with a comparison operator Say that we want to run a program with options that depend on our number of samples: ```sh $ n_samples=$(wc -l < samples.txt) $ if [ "$n_samples" -gt 9 ]; then # If the nr of samples is >9 # Commands to run if nr of samples >9: echo "Processing files with algorithm A" else # Commands to run if nr of samples is <=9: echo "Processing files with algorithm B..." fi ``` --- ## File and directory test expressions <br> (Buffalo Table 12-2) It can be useful to perform tests regarding file types and permissions, for which the following expressions are available: | File/dir expression | Description |---------------------------|------------ | `-d dir` | Dir is a directory | `-f file` | File is a file | `-e file` | File exists | `-h link` | Link is a link | `-r file` | File is readable | `-w file` | File is writable | `-x file` | File is executable <br> (or accessible if argument is expression) --- ## File test example - For instance, we can test whether an input file exists and is a regular file before trying to do anything with it: ```sh $ if [ ! -f "$fastq_file" ]; then echo "Error: File not found!" exit 1 fi ``` --- ## Combining expressions - We can use the `&&` and `||` shell operators to test for multiple conditions: - If the file is not a regular file **or** the file can't be read: ```sh $ if [ ! -f "$fastq_file" ] || [ ! -r "$fastq_file" ]; then ``` - If the number of samples is more than 100 **and** at least 50: ```sh $ if [ "$n_samples" -lt 100 ] && [ "$n_samples" -ge 50 ]; then ``` --- ## Combining expressions (cont.) .content-box-info[ The `&&` and `||` shell operators can generally be used to chain commands together. - If the first command *succeeds*, then do the second: ```sh $ cd data && ls data $ git add --all && git commit -m "Add README" && git push ``` - If the first command *fails*, then do the second: ```sh $ cd "$outdir" || echo "Cannot change directory!" $ [ -d "$outdir" ] || mkdir "$outdir" ``` ] --- ## Using the *exit status* of a command as a test - If `grep` finds a match, its **exit status** is 0 which evaluates to "true", whereas if it doesn't, its exit status is 1 which evaluates to "false". We can use this to plug the `grep` command in as the test statement: ```sh $ if grep "AAACGT" "$fastq_file" > /dev/null; then # Commands to run if pattern is found: echo "Warning: contaminant detected in $fastq_file" else # Commands to run if pattern is not found: echo "Processing fastq file..." fi ``` <br> .content-box-info[ We need to disregard `grep`'s standard output (any matches), which we can do by redirecting it to non-existent file `/dev/null`. ] --- ## <i class="fa fa-user-edit"></i> Conditionals: Your turn 1. Set up an `if` statement that tests whether `template.sh` is executable. 2. If it is (`then` block), report the outcome with `echo` (e.g. "The file is executable".) 3. If it is not (`else` block), also report that outcome with `echo`. --- ## <i class="fa fa-user-edit"></i> Conditionals: Solutions ```sh $ if [ -x template.sh ]; then echo "The file is executable" else echo "The file is not executable" fi ``` --- name: basename ## Recap of the `basename` command - Running `basename` on a filename will strip any directories in its name: ```sh $ basename seqs/zmaysA_R1.fastq #> zmaysA_R1.fastq $ basename seqs/raw/fastq/zmaysA_R1.fastq #> zmaysA_R1.fastq ``` -- - You can also provide a *suffix to strip* in either of these ways: ```sh $ basename seqs/zmaysA_R1.fastq ".fastq" #> zmaysA_R1 $ basename -s ".fastq" seqs/zmaysA_R1.fastq #> zmaysA_R1 ``` -- .content-box-info[ If you instead want the directory part of the path, use `dirname`: ```sh $ dirname seqs/zmaysA_R1.fastq #> seqs ``` ] --- class: inverse middle center # Questions? ---- <br> <br> <br> <br> --- class: inverse middle center # Bonus material ---- <br> .left[ - ### [While loops (and the read command)](#while) - ### [The IFS and working with problematic file names](#ifs) - ### [Extended test syntax with [[ ]]](#extended-test) ] <br> <br> <br> <br> --- class: inverse middle center name: while # While loops (and the read command) ---- <br> <br> <br> <br> <br> --- background-color: #f2f5eb ## While loops - `while` loops are similar to `for` loops, but instead of looping over a predefined listing of items, **the loop will keep going as long as a condition is true**. (Infinite loops possible...) For instance: ```sh $ i=1 $ while [ $i -lt 10 ]; do # => While i is less than 10 echo $i i=$(expr $i + 1) done #> 1 #> 2 #> 3 #> 4 #> 5 #> 6 #> 7 #> 8 #> 9 ``` --- background-color: #f2f5eb ## While loops (cont.) In bash, `while` loops are mostly useful in combination with the `read` command, **to loop over each line in a file**. `while read -r` will be true as long as there is a line left to be read from the file, and upon each iteration of the loop, the variable contains one line: ```sh $ cat fastq_files.txt | while read -r fastq_file; do echo "Processing file: $fastq_file" # More processing... done #> Processing file: seq/zmaysA_R1.fastq #> Processing file: seq/zmaysA_R2.fastq #> Processing file: seq/zmaysB_R1.fastq ``` --- background-color: #f2f5eb ## While loops (cont.) A more elegant but perhaps confusing syntax variant used input redirection instead of `cat`-ing the file: ```sh $ while read -r fastq_file; do echo "Processing file: $fastq_file" # More processing... done < fastq_files.txt #> Processing file: seq/zmaysA_R1.fastq #> Processing file: seq/zmaysA_R2.fastq #> Processing file: seq/zmaysB_R1.fastq ``` --- background-color: #f2f5eb ## While loops (cont.) - If needed, we can also pre-process each line of the file inside the `while` loop, for instance if we need to select a specific column: ```sh $ head -n 2 samples.txt #> zmaysA R1 seq/zmaysA_R1.fastq #> zmaysA R2 seq/zmaysA_R2.fastq ``` ```sh while read -r my_line; do echo "Have read line: $my_line" fastq_file=$(echo "$my_line" | cut -f 3) echo -e "Processing file: $fastq_file \n" # More processing... done < samples.txt #> Have read line: zmaysA R1 seq/zmaysA_R1.fastq #> Processing file: seq/zmaysA_R1.fastq #> #> Have read line: zmaysA R2 seq/zmaysA_R2.fastq #> Processing file: seq/zmaysA_R2.fastq #> ``` --- background-color: #f2f5eb ## While loops (cont.) - Alternatively, we can extract columns directly: ```sh $ head -n 2 samples.txt #> zmaysA R1 seq/zmaysA_R1.fastq #> zmaysA R2 seq/zmaysA_R2.fastq ``` ```sh $ while read -r sample_name readpair_member fastq_file; do echo "Processing file: $fastq_file" # More processing... done < metadata.txt #> Processing file: seq/zmaysA_R1.fastq #> Processing file: seq/zmaysA_R2.fastq #> Processing file: seq/zmaysB_R1.fastq ``` --- class: inverse middle center name: ifs # The IFS and working with problematic file names ---- <br> <br> <br> <br> <br> --- background-color: #f2f5eb ## Setting `IFS` to work with arrays <br> with problematic filenames The problematic splitting behavior can be circumvented by setting the Internal Field Separator (`IFS`), which guides the splitting behavior: ```sh $ OFS="$IFS" # Store original field separator to restore later $ IFS="\n" # Set new IFS to ONLY newlines $ all_files=($(cat files.txt)) $ for file in "${all_files[@]}"; do echo $file done #> file_A #> file_B #> file_C #> file D $ IFS="$OFS" # Restore original IFS ``` --- background-color: #f2f5eb ## Other solutions to account for problematic filenames - Globbing over filenames with spaces does work as intended: ```sh $ ls #> file_A file_B file_C file D $ for file in file_*; do echo $file done #> file_A #> file_B #> file_C #> file D ``` .content-box-warning[ You may be tempted to use `ls` instead: `for file in $(ls file_*)`, but this will **not** work for filenames with spaces. Looping over globbing output is a shorter syntax *and* safer. ] --- background-color: #f2f5eb ## Other solutions to account for problematic filenames - `while read` will also work: ```sh while read -r my_file; do echo "Processing file: $my_file" # More processing... done < files.txt ``` <br> .content-box-info[ Moreover, if necessary, you can set the `IFS` more conveniently: ```sh while IFS="\n" read -r my_file; do ... ``` With this syntax, the shell environment won't be affected elsewhere, so you don't have to reset the `IFS`. ] --- background-color: #f2f5eb ## Extended test syntax One can also use the *extended test syntax* with double brackets. With those, you *can* use operators `&&` and `||` inside the test construct: ```sh $ [[ "$n_samples" -lt 100 && "$n_samples" -ge 50 ]] ```