Bonus Exercise: Week 13 – Snakemake


So far, we have not worked with real data in the Snakemake workflows we created. Therefore, in this bonus exercise, you’ll create a workflow that processes actual FASTQ files, which should be useful practice if you intend to use Snakemake for your final project or your own research.

We’ll use the same 6 FASTQ files that we explored in the exercises for week 5 and the second graded assignment. These 6 FASTQ files consist of 3 pairs of R1 (forward reads) and R2 (reverse reads) files for 3 biological samples that contain sequences from the V4 region of 16S rRNA, generated in a metabarcoding experiment.

In the workflow, we will perform QC with FastQC and MultiQC, remove primers with Cutadapt, and filter FASTQ files and infer Amplicon Sequence Variants (ASVs) with the R package DADA2.

1. Set-up

You can copy the data and a project skeleton from /fs/ess/PAS1855/exercises/:

cd /fs/ess/PAS1855/users/$USER/week13
cp -r /fs/ess/PAS1855/exercises/week13/bonus/initial_workflow bonus_exercise

# Go into the exercise dir:
cd bonus_exercise

# Make sure all the scripts are executable:
chmod +x scripts/*

Take a look at the files we have:

$ tree
#> .
#> ├── data
#> │   ├── 201_R1.fastq.gz
#> │   ├── 201_R2.fastq.gz
#> │   ├── 202_R1.fastq.gz
#> │   ├── 202_R2.fastq.gz
#> │   ├── 203_R1.fastq.gz
#> │   └── 203_R2.fastq.gz
#> ├── log
#> ├── scripts
#> │   ├── cutadapt.sh
#> │   ├── fastq_filter.R
#> │   ├── ASV_infer.R
#> │   └── multiqc.sh
#> ├── slurm_profile
#> │   └── config.yaml
#> └── workflow
#>     ├── config.yaml
#>     ├── envs
#>     └── Snakefile

Finally, load the Conda environment in which you have Snakemake installed:

module load python/3.6-conda5.2
source activate ipy-env

# Check if Snakemake can be found and is inside your ipy-env:
which snakemake
#> ~/.conda/envs/ipy-env/bin/snakemake
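You can also check which version of Snakemake you have (the exact version will depend on what is installed in your ipy-env environment):

snakemake --version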

Here is a DAG (Directed Acyclic Graph) of the full workflow that we will build:


2. Our initial workflow

Full Snakefile #1
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths
from os.path import join

# Settings:
IN_DIR="data"
OUT_DIR="results"
LOG_DIR="log"

PRIMER_F="GAGTGYCAGCMGCCGCGGTAA"
PRIMER_R="ACGGACTACNVGGGTWTCTAAT"

# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]

# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        "rm -rf results;"
        "rm -rf log/*"

Current Snakefile

Our current Snakefile has a couple of features we haven’t seen before:

  • We import the join() function from Python's os.path module and use it to build file paths, e.g. join(OUT_DIR, "fastqc") instead of writing out "results/fastqc".

  • Each rule has a docstring (the triple-quoted string directly below the rule line) that briefly documents what the rule does.

Also of note:

  • The localrules directive tells Snakemake to run rule all and rule clean locally rather than submitting them as cluster jobs, since they do little or no actual work.

  • rule clean has no input or output files: it simply removes all results and logs so you can rerun the workflow from scratch, and it only runs when you request it by name (as shown below).
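For example, to start over at any point, you can call rule clean by name; since it is a local rule, a plain local invocation with a single job is enough (run this from the exercise directory):

snakemake -j1 clean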

Our “profile” config.yaml file

We use a “profile” slurm_profile/config.yaml file very similar to the one in a previous Snakemake exercise:

cluster: "sbatch --account={resources.account}
                 --time={resources.time_min}
                 --mem={resources.mem_mb}
                 --cpus-per-task={resources.cpus}
                 --output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=4000, time_min=60, account=PAS1855]
latency-wait: 30
jobs: 100
use-conda: true

This file tells Snakemake how to submit jobs to the cluster, and it provides default resource values that we can override for individual rules (as we will eventually do for rule ASV_infer).

Recall that this file simply contains options that we could also pass at the command line. We can therefore also take the opportunity to specify the maximum number of jobs (which we normally do with the -j option) and ensure that Snakemake creates and uses any Conda environments specified in the Snakefile (use-conda: true; corresponds to the --use-conda option).

That way, we don’t have to specify these options at the command line when we call Snakemake, and the only option we need to pass is --profile. For this option, we specify the directory (not the file name, which Snakemake assumes to be config.yaml) in which our YAML file is found: snakemake --profile slurm_profile.
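For comparison, a rough sketch of what the equivalent call would look like if we passed these options at the command line instead of using the profile (the default resources are omitted here for brevity; they would need the --default-resources option):

snakemake -j 100 --latency-wait 30 --use-conda \
    --cluster "sbatch --account={resources.account} --time={resources.time_min} \
               --mem={resources.mem_mb} --cpus-per-task={resources.cpus} \
               --output=log/slurm-{rule}_{wildcards}.out"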


3. A config file with workflow settings

Instead of specifying settings like the input and output directories and the primer sequences directly in our Snakefile, it would be better to separate such settings from the workflow itself. Snakemake supports the use of a configuration file to do so.

Let’s create a configuration file workflow/config.yaml with the following settings:

in_dir: "data"
out_dir: "results"
log_dir: "log"

primer_f: "GAGTGYCAGCMGCCGCGGTAA"
primer_r: "ACGGACTACNVGGGTWTCTAAT"

The format of the file is YAML just like the slurm_profile/config.yaml above and the files we have used to specify and save Conda environments.

Now, we need a way to tell Snakemake about this config file. We can do so using the configfile directive in the Snakefile. After we do that, all key-value pairs from the config file will be available in the Snakefile in a Python dictionary called config:

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from the config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

Replace the 5 lines of code in the Snakefile under the # Settings: line with the lines printed directly above this box.

Then, assign the primer sequences in config.yaml to the variables PRIMER_F and PRIMER_R, respectively.

Solution

Assign the primer sequences to variables PRIMER_F and PRIMER_R:

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]
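As an aside, values from the config file can also be overridden at the command line with Snakemake's --config option, which can be handy for quick tests. For instance, to write results to a different directory for a single run (results_test is just a hypothetical example):

snakemake --profile slurm_profile --config out_dir=results_test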


Full Snakefile #2
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        "rm -rf results;"
        "rm -rf log/*"


4. A FastQC rule

Now, let’s add a rule to run FastQC:

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        """
        module load fastqc/0.11.8
        fastqc {input} -o {FASTQC_DIR} &> {log}
        """

Some things to note about this rule:

  • The rule uses two wildcards, {sample} and {read}, so it will be run once for every combination of sample and read direction, i.e. once per FASTQ file.

  • The shell command refers to {FASTQC_DIR} directly: because FASTQC_DIR is a regular Python variable rather than a wildcard, Snakemake simply substitutes its value into the command.

When we have Snakemake submit jobs to the cluster, and we also specify log files with the log directive, we will have two log files for every job:

  • The file from the log directive (here, log/fastqc_{sample}_{read}.log), which captures the standard output and standard error of the commands that the rule runs.

  • The SLURM log file specified with --output in our profile (log/slurm-{rule}_{wildcards}.out), which captures Snakemake's own output for the submitted job.

Hints

Our rule all input does not match the output of rule fastqc, so trying to run Snakemake without specifying a rule or output file(s) won’t work (recall that by default, Snakemake will run the first rule, which is rule all). We could add this output to rule all, but since we want to add a MultiQC rule in the next step, which uses the output of FastQC, that would only be temporary.

Moreover, rule fastqc contains wildcards in its {input} and {output} directives, so trying to run it using snakemake [...] fastqc would result in an error: Snakemake is not able to figure out what it should run (you can most conveniently try this with a dry-run using snakemake -n fastqc).

snakemake -n fastqc
#> WorkflowError:
#> Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.

To test whether the rule works, it is therefore most convenient to call Snakemake with an actual output file.
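For example, before submitting anything, you could first do a dry-run for one concrete output file to check that Snakemake resolves the wildcards as expected:

snakemake -n results/fastqc/201_R1_fastqc.html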

Solution
$ snakemake --profile slurm_profile \
    results/fastqc/201_R1_fastqc.html
$ ls -lh results/fastqc/
#> total 1.9M
#> -rw-r--r-- 1 jelmer PAS0471 1007K Apr 21 09:19 201_R1_fastqc.html
#> -rw-r--r-- 1 jelmer PAS0471  847K Apr 21 09:19 201_R1_fastqc.zip

# This file will show FastQC's output:  
$ cat log/fastqc_201_R1.log
#> Started analysis of 201_R1.fastq.gz
#> Approx 5% complete for 201_R1.fastq.gz
#> Approx 10% complete for 201_R1.fastq.gz
#> Approx 15% complete for 201_R1.fastq.gz
#> Approx 20% complete for 201_R1.fastq.gz
#> Approx 25% complete for 201_R1.fastq.gz
#> Approx 30% complete for 201_R1.fastq.gz
#> Approx 35% complete for 201_R1.fastq.gz
#> Approx 40% complete for 201_R1.fastq.gz
#> Approx 45% complete for 201_R1.fastq.gz
#> Approx 50% complete for 201_R1.fastq.gz
#> Approx 55% complete for 201_R1.fastq.gz
#> Approx 60% complete for 201_R1.fastq.gz
#> Approx 65% complete for 201_R1.fastq.gz
#> Approx 70% complete for 201_R1.fastq.gz
#> Approx 75% complete for 201_R1.fastq.gz
#> Approx 80% complete for 201_R1.fastq.gz
#> Approx 85% complete for 201_R1.fastq.gz
#> Approx 90% complete for 201_R1.fastq.gz
#> Approx 95% complete for 201_R1.fastq.gz
#> Analysis complete for 201_R1.fastq.gz

$ cat log/slurm-fastqc_read\=R1\,sample\=201.out
#> Building DAG of jobs...
#> Using shell: /usr/bin/bash
#> Provided cores: 1 (use --cores to define parallelism)
#> Rules claiming more threads will be scaled down.
#> Job counts:
#>         count   jobs
#>         1       fastqc
#>         1
#> 
#> [Wed Apr 21 09:19:25 2021]
#> rule fastqc:
#>     input: data/201_R1.fastq.gz
#>     output: results/fastqc/201_R1_fastqc.html
#>     log: log/fastqc_201_R1.log
#>     jobid: 0
#>     wildcards: sample=201, read=R1
#>     resources: cpus=1, mem_mb=4000, time_min=60, account=PAS1855
#> 
#> Activating conda environment:
#> /fs/ess/PAS1855/data/week13/exercise_bonus/step2/.snakemake/conda/7e57eac489af13d2bb58773ba0fd9c38
#> [Wed Apr 21 09:19:36 2021]
#> Finished job 0.
#> 1 of 1 steps (100%) done


Full Snakefile #3
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        "rm -rf results;"
        "rm -rf log/*"

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    conda:
        "envs/fastqc.yaml"
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        "fastqc {input} -o {FASTQC_DIR} &> {log}"


5. A MultiQC rule

We will add a rule that runs MultiQC, which summarizes multiple FastQC reports into a single HTML file.

We will run a shell script that in turn runs MultiQC, rather than running MultiQC directly in the shell directive like we did for FastQC.

The main reason for this is an annoyance with installing MultiQC through Conda: when using a YAML file or attempting a one-step installation, Conda will fail. Instead, we need to do a two-step installation: first, we install Python, and then, we install MultiQC. The scripts/multiqc.sh script that rule multiqc runs will create this environment if it doesn’t already exist, before running MultiQC.

Show the script scripts/multiqc.sh
#!/bin/bash

set -e -u -o pipefail

module load python/3.6-conda5.2

# Process command-line args:
in_dir=$1
out_dir=$2

# Report:
echo "# This script will run MultiQC."
echo "# Input directory: $in_dir"
echo -e "# Output directory: $out_dir \n" 

# Activate/install and activate MultiQC conda environment:
if [[ $(conda env list) = *"multiqc-env"* ]]; then
    echo "# Activating existing Conda environment multiqc-env"
    source activate multiqc-env
else
    echo "# Installing MultiQC in Conda environment multiqc-env"
    conda create -y -n multiqc-env python=3.7
    source activate multiqc-env
    conda install -y -c bioconda -c conda-forge multiqc
fi

# Run MultiQC:
echo -e "\n# Running MultiQC..."
multiqc "$in_dir" -o "$out_dir"

echo -e "\n# Done with script."
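Because the script takes ordinary positional arguments (an input dir and an output dir), you could also test it outside of Snakemake, e.g. (assuming FastQC output already exists in results/fastqc):

scripts/multiqc.sh results/fastqc results/multiqc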


Hints

  • Use the expand() function in the input directive. Besides expanding the possible values for SAMPLES and READS, you will also need to include the FastQC output dir here, which is saved in the variable FASTQC_DIR.

  • In the shell directive, you will call the script scripts/multiqc.sh. Take a look at the script to see what arguments it takes, and specify those.

  • Like in rule fastqc, send standard output and standard error from the script to {log}.

Solutions
rule multiqc:
    """
    A rule to run MultiQC to collate the FastQC results.
    """
    input:
        expand(
            "{fastqc_dir}/{sample}_{read}_fastqc.html",
            fastqc_dir=FASTQC_DIR,
            sample=SAMPLES,
            read=READS,
        ),
    output:
        join(MULTIQC_DIR, "multiqc_report.html"),
    log:
        join(LOG_DIR, "multiqc.log"),
    shell:
        "scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"


Solutions

rule fastqc will also be run for the remaining 5 FASTQ files, since the input of rule multiqc is the output of rule fastqc.

$ snakemake --profile slurm_profile multiqc
$ ls -lh results/multiqc/
#> total 1.2M
#> drwxr-xr-x 2 jelmer PAS0471 4.0K Apr 21 10:38 multiqc_data
#> -rw-r--r-- 1 jelmer PAS0471 1.2M Apr 21 10:38 multiqc_report.html

$ cat log/multiqc.log
#> # This script will run MultiQC.
#> # Input directory: results/fastqc
#> # Output directory: results/multiqc 
#> 
#> # Activating existing Conda environment multiqc-env
#> 
#> # Running MultiQC...
#> [WARNING]         multiqc : MultiQC Version v1.10.1 now available!
#> [INFO   ]         multiqc : This is MultiQC v1.10
#> [INFO   ]         multiqc : Template    : default
#> [INFO   ]         multiqc : Searching   : /fs/project/PAS0471/teach/courses/pracs-sp21/week13/exercise_bonus/step3_multiqc/results/fastqc
#> Searching   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 12/12
#> [INFO   ]          fastqc : Found 6 reports
#> [INFO   ]         multiqc : Compressing plot data
#> [INFO   ]         multiqc : Report      : results/multiqc/multiqc_report.html
#> [INFO   ]         multiqc : Data        : results/multiqc/multiqc_data
#> [INFO   ]         multiqc : MultiQC complete
#> 
#> # Done with script.


Full Snakefile #4
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        """
        rm -rf results
        rm -rf log/*
        """

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        """
        module load fastqc/0.11.8
        fastqc {input} -o {FASTQC_DIR} &> {log}
        """

rule multiqc:
    """
    A rule to run MultiQC to collate the FastQC results.
    """
    input:
        expand(
            "{fastqc_dir}/{sample}_{read}_fastqc.html",
            fastqc_dir=FASTQC_DIR,
            sample=SAMPLES,
            read=READS,
        ),
    output:
        join(MULTIQC_DIR, "multiqc_report.html"),
    log:
        join(LOG_DIR, "multiqc.log"),
    shell:
        "scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"


6. A Cutadapt rule

Next, we will add a rule to run Cutadapt, which will remove the primer sequences from our reads.

rule cutadapt:
    """
    A rule to run Cutadapt to remove primers from the FASTQ files.
    """
    conda:
        "envs/cutadapt.yaml"
    input:
        fwd=join(IN_DIR, "{sample}_R1.fastq.gz"),
        rev=join(IN_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    log:
        join(LOG_DIR, "cutadapt_{sample}.log")
    shell:
        "scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"

Some things to note about this rule:

  • Cutadapt needs the R1 and R2 files of a sample together, so the rule has named inputs and outputs (fwd and rev), and {sample} is its only wildcard.

  • The shell command only passes {input.fwd} to the script: the script infers the name of the corresponding R2 file from the R1 file name.

  • The primer sequences (PRIMER_F and PRIMER_R), which we read from the config file, are passed to the script as arguments, and the script computes their reverse complements.

  • The rule has a conda directive pointing to envs/cutadapt.yaml, so Snakemake will create and use that Conda environment, provided we run it with --use-conda (which we have set in our profile).

The script scripts/cutadapt.sh

#!/bin/bash
set -e -u -o pipefail

# This script will run Cutadapt on a pair (R1+R2) of FASTQ files,
# removing primer sequences.
# Forward and reverse primer sequences are passed as argument to the script,
# and the script will compute the reverse complements of each.

# SETUP --------------------------------------------------------
R1_in="$1"          # Path to the file with forward (R1) reads (R1 filename is inferred from  this)
outdir="$2"         # Output directory
primer_f="$3"       # Sequence of the forward primer
primer_r="$4"       # Sequence of the reverse primer

# Test if the correct number of args were passed to the script:
if [ ! "$#" = 4 ]; then
    echo "Error: must provide 4 arguments; you provided $#"
    echo "Usage: cutadapt_single.sh <input-FASTQ-R1> <outdir> <primer_f> <primer_r>"
    exit 1
fi

# Get reverse primer sequences by reverse complementing:
primer_f_rc=$(echo "$primer_f" | tr ATCGYRKMBDHV TAGCRYMKVHDB | rev)
primer_r_rc=$(echo "$primer_r" | tr ATCGYRKMBDHV TAGCRYMKVHDB | rev)

# Infer R2 (reverse) filename: R1 and R2 are assumed to have the same name
# except for "_R1" vs "_R2":
R2_in=${R1_in/_R1/_R2}

# Check that the input dir is not the same as the output dir,
# or the input files will be overwritten:
indir=$(dirname "$R1_in")
if [ "$indir" = "$outdir" ]; then
    echo "## ERROR: Input dir should not be the same as output dir" >&2
    exit 1
fi

# Output files:
R1_out="$outdir"/$(basename "$R1_in" .fastq.gz).fastq.gz
R2_out="$outdir"/$(basename "$R2_in" .fastq.gz).fastq.gz

# Report:
echo -e "\n## Starting cutadapt script."
date
echo
echo "## Using the following parameters:"
echo "## Input file R1: $R1_in"
echo "## Output dir: $outdir"
echo "## Forward primer: $primer_f"
echo "## Reverse primer: $primer_r"
echo
echo "## Input file R2: $R2_in"
echo "## Output file R1: $R1_out"
echo "## Output file R2: $R2_out"
echo
echo "## Reverse complement of forward primer: $primer_f_rc"
echo "## Reverse complement of reverse primer: $primer_r_rc"
echo

# Test:
if [ ! -f "$R1_in" ] || [ ! -r "$R1_in" ]; then
    echo -e "\n## $0: ERROR: Input file R1_in, $R1_in, not found\n" >&2 && exit 1
fi
if [ ! -f "$R2_in" ] || [ ! -r "$R2_in" ]; then
    echo -e "\n## $0: ERROR: Input file R2_in, $R2_in, not found\n" >&2 && exit 1
fi

# Create output directory if it doesn't already exist:
mkdir -p "$outdir"

# RUN CUTADAPT --------------------------------------------------------
echo -e "\n\n## Running cutadapt..."

cutadapt -a "$primer_f"..."$primer_r_rc" -A "$primer_r"..."$primer_f_rc" \
    --discard-untrimmed --pair-filter=any \
    -o "$R1_out" -p "$R2_out" "$R1_in" "$R2_in"

# Options:
# "-a"/"-A": Primers for R1/R2
# "--discard-untrimmed": Remove pairs with no primer found
# "--pair-filter=any": Remove pair if one read is filtered (=Default)

# REPORT AND FINALIZE --------------------------------------------------------
echo -e "\n## Listing output files:"
ls -lh "$outdir"

echo -e "\n## Done with cutadapt script."
date
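The script can also be run on its own, outside of Snakemake, as long as cutadapt itself is available (e.g. in an activated Conda environment that has it). A sketch, using the primers from our config file:

scripts/cutadapt.sh data/201_R1.fastq.gz results/cutadapt \
    GAGTGYCAGCMGCCGCGGTAA ACGGACTACNVGGGTWTCTAAT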


Hints

  • Like with rule fastqc, you’ll need to specify an individual output file.

Solutions
$ snakemake --profile slurm_profile results/cutadapt/201_R1.fastq.gz
#> Building DAG of jobs...
#> Creating conda environment workflow/envs/cutadapt.yaml...
#> Downloading and installing remote packages.
#> [...]
$ ls -lh results/cutadapt/
#> -rw-r--r-- 1 jelmer PAS0471 6.0M Apr 21 11:25 201_R1.fastq.gz
#> -rw-r--r-- 1 jelmer PAS0471 7.1M Apr 21 11:25 201_R2.fastq.gz

$ less log/cutadapt_201.log
#> ## Starting cutadapt script.
#> Wed Apr 21 11:25:00 EDT 2021
#> 
#> ## Using the following parameters:
#> ## Input file R1: data/201_R1.fastq.gz
#> ## Output dir: results/cutadapt
#> ## Forward primer: GAGTGYCAGCMGCCGCGGTAA
#> ## Reverse primer: ACGGACTACNVGGGTWTCTAAT
#> 
#> ## Input file R2: data/201_R2.fastq.gz
#> ## Output file R1: results/cutadapt/201_R1.fastq.gz
#> ## Output file R2: results/cutadapt/201_R2.fastq.gz
#> 
#> ## Reverse complement of forward primer: TTACCGCGGCKGCTGRCACTC
#> ## Reverse complement of reverse primer: ATTAGAWACCCBNGTAGTCCGT
#> 
#> 
#> ## Running cutadapt...
#> This is cutadapt 3.4 with Python 3.9.2
#> Command line parameters: -a GAGTGYCAGCMGCCGCGGTAA...ATTAGAWACCCBNGTAGTCCGT -A ACGGACTACNVGGGTWTCTAAT...TTACCGCGGCKGCTGRCACTC --discard-untrimmed --pair-filter=any -o results/cutadapt/201_R1.fastq.gz -p results/cutadapt/201_R2.fastq.gz data/201_R1.fastq.gz data/201_R2.fastq.gz
#> Processing reads on 1 core in paired-end mode ...
#> Finished in 4.84 s (103 µs/read; 0.58 M reads/minute).
#> 
#> === Summary ===
#> 
#> Total read pairs processed:             47,070
#>   Read 1 with adapter:                  47,061 (100.0%)
#>   Read 2 with adapter:                  46,675 (99.2%)
#> Pairs written (passing filters):        46,671 (99.2%)


Full Snakefile #5
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        """
        rm -rf results
        rm -rf log/*
        """

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        """
        module load fastqc/0.11.8
        fastqc {input} -o {FASTQC_DIR} &> {log}
        """

rule multiqc:
    """
    A rule to run MultiQC to collate the FastQC results.
    """
    input:
        expand(
            "{fastqc_dir}/{sample}_{read}_fastqc.html",
            fastqc_dir=FASTQC_DIR,
            sample=SAMPLES,
            read=READS,
        ),
    output:
        join(MULTIQC_DIR, "multiqc_report.html"),
    log:
        join(LOG_DIR, "multiqc.log"),
    shell:
        "scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"

rule cutadapt:
    """
    A rule to run Cutadapt to remove primers from the FASTQ files.
    """
    conda:
        "envs/cutadapt.yaml"
    input:
        fwd=join(IN_DIR, "{sample}_R1.fastq.gz"),
        rev=join(IN_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    log:
        join(LOG_DIR, "cutadapt_{sample}.log")
    shell:
        "scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"


7. A DADA2 rule to filter FASTQs

We will run the script scripts/fastq_filter.R to filter and trim our Cutadapt-processed FASTQ files (this will remove poor-quality reads and bases).

This script takes two arguments:

  1. The name of the file with forward reads (in our case, as output by rule cutadapt).

  2. The output directory.

The script processes one pair of FASTQ files, like the Cutadapt rule. It outputs a pair of unzipped FASTQ files with the same names as the input files, just in a different directory (FILTER_DIR, in our case) and with a .fastq instead of a .fastq.gz extension.

The full script scripts/fastq_filter.R

#!/usr/bin/env Rscript

# This script will run the DADA2 function "filterAndTrim" to remove poor-quality
# reads and bases from FASTQ files by filtering and trimming them.
# The script has as input a pair (F and R reads for one sample) of,
# optionally gzipped, FASTQ files, and as output a pair of not-zipped FASTQ files. 

# Set-up -----------------------------------------------------------------------

# Install and/or load packages:
if(!require(pacman)) {
  install.packages("pacman", repos = "http://cran.us.r-project.org")
  library(pacman)
}
packages <- c('BiocManager', 'dada2')
pacman::p_load(char = packages)

# Get command-line args:
args <- commandArgs(trailingOnly = TRUE)
R1_in <- as.character(args[1])
outdir <- as.character(args[2])

# Determine R2 file name - assumed to be same as R1 but with "R2" instead:
R2_in <- sub("_R1", "_R2", R1_in)

# Determine sample ID - file name up until "_R1":
sampleID <- sub("_R1.*", "", basename(R1_in))

# Set output file paths:
R1_out <- file.path(outdir, paste0(sampleID, '_R1.fastq'))
R2_out <- file.path(outdir, paste0(sampleID, '_R2.fastq'))

# Report:
cat("Output dir:", outdir, "\n\n")
cat("R1_in:", R1_in, "\n")
cat("R2_in:", R2_in, "\n")
cat("R1_out:", R1_out, "\n")
cat("R2_out:", R2_out, "\n")


# Filtering and Quality Trimming -----------------------------------------------

print('Filtering and Trimming...')
Sys.time()  # Print the time to keep track of running time for individual steps

# Run the DADA2 function to filter and trim FASTQ files
filter_results <-
  filterAndTrim(
      R1_in, R1_out,          # Forward (F) reads input/output
      R2_in, R2_out,          # Reverse (R) reads input/output
      truncLen = c(250,210),  # Truncate F at 250 and R at 210 bp (remove last bases)
      trimLeft = 10,          # Remove first 10 bases
      maxN = 0,               # Don't allow for any Ns
      maxEE = c(2,2),         # Max. expected nr of errors of 2 in F and 2 in R
      truncQ = 2,             # Truncate when base quality score is 2 or less
      rm.phix = FALSE,        # Don't remove Phi-X
      multithread = 1,        # No multithreading
      compress = FALSE,       # Don't compress output files
      verbose = TRUE          # Verbose runtime info
  )

print('...Done!')
Sys.time()

print('Done with script fastq_filter.R')
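Like the other scripts, fastq_filter.R can be tested on its own. A sketch, assuming rule cutadapt has already produced results/cutadapt/201_R1.fastq.gz (the output dir is created first, in case the script expects it to exist):

module load R/4.0.2-gnu9.1
mkdir -p results/fastq_filter
scripts/fastq_filter.R results/cutadapt/201_R1.fastq.gz results/fastq_filter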


Solutions

rule fastq_filter:
    """
    A rule to run an R script that contains DADA2 functions to filter and trim FASTQ files.
    """
    input:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(FILTER_DIR, "{sample}_R1.fastq"),
        rev=join(FILTER_DIR, "{sample}_R2.fastq"),
    log:
        join(LOG_DIR, "fastq_filter_{sample}.log")
    shell:
        """
        module load R/4.0.2-gnu9.1
        scripts/fastq_filter.R {input.fwd} {FILTER_DIR} &> {log}
        """
$ snakemake --profile slurm_profile \
    results/fastq_filter/201_R1.fastq
$ ls -lh results/fastq_filter/
#> total 42M
#> -rw-r--r-- 1 jelmer PAS0471 23M Apr 21 11:47 201_R1.fastq
#> -rw-r--r-- 1 jelmer PAS0471 19M Apr 21 11:47 201_R2.fastq


Full Snakefile #6
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths:
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        """
        rm -rf results
        rm -rf log/*
        """

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        """
        module load fastqc/0.11.8
        fastqc {input} -o {FASTQC_DIR} &> {log}
        """

rule multiqc:
    """
    A rule to run MultiQC to collate the FastQC results.
    """
    input:
        expand(
            "{fastqc_dir}/{sample}_{read}_fastqc.html",
            fastqc_dir=FASTQC_DIR,
            sample=SAMPLES,
            read=READS,
        ),
    output:
        join(MULTIQC_DIR, "multiqc_report.html"),
    log:
        join(LOG_DIR, "multiqc.log"),
    shell:
        "scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"

rule cutadapt:
    """
    A rule to run Cutadapt to remove primers from the FASTQ files.
    """
    conda:
        "envs/cutadapt.yaml"
    input:
        fwd=join(IN_DIR, "{sample}_R1.fastq.gz"),
        rev=join(IN_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    log:
        join(LOG_DIR, "cutadapt_{sample}.log")
    shell:
        "scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"

rule fastq_filter:
    """
    A rule to run an R script that contains DADA2 functions to filter and trim FASTQ files.
    """
    input:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(FILTER_DIR, "{sample}_R1.fastq"),
        rev=join(FILTER_DIR, "{sample}_R2.fastq"),
    log:
        join(LOG_DIR, "fastq_filter_{sample}.log")
    shell:
        """
        module load R/4.0.2-gnu9.1
        scripts/fastq_filter.R {input.fwd} {FILTER_DIR} &> {log}
        """


8. A DADA2 rule to infer ASVs

Finally, we will create a rule that will run the script scripts/ASV_infer.R, which takes as its input all filtered FASTQ files at once, and outputs a file containing a matrix of counts of ASVs (Amplicon Sequence Variants) for each sample.

The full script scripts/ASV_infer.R
#!/usr/bin/env Rscript

# This script will run several steps of the DADA2 pipeline to infer
# Amplicon Sequence Variants (ASVs) and their abundance in a set of samples.
# The input is a set of FASTQ files from multiple samples
# (as automatically detected given an input dir, which is one of the arguments),
# and the output is a single table with ASV counts per sample.


# Set-up ----------------------------------------------------------------------

args <- commandArgs(trailingOnly = TRUE)
indir <- as.character(args[1])           # Input dir. Input files will be detected.
outfile <- as.character(args[2])         # Output file: an ASV table.
n_cores <- as.integer(args[3])           # Number of cores to use.

# Install and/or load packages:
if(!require(pacman)) {
  install.packages("pacman", repos = "http://cran.us.r-project.org")
  library(pacman)
}
packages <- c('BiocManager', 'dada2')
pacman::p_load(char = packages)

# Detect input files given the input directory passed as an argument:
fastqs_filt_F <- sort(list.files(indir, pattern = '_R1.fastq', full.names = TRUE))
fastqs_filt_R <- sort(list.files(indir, pattern = '_R2.fastq', full.names = TRUE))

# Infer Sample IDs from FASTQ file names:
sampleIDs <- sub("_R1.*", "", basename(fastqs_filt_F))

# Report:
cat("Input dir:", indir, "\n")
cat("Output file:", outfile, "\n")
cat("Number of threads/cores:", n_cores, "\n")
cat("Sample IDs:", sampleIDs, "\n")


# Dereplication ----------------------------------------------------------------

print('Dereplicating FASTQs...')
Sys.time() # Keep track of time to see how long each step takes

fastqs_derep_F <- derepFastq(fastqs_filt_F, verbose = FALSE)
fastqs_derep_R <- derepFastq(fastqs_filt_R, verbose = FALSE)

names(fastqs_derep_F) <- sampleIDs
names(fastqs_derep_R) <- sampleIDs

print('...Done!')
Sys.time()


# Error training  -------------------------------------------------------------

print('Learning errors...')

err_F <- learnErrors(fastqs_derep_F, multithread = n_cores, verbose = TRUE)
err_R <- learnErrors(fastqs_derep_R, multithread = n_cores, verbose = TRUE)

print('...Done!')
Sys.time()


# Infer ASVs  -------------------------------------------------------------

print('Inferring ASVs (running the dada algorithm)...')

dada_Fs <- dada(fastqs_derep_F, err = err_F,
                pool = FALSE, multithread = n_cores)
dada_Rs <- dada(fastqs_derep_R, err = err_R,
                pool = FALSE, multithread = n_cores)

print('...Done!')
Sys.time()


# Merge read pairs ------------------------------------------------------------

print('Merging read pairs...')

mergers <- mergePairs(dada_Fs, fastqs_derep_F,
                      dada_Rs, fastqs_derep_R,
                      verbose = TRUE)


# Construct a sequence table --------------------------------------------------

print('Constructing a sequence table...')

seqtab_all <- makeSequenceTable(mergers)


# Remove chimeras ------------------------------------------------------------

print('Remove chimeras...')

seqtab <- removeBimeraDenovo(seqtab_all,
                             method = 'consensus',
                             multithread = n_cores,
                             verbose = TRUE)

# Write to file ----------------------------------------------------------------

print('Write table to file...')

write.table(seqtab, outfile,
            sep = "\t", row.names = TRUE, quote = FALSE)

print('Done with script ASV_infer.R')


For this rule, we will tinker a bit with the resource request to SLURM. This script is more computationally intensive than the previous ones, which makes sense because it processes all files at once.

Specifically, the DADA2 functions that we run in the script take a multithread argument, specifying the number of cores/threads that can be used. Therefore, scripts/ASV_infer.R also takes such an argument, which is passed on to the DADA2 functions.

How can we make sure Snakemake allocates 8 cores/threads for this job?

Solution.

By using a threads directive in the rule:

threads: 8

And how can we make sure Snakemake passes a request of 8 cores/threads to SLURM for this job?

Solution.

By using a resources directive in the rule, with a key cpus that matches what we have specified in slurm_profile/config.yaml:

resources: cpus=8

Here is the relevant matching part of slurm_profile/config.yaml:

cluster: "sbatch --account={resources.account}
                 --time={resources.time_min}
                 --mem={resources.mem_mb}
                 --cpus-per-task={resources.cpus}
                 --output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=4000, time_min=60, account=PAS1855]

Additionally, we will request an appropriate amount of memory: each core has about 4 GB of memory, so 8 * 4 = 32 GB makes sense here (because our default resource request is 4000 MB, we would otherwise be requesting only 4 GB in total, even with 8 cores). Our full resources directive will therefore look like this:

resources: cpus=8, mem_mb=32000
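Based on the cluster template in slurm_profile/config.yaml, the ASV_infer job would then be submitted with sbatch options along these lines (a sketch only: Snakemake actually submits a job script it generates itself, shown here as a placeholder, and {wildcards} is empty for this rule):

sbatch --account=PAS1855 --time=60 --mem=32000 --cpus-per-task=8 \
    --output=log/slurm-ASV_infer_.out <Snakemake-generated job script>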

The script scripts/ASV_infer.R takes three arguments, in the following order:

  1. An input directory – i.e., the output dir of the previous rule.

  2. An output filename – you can tell from rule all what that should be.

  3. The number of cores/threads for the job – which you can pass using {threads}.

Hints

  • Because this is the final rule in the workflow, whose output is specified as the input of rule all, running Snakemake without specifying a target rule or output file will do the job.

  • Note that the previous two rules will also automatically be run for the two remaining samples.

Solutions

rule ASV_infer:
    """
    A rule to run an R script that contains DADA2 functions to infer ASVs,
    and produce a matrix with ASV counts for each sample.
    """
    threads: 8
    resources: cpus=8, mem_mb=32000
    input:
        expand(
            "{filter_dir}/{sample}_{read}.fastq",
            sample=SAMPLES,
            read=READS,
            filter_dir=FILTER_DIR,
        ),
    output:
        join(ASV_DIR, "ASV_table.tsv")
    log:
        join(LOG_DIR, "ASV_infer.log")
    shell:
        """
        module load R/4.0.2-gnu9.1
        scripts/ASV_infer.R {FILTER_DIR} {output} {threads} &> {log}
        """
$ snakemake --profile slurm_profile
$ ls -lh results/ASV_infer/
#> -rw-r--r-- 1 jelmer PAS0471 426K Apr 21 12:57 ASV_table.tsv

Let’s see an overview of all output files of the workflow:

$ tree results
#> results
#> ├── ASV_infer
#> │   └── ASV_table.tsv
#> ├── cutadapt
#> │   ├── 201_R1.fastq.gz
#> │   ├── 201_R2.fastq.gz
#> │   ├── 202_R1.fastq.gz
#> │   ├── 202_R2.fastq.gz
#> │   ├── 203_R1.fastq.gz
#> │   └── 203_R2.fastq.gz
#> ├── fastqc
#> │   ├── 201_R1_fastqc.html
#> │   ├── 201_R1_fastqc.zip
#> │   ├── 201_R2_fastqc.html
#> │   ├── 201_R2_fastqc.zip
#> │   ├── 202_R1_fastqc.html
#> │   ├── 202_R1_fastqc.zip
#> │   ├── 202_R2_fastqc.html
#> │   ├── 202_R2_fastqc.zip
#> │   ├── 203_R1_fastqc.html
#> │   ├── 203_R1_fastqc.zip
#> │   ├── 203_R2_fastqc.html
#> │   └── 203_R2_fastqc.zip
#> ├── fastq_filter
#> │   ├── 201_R1.fastq
#> │   ├── 201_R2.fastq
#> │   ├── 202_R1.fastq
#> │   ├── 202_R2.fastq
#> │   ├── 203_R1.fastq
#> │   └── 203_R2.fastq
#> └── multiqc
#>     ├── multiqc_data
#>     │   ├── multiqc_data.json
#>     │   ├── multiqc_fastqc.txt
#>     │   ├── multiqc_general_stats.txt
#>     │   ├── multiqc.log
#>     │   └── multiqc_sources.txt
#>     └── multiqc_report.html

Our final Snakefile

(You can also find the final workflow files at /fs/ess/PAS1855/exercises/week13/bonus/final_workflow.)
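If you would like to recreate a DAG figure of the full workflow like the one shown at the start of this exercise, you can have Snakemake print the graph in Graphviz format and render it with dot (assuming Graphviz is available, e.g. through Conda or a module):

snakemake --dag | dot -Tsvg > dag.svg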

"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths:
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        """
        rm -rf results
        rm -rf log/*
        """

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        """
        module load fastqc/0.11.8
        fastqc {input} -o {FASTQC_DIR} &> {log}
        """

rule multiqc:
    """
    A rule to run MultiQC to collate the FastQC results.
    """
    input:
        expand(
            "{fastqc_dir}/{sample}_{read}_fastqc.html",
            fastqc_dir=FASTQC_DIR,
            sample=SAMPLES,
            read=READS,
        ),
    output:
        join(MULTIQC_DIR, "multiqc_report.html"),
    log:
        join(LOG_DIR, "multiqc.log"),
    shell:
        "scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"

rule cutadapt:
    """
    A rule to run Cutadapt to remove primers from the FASTQ files.
    """
    conda:
        "envs/cutadapt.yaml"
    input:
        fwd=join(IN_DIR, "{sample}_R1.fastq.gz"),
        rev=join(IN_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    log:
        join(LOG_DIR, "cutadapt_{sample}.log")
    shell:
        "scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"

rule fastq_filter:
    """
    A rule to run an R script that contains DADA2 functions to filter and trim FASTQ files.
    """
    input:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(FILTER_DIR, "{sample}_R1.fastq"),
        rev=join(FILTER_DIR, "{sample}_R2.fastq"),
    log:
        join(LOG_DIR, "fastq_filter_{sample}.log")
    shell:
        """
        module load R/4.0.2-gnu9.1
        scripts/fastq_filter.R {input.fwd} {FILTER_DIR} &> {log}
        """

rule ASV_infer:
    """
    A rule to run an R script that contains DADA2 functions to infer ASVs,
    and produce a matrix with ASV counts for each sample.
    """
    threads: 8
    resources: cpus=8, mem_mb=32000
    input:
        expand(
            "{filter_dir}/{sample}_{read}.fastq",
            sample=SAMPLES,
            read=READS,
            filter_dir=FILTER_DIR,
        ),
    output:
        join(ASV_DIR, "ASV_table.tsv")
    log:
        join(LOG_DIR, "ASV_infer.log")
    shell:
        """
        module load R/4.0.2-gnu9.1
        scripts/ASV_infer.R {FILTER_DIR} {output} {threads} &> {log}
        """




Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".