Bonus Exercise: Week 13 – Snakemake


So far, we have not worked with real data in the Snakemake workflows we created. Therefore, in this bonus exercise, you’ll create a workflow that processes actual FASTQ files, which should be useful practice if you intend to use Snakemake for your final project or your own research.

We’ll use the same 6 FASTQ files that we explored in the exercises for week 5 and the second graded assignment. These 6 FASTQ files consist of 3 pairs of R1 (forward reads) and R2 (reverse reads) files for 3 biological samples that contain sequences from the V4 region of 16S rRNA, generated in a metabarcoding experiment.

In the workflow, we will perform QC with FastQC and MultiQC, remove primers with Cutadapt, and filter FASTQ files and infer Amplicon Sequence Variants (ASVs) with the R package DADA2.

1. Set-up

You can copy the data and a project skeleton from /fs/ess/PAS1855/exercises/:

cd /fs/ess/PAS1855/users/$USER/week13
cp -r /fs/ess/PAS1855/exercises/week13/bonus/initial_workflow bonus_exercise

# Go into the exercise dir:
cd bonus_exercise

# Make sure all the scripts are executable:
chmod +x scripts/*

Take a look at the files we have:

$ tree
#> .
#> ├── data
#> │   ├── 201_R1.fastq.gz
#> │   ├── 201_R2.fastq.gz
#> │   ├── 202_R1.fastq.gz
#> │   ├── 202_R2.fastq.gz
#> │   ├── 203_R1.fastq.gz
#> │   └── 203_R2.fastq.gz
#> ├── log
#> ├── scripts
#> │   ├── cutadapt.sh
#> │   ├── fastq_filter.R
#> │   ├── ASV_infer.R
#> │   └── multiqc.sh
#> ├── slurm_profile
#> │   └── config.yaml
#> └── workflow
#>     ├── config.yaml
#>     ├── envs
#>     └── Snakefile

Finally, load the Conda environment in which you have Snakemake installed:

module load python/3.6-conda5.2
source activate ipy-env

# Check if Snakemake can be found and is inside your ipy-env:
which snakemake
#> ~/.conda/envs/ipy-env/bin/snakemake
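You can also check which version of Snakemake you have (the exact version will depend on what is installed in your ipy-env environment):

snakemake --version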

Here is a DAG (Directed Acyclic Graph) of the full workflow that we will build:


2. Our initial workflow

Full Snakefile #1
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths
from os.path import join

# Settings:
IN_DIR="data"
OUT_DIR="results"
LOG_DIR="log"

PRIMER_F="GAGTGYCAGCMGCCGCGGTAA"
PRIMER_R="ACGGACTACNVGGGTWTCTAAT"

# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]

# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        "rm -rf results;"
        "rm -rf log/*"

Current Snakefile

Our current Snakefile has a couple of features we haven’t seen before:

  • We import the join() function from Python's os.path module and use it to build file paths, e.g. join(OUT_DIR, "fastqc") instead of writing out "results/fastqc".

  • Each rule has a docstring (the triple-quoted string directly below the rule line) that briefly documents what the rule does.

Also of note:

  • The localrules directive tells Snakemake to run rule all and rule clean locally rather than submitting them as cluster jobs, since they do little or no actual work.

  • rule clean has no input or output files: it simply removes all results and logs so you can rerun the workflow from scratch, and it only runs when you request it by name (as shown below).
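For example, to start over at any point, you can call rule clean by name; since it is a local rule, a plain local invocation with a single job is enough (run this from the exercise directory):

snakemake -j1 clean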

Our “profile” config.yaml file

We use a “profile” slurm_profile/config.yaml file very similar to the one in a previous Snakemake exercise:

cluster: "sbatch --account={resources.account}
                 --time={resources.time_min}
                 --mem={resources.mem_mb}
                 --cpus-per-task={resources.cpus}
                 --output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=4000, time_min=60, account=PAS1855]
latency-wait: 30
jobs: 100
use-conda: true

This file tells Snakemake how to submit jobs to the cluster, and it provides default resource values that we can override for individual rules (as we will eventually do for rule ASV_infer).

Recall that this file simply contains options that we could also pass at the command line. We can therefore also take the opportunity to specify the maximum number of jobs (which we normally do with the -j option) and ensure that Snakemake creates and uses any Conda environments specified in the Snakefile (use-conda: true; corresponds to the --use-conda option).

That way, we don’t have to specify these options at the command line when we call Snakemake, and the only option we need to pass is --profile. For this option, we specify the directory (not the file name, which Snakemake assumes to be config.yaml) in which our YAML file is found: snakemake --profile slurm_profile.
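For comparison, a rough sketch of what the equivalent call would look like if we passed these options at the command line instead of using the profile (the default resources are omitted here for brevity; they would need the --default-resources option):

snakemake -j 100 --latency-wait 30 --use-conda \
    --cluster "sbatch --account={resources.account} --time={resources.time_min} \
               --mem={resources.mem_mb} --cpus-per-task={resources.cpus} \
               --output=log/slurm-{rule}_{wildcards}.out"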


3. A config file with workflow settings

Instead of specifying settings like the input and output directories and the primer sequences directly in our Snakefile, it would be better to separate such settings from the workflow itself. Snakemake supports the use of a configuration file to do so.

Let’s create a configuration file workflow/config.yaml with the following settings:

in_dir: "data"
out_dir: "results"
log_dir: "log"

primer_f: "GAGTGYCAGCMGCCGCGGTAA"
primer_r: "ACGGACTACNVGGGTWTCTAAT"

The format of the file is YAML just like the slurm_profile/config.yaml above and the files we have used to specify and save Conda environments.

Now, we need a way to tell Snakemake about this config file. We can do so using the configfile directive in the Snakefile. After we do that, all key-value pairs from the config file will be available in the Snakefile in a Python dictionary called config:

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from the config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

Replace the 5 lines of code in the Snakefile under the # Settings: line with the lines printed directly above this box.

Then, assign the primer sequences in config.yaml to the variables PRIMER_F and PRIMER_R, respectively.

Solution

Assign the primer sequences to variables PRIMER_F and PRIMER_R:

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]
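As an aside, values from the config file can also be overridden at the command line with Snakemake's --config option, which can be handy for quick tests. For instance, to write results to a different directory for a single run (results_test is just a hypothetical example):

snakemake --profile slurm_profile --config out_dir=results_test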


Full Snakefile #2
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        "rm -rf results;"
        "rm -rf log/*"


4. A FastQC rule

Now, let’s add a rule to run FastQC:

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        """
        module load fastqc/0.11.8
        fastqc {input} -o {FASTQC_DIR} &> {log}
        """

Some things to note about this rule:

  • The rule uses two wildcards, {sample} and {read}, so it will be run once for every combination of sample and read direction, i.e. once per FASTQ file.

  • The shell command refers to {FASTQC_DIR} directly: because FASTQC_DIR is a regular Python variable rather than a wildcard, Snakemake simply substitutes its value into the command.

When we have Snakemake submit jobs to the cluster, and we also specify log files with the log directive, we will have two log files for every job:

  • The file from the log directive (here, log/fastqc_{sample}_{read}.log), which captures the standard output and standard error of the commands that the rule runs.

  • The SLURM log file specified with --output in our profile (log/slurm-{rule}_{wildcards}.out), which captures Snakemake's own output for the submitted job.

Hints

Our rule all input does not match the output of rule fastqc, so trying to run Snakemake without specifying a rule or output file(s) won’t work (recall that by default, Snakemake will run the first rule, which is rule all). We could add this output to rule all, but since we want to add a MultiQC rule in the next step, which uses the output of FastQC, that would only be temporary.

Moreover, rule fastqc contains wildcards in its {input} and {output} directives, so trying to run it using snakemake [...] fastqc would result in an error: Snakemake is not able to figure out what it should run (you can most conveniently try this with a dry-run using snakemake -n fastqc).

snakemake -n fastqc
#> WorkflowError:
#> Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.

To test whether the rule works, it is therefore most convenient to call Snakemake with an actual output file.
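For example, before submitting anything, you could first do a dry-run for one concrete output file to check that Snakemake resolves the wildcards as expected:

snakemake -n results/fastqc/201_R1_fastqc.html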

Solution
$ snakemake --profile slurm_profile \
    results/fastqc/201_R1_fastqc.html
$ ls -lh results/fastqc/
#> total 1.9M
#> -rw-r--r-- 1 jelmer PAS0471 1007K Apr 21 09:19 201_R1_fastqc.html
#> -rw-r--r-- 1 jelmer PAS0471  847K Apr 21 09:19 201_R1_fastqc.zip

# This file will show FastQC's output:  
$ cat log/fastqc_201_R1.log
#> Started analysis of 201_R1.fastq.gz
#> Approx 5% complete for 201_R1.fastq.gz
#> Approx 10% complete for 201_R1.fastq.gz
#> Approx 15% complete for 201_R1.fastq.gz
#> Approx 20% complete for 201_R1.fastq.gz
#> Approx 25% complete for 201_R1.fastq.gz
#> Approx 30% complete for 201_R1.fastq.gz
#> Approx 35% complete for 201_R1.fastq.gz
#> Approx 40% complete for 201_R1.fastq.gz
#> Approx 45% complete for 201_R1.fastq.gz
#> Approx 50% complete for 201_R1.fastq.gz
#> Approx 55% complete for 201_R1.fastq.gz
#> Approx 60% complete for 201_R1.fastq.gz
#> Approx 65% complete for 201_R1.fastq.gz
#> Approx 70% complete for 201_R1.fastq.gz
#> Approx 75% complete for 201_R1.fastq.gz
#> Approx 80% complete for 201_R1.fastq.gz
#> Approx 85% complete for 201_R1.fastq.gz
#> Approx 90% complete for 201_R1.fastq.gz
#> Approx 95% complete for 201_R1.fastq.gz
#> Analysis complete for 201_R1.fastq.gz

$ cat log/slurm-fastqc_read\=R1\,sample\=201.out
#> Building DAG of jobs...
#> Using shell: /usr/bin/bash
#> Provided cores: 1 (use --cores to define parallelism)
#> Rules claiming more threads will be scaled down.
#> Job counts:
#>         count   jobs
#>         1       fastqc
#>         1
#> 
#> [Wed Apr 21 09:19:25 2021]
#> rule fastqc:
#>     input: data/201_R1.fastq.gz
#>     output: results/fastqc/201_R1_fastqc.html
#>     log: log/fastqc_201_R1.log
#>     jobid: 0
#>     wildcards: sample=201, read=R1
#>     resources: cpus=1, mem_mb=4000, time_min=60, account=PAS1855
#> 
#> Activating conda environment:
#> /fs/ess/PAS1855/data/week13/exercise_bonus/step2/.snakemake/conda/7e57eac489af13d2bb58773ba0fd9c38
#> [Wed Apr 21 09:19:36 2021]
#> Finished job 0.
#> 1 of 1 steps (100%) done


Full Snakefile #3
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        "rm -rf results;"
        "rm -rf log/*"

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    conda:
        "envs/fastqc.yaml"
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        "fastqc {input} -o {FASTQC_DIR} &> {log}"


5. A MultiQC rule

We will add a rule that runs MultiQC, which summarizes multiple FastQC reports into a single HTML file.

We will run a shell script that in turn runs MultiQC, rather than running MultiQC directly in the shell directive like we did for FastQC.

The main reason for this is an annoyance with installing MultiQC through Conda: when using a YAML file or attempting a one-step installation, Conda will fail. Instead, we need to do a two-step installation: first, we install Python, and then, we install MultiQC. The scripts/multiqc.sh script that rule multiqc runs will create this environment if it doesn’t already exist, before running MultiQC.

Show the script scripts/multiqc.sh
#!/bin/bash

set -e -u -o pipefail

module load python/3.6-conda5.2

# Process command-line args:
in_dir=$1
out_dir=$2

# Report:
echo "# This script will run MultiQC."
echo "# Input directory: $in_dir"
echo -e "# Output directory: $out_dir \n" 

# Activate/install and activate MultiQC conda environment:
if [[ $(conda env list) = *"multiqc-env"* ]]; then
    echo "# Activating existing Conda environment multiqc-env"
    source activate multiqc-env
else
    echo "# Installing MultiQC in Conda environment multiqc-env"
    conda create -y -n multiqc-env python=3.7
    source activate multiqc-env
    conda install -y -c bioconda -c conda-forge multiqc
fi

# Run MultiQC:
echo -e "\n# Running MultiQC..."
multiqc "$in_dir" -o "$out_dir"

echo -e "\n# Done with script."
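Because the script takes ordinary positional arguments (an input dir and an output dir), you could also test it outside of Snakemake, e.g. (assuming FastQC output already exists in results/fastqc):

scripts/multiqc.sh results/fastqc results/multiqc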


Hints

  • Use the expand() function in the input directive. Besides expanding the possible values for SAMPLES and READS, you will also need to include the FastQC output dir here, which is saved in the variable FASTQC_DIR.

  • In the shell directive, you will call the script scripts/multiqc.sh. Take a look at the script to see what arguments it takes, and specify those.

  • Like in rule fastqc, send standard output and standard error from the script to {log}.

Solutions
rule multiqc:
    """
    A rule to run MultiQC to collate the FastQC results.
    """
    input:
        expand(
            "{fastqc_dir}/{sample}_{read}_fastqc.html",
            fastqc_dir=FASTQC_DIR,
            sample=SAMPLES,
            read=READS,
        ),
    output:
        join(MULTIQC_DIR, "multiqc_report.html"),
    log:
        join(LOG_DIR, "multiqc.log"),
    shell:
        "scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"


Solutions

rule fastqc will also be run for the remaining 5 FASTQ files, since the input of rule multiqc is the output of rule fastqc.

$ snakemake --profile slurm_profile multiqc
$ ls -lh results/multiqc/
#> total 1.2M
#> drwxr-xr-x 2 jelmer PAS0471 4.0K Apr 21 10:38 multiqc_data
#> -rw-r--r-- 1 jelmer PAS0471 1.2M Apr 21 10:38 multiqc_report.html

$ cat log/multiqc.log
#> # This script will run MultiQC.
#> # Input directory: results/fastqc
#> # Output directory: results/multiqc 
#> 
#> # Activating existing Conda environment multiqc-env
#> 
#> # Running MultiQC...
#> [WARNING]         multiqc : MultiQC Version v1.10.1 now available!
#> [INFO   ]         multiqc : This is MultiQC v1.10
#> [INFO   ]         multiqc : Template    : default
#> [INFO   ]         multiqc : Searching   : /fs/project/PAS0471/teach/courses/pracs-sp21/week13/exercise_bonus/step3_multiqc/results/fastqc
#> Searching   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 12/12
#> [INFO   ]          fastqc : Found 6 reports
#> [INFO   ]         multiqc : Compressing plot data
#> [INFO   ]         multiqc : Report      : results/multiqc/multiqc_report.html
#> [INFO   ]         multiqc : Data        : results/multiqc/multiqc_data
#> [INFO   ]         multiqc : MultiQC complete
#> 
#> # Done with script.


Full Snakefile #4
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        """
        rm -rf results
        rm -rf log/*
        """

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        """
        module load fastqc/0.11.8
        fastqc {input} -o {FASTQC_DIR} &> {log}
        """

rule multiqc:
    """
    A rule to run MultiQC to collate the FastQC results.
    """
    input:
        expand(
            "{fastqc_dir}/{sample}_{read}_fastqc.html",
            fastqc_dir=FASTQC_DIR,
            sample=SAMPLES,
            read=READS,
        ),
    output:
        join(MULTIQC_DIR, "multiqc_report.html"),
    log:
        join(LOG_DIR, "multiqc.log"),
    shell:
        "scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"


6. A Cutadapt rule

Next, we will add a rule to run Cutadapt, which will remove the primer sequences from our reads.

rule cutadapt:
    """
    A rule to run Cutadapt to remove primers from the FASTQ files.
    """
    conda:
        "envs/cutadapt.yaml"
    input:
        fwd=join(IN_DIR, "{sample}_R1.fastq.gz"),
        rev=join(IN_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    log:
        join(LOG_DIR, "cutadapt_{sample}.log")
    shell:
        "scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"

Some things to note about this rule:

  • Cutadapt needs the R1 and R2 files of a sample together, so the rule has named inputs and outputs (fwd and rev), and {sample} is its only wildcard.

  • The shell command only passes {input.fwd} to the script: the script infers the name of the corresponding R2 file from the R1 file name.

  • The primer sequences (PRIMER_F and PRIMER_R), which we read from the config file, are passed to the script as arguments, and the script computes their reverse complements.

  • The rule has a conda directive pointing to envs/cutadapt.yaml, so Snakemake will create and use that Conda environment, provided we run it with --use-conda (which we have set in our profile).

The script scripts/cutadapt.sh

#!/bin/bash
set -e -u -o pipefail

# This script will run Cutadapt on a pair (R1+R2) of FASTQ files,
# removing primer sequences.
# Forward and reverse primer sequences are passed as argument to the script,
# and the script will compute the reverse complements of each.

# SETUP --------------------------------------------------------
R1_in="$1"          # Path to the file with forward (R1) reads (R1 filename is inferred from  this)
outdir="$2"         # Output directory
primer_f="$3"       # Sequence of the forward primer
primer_r="$4"       # Sequence of the reverse primer

# Test if the correct number of args were passed to the script:
if [ ! "$#" = 4 ]; then
    echo "Error: must provide 4 arguments; you provided $#"
    echo "Usage: cutadapt_single.sh <input-FASTQ-R1> <outdir> <primer_f> <primer_r>"
    exit 1
fi

# Get reverse primer sequences by reverse complementing:
primer_f_rc=$(echo "$primer_f" | tr ATCGYRKMBDHV TAGCRYMKVHDB | rev)
primer_r_rc=$(echo "$primer_r" | tr ATCGYRKMBDHV TAGCRYMKVHDB | rev)

# Infer R2 (reverse) filename: R1 and R2 are assumed to have the same name
# except for "_R1" vs "_R2":
R2_in=${R1_in/_R1/_R2}

# Check that the input dir is not the same as the output dir,
# or the input files will be overwritten:
indir=$(dirname "$R1_in")
if [ "$indir" = "$outdir" ]; then
    echo "## ERROR: Input dir should not be the same as output dir" >&2
    exit 1
fi

# Output files:
R1_out="$outdir"/$(basename "$R1_in" .fastq.gz).fastq.gz
R2_out="$outdir"/$(basename "$R2_in" .fastq.gz).fastq.gz

# Report:
echo -e "\n## Starting cutadapt script."
date
echo
echo "## Using the following parameters:"
echo "## Input file R1: $R1_in"
echo "## Output dir: $outdir"
echo "## Forward primer: $primer_f"
echo "## Reverse primer: $primer_r"
echo
echo "## Input file R2: $R2_in"
echo "## Output file R1: $R1_out"
echo "## Output file R2: $R2_out"
echo
echo "## Reverse complement of forward primer: $primer_f_rc"
echo "## Reverse complement of reverse primer: $primer_r_rc"
echo

# Test:
if [ ! -f "$R1_in" ] || [ ! -r "$R1_in" ]; then
    echo -e "\n## $0: ERROR: Input file R1_in, $R1_in, not found\n" >&2 && exit 1
fi
if [ ! -f "$R2_in" ] || [ ! -r "$R2_in" ]; then
    echo -e "\n## $0: ERROR: Input file R2_in, $R2_in, not found\n" >&2 && exit 1
fi

# Create output directory if it doesn't already exist:
mkdir -p "$outdir"

# RUN CUTADAPT --------------------------------------------------------
echo -e "\n\n## Running cutadapt..."

cutadapt -a "$primer_f"..."$primer_r_rc" -A "$primer_r"..."$primer_f_rc" \
    --discard-untrimmed --pair-filter=any \
    -o "$R1_out" -p "$R2_out" "$R1_in" "$R2_in"

# Options:
# "-a"/"-A": Primers for R1/R2
# "--discard-untrimmed": Remove pairs with no primer found
# "--pair-filter=any": Remove pair if one read is filtered (=Default)

# REPORT AND FINALIZE --------------------------------------------------------
echo -e "\n## Listing output files:"
ls -lh "$outdir"

echo -e "\n## Done with cutadapt script."
date
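The script can also be run on its own, outside of Snakemake, as long as cutadapt itself is available (e.g. in an activated Conda environment that has it). A sketch, using the primers from our config file:

scripts/cutadapt.sh data/201_R1.fastq.gz results/cutadapt \
    GAGTGYCAGCMGCCGCGGTAA ACGGACTACNVGGGTWTCTAAT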


Hints

  • Like with rule fastqc, you’ll need to specify an individual output file.

Solutions
$ snakemake --profile slurm_profile results/cutadapt/201_R1.fastq.gz
#> Building DAG of jobs...
#> Creating conda environment workflow/envs/cutadapt.yaml...
#> Downloading and installing remote packages.
#> [...]
$ ls -lh results/cutadapt/
#> -rw-r--r-- 1 jelmer PAS0471 6.0M Apr 21 11:25 201_R1.fastq.gz
#> -rw-r--r-- 1 jelmer PAS0471 7.1M Apr 21 11:25 201_R2.fastq.gz

$ less log/cutadapt_201.log
#> ## Starting cutadapt script.
#> Wed Apr 21 11:25:00 EDT 2021
#> 
#> ## Using the following parameters:
#> ## Input file R1: data/201_R1.fastq.gz
#> ## Output dir: results/cutadapt
#> ## Forward primer: GAGTGYCAGCMGCCGCGGTAA
#> ## Reverse primer: ACGGACTACNVGGGTWTCTAAT
#> 
#> ## Input file R2: data/201_R2.fastq.gz
#> ## Output file R1: results/cutadapt/201_R1.fastq.gz
#> ## Output file R2: results/cutadapt/201_R2.fastq.gz
#> 
#> ## Reverse complement of forward primer: TTACCGCGGCKGCTGRCACTC
#> ## Reverse complement of reverse primer: ATTAGAWACCCBNGTAGTCCGT
#> 
#> 
#> ## Running cutadapt...
#> This is cutadapt 3.4 with Python 3.9.2
#> Command line parameters: -a GAGTGYCAGCMGCCGCGGTAA...ATTAGAWACCCBNGTAGTCCGT -A ACGGACTACNVGGGTWTCTAAT...TTACCGCGGCKGCTGRCACTC --discard-untrimmed --pair-filter=any -o results/cutadapt/201_R1.fastq.gz -p results/cutadapt/201_R2.fastq.gz data/201_R1.fastq.gz data/201_R2.fastq.gz
#> Processing reads on 1 core in paired-end mode ...
#> Finished in 4.84 s (103 µs/read; 0.58 M reads/minute).
#> 
#> === Summary ===
#> 
#> Total read pairs processed:             47,070
#>   Read 1 with adapter:                  47,061 (100.0%)
#>   Read 2 with adapter:                  46,675 (99.2%)
#> Pairs written (passing filters):        46,671 (99.2%)


Full Snakefile #5
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        """
        rm -rf results
        rm -rf log/*
        """

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        """
        module load fastqc/0.11.8
        fastqc {input} -o {FASTQC_DIR} &> {log}
        """

rule multiqc:
    """
    A rule to run MultiQC to collate the FastQC results.
    """
    input:
        expand(
            "{fastqc_dir}/{sample}_{read}_fastqc.html",
            fastqc_dir=FASTQC_DIR,
            sample=SAMPLES,
            read=READS,
        ),
    output:
        join(MULTIQC_DIR, "multiqc_report.html"),
    log:
        join(LOG_DIR, "multiqc.log"),
    shell:
        "scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"

rule cutadapt:
    """
    A rule to run Cutadapt to remove primers from the FASTQ files.
    """
    conda:
        "envs/cutadapt.yaml"
    input:
        fwd=join(IN_DIR, "{sample}_R1.fastq.gz"),
        rev=join(IN_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    log:
        join(LOG_DIR, "cutadapt_{sample}.log")
    shell:
        "scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"


7. A DADA2 rule to filter FASTQs

We will run the script scripts/fastq_filter.R to filter and trim our Cutadapt-processed FASTQ files (this will remove poor-quality reads and bases).

This script takes two arguments:

  1. The name of the file with forward reads (in our case, as output by rule cutadapt).

  2. The output directory.

The script processes one pair of FASTQ files, like the Cutadapt rule. It outputs a pair of unzipped FASTQ files with the same names as the input files, just in a different directory (FILTER_DIR, in our case) and with a .fastq instead of a .fastq.gz extension.

The full script scripts/fastq_filter.R

#!/usr/bin/env Rscript

# This script will run the DADA2 function "filterAndTrim" to remove poor-quality
# reads and bases from FASTQ files by filtering and trimming them.
# The script has as input a pair (F and R reads for one sample) of,
# optionally gzipped, FASTQ files, and as output a pair of not-zipped FASTQ files. 

# Set-up -----------------------------------------------------------------------

# Install and/or load packages:
if(!require(pacman)) {
  install.packages("pacman", repos = "http://cran.us.r-project.org")
  library(pacman)
}
packages <- c('BiocManager', 'dada2')
pacman::p_load(char = packages)

# Get command-line args:
args <- commandArgs(trailingOnly = TRUE)
R1_in <- as.character(args[1])
outdir <- as.character(args[2])

# Determine R2 file name - assumed to be same as R1 but with "R2" instead:
R2_in <- sub("_R1", "_R2", R1_in)

# Determine sample ID - file name up until "_R1":
sampleID <- sub("_R1.*", "", basename(R1_in))

# Set output file paths:
R1_out <- file.path(outdir, paste0(sampleID, '_R1.fastq'))
R2_out <- file.path(outdir, paste0(sampleID, '_R2.fastq'))

# Report:
cat("Output dir:", outdir, "\n\n")
cat("R1_in:", R1_in, "\n")
cat("R2_in:", R2_in, "\n")
cat("R1_out:", R1_out, "\n")
cat("R2_out:", R2_out, "\n")


# Filtering and Quality Trimming -----------------------------------------------

print('Filtering and Trimming...')
Sys.time()  # Print the time to keep track of running time for individual steps

# Run the DADA2 function to filter and trim FASTQ files
filter_results <-
  filterAndTrim(
      R1_in, R1_out,          # Forward (F) reads input/output
      R2_in, R2_out,          # Reverse (R) reads input/output
      truncLen = c(250,210),  # Truncate F at 250 and R at 210 bp (remove last bases)
      trimLeft = 10,          # Remove first 10 bases
      maxN = 0,               # Don't allow for any Ns
      maxEE = c(2,2),         # Max. expected nr of errors of 2 in F and 2 in R
      truncQ = 2,             # Truncate when base quality score is 2 or less
      rm.phix = FALSE,        # Don't remove Phi-X
      multithread = 1,        # No multithreading
      compress = FALSE,       # Don't compress output files
      verbose = TRUE          # Verbose runtime info
  )

print('...Done!')
Sys.time()

print('Done with script fastq_filter.R')
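Like the other scripts, fastq_filter.R can be tested on its own. A sketch, assuming rule cutadapt has already produced results/cutadapt/201_R1.fastq.gz (the output dir is created first, in case the script expects it to exist):

module load R/4.0.2-gnu9.1
mkdir -p results/fastq_filter
scripts/fastq_filter.R results/cutadapt/201_R1.fastq.gz results/fastq_filter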


Solutions

rule fastq_filter:
    """
    A rule to run an R script that contains DADA2 functions to filter and trim FASTQ files.
    """
    input:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(FILTER_DIR, "{sample}_R1.fastq"),
        rev=join(FILTER_DIR, "{sample}_R2.fastq"),
    log:
        join(LOG_DIR, "fastq_filter_{sample}.log")
    shell:
        """
        module load R/4.0.2-gnu9.1
        scripts/fastq_filter.R {input.fwd} {FILTER_DIR} &> {log}
        """
$ snakemake --profile slurm_profile \
    results/fastq_filter/201_R1.fastq
$ ls -lh results/fastq_filter/
#> total 42M
#> -rw-r--r-- 1 jelmer PAS0471 23M Apr 21 11:47 201_R1.fastq
#> -rw-r--r-- 1 jelmer PAS0471 19M Apr 21 11:47 201_R2.fastq


Full Snakefile #6
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths:
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        """
        rm -rf results
        rm -rf log/*
        """

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        """
        module load fastqc/0.11.8
        fastqc {input} -o {FASTQC_DIR} &> {log}
        """

rule multiqc:
    """
    A rule to run MultiQC to collate the FastQC results.
    """
    input:
        expand(
            "{fastqc_dir}/{sample}_{read}_fastqc.html",
            fastqc_dir=FASTQC_DIR,
            sample=SAMPLES,
            read=READS,
        ),
    output:
        join(MULTIQC_DIR, "multiqc_report.html"),
    log:
        join(LOG_DIR, "multiqc.log"),
    shell:
        "scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"

rule cutadapt:
    """
    A rule to run Cutadapt to remove primers from the FASTQ files.
    """
    conda:
        "envs/cutadapt.yaml"
    input:
        fwd=join(IN_DIR, "{sample}_R1.fastq.gz"),
        rev=join(IN_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    log:
        join(LOG_DIR, "cutadapt_{sample}.log")
    shell:
        "scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"

rule fastq_filter:
    """
    A rule to run an R script that contains DADA2 functions to filter and trim FASTQ files.
    """
    input:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(FILTER_DIR, "{sample}_R1.fastq"),
        rev=join(FILTER_DIR, "{sample}_R2.fastq"),
    log:
        join(LOG_DIR, "fastq_filter_{sample}.log")
    shell:
        """
        module load R/4.0.2-gnu9.1
        scripts/fastq_filter.R {input.fwd} {FILTER_DIR} &> {log}
        """


8. A DADA2 rule to infer ASVs

Finally, we will create a rule that will run the script scripts/ASV_infer.R, which takes as its input all filtered FASTQ files at once, and outputs a file containing a matrix of counts of ASVs (Amplicon Sequence Variants) for each sample.

The full script scripts/ASV_infer.R
#!/usr/bin/env Rscript

# This script will run several steps of the DADA2 pipeline to infer
# Amplicon Sequence Variants (ASVs) and their abundance in a set of samples.
# The input is a set of FASTQ files from multiple samples
# (as automatically detected given an input dir, which is one of the arguments),
# and the output is a single table with ASV counts per sample.


# Set-up ----------------------------------------------------------------------

args <- commandArgs(trailingOnly = TRUE)
indir <- as.character(args[1])           # Input dir. Input files will be detected.
outfile <- as.character(args[2])         # Output file: an ASV table.
n_cores <- as.integer(args[3])           # Number of cores to use.

# Install and/or load packages:
if(!require(pacman)) {
  install.packages("pacman", repos = "http://cran.us.r-project.org")
  library(pacman)
}
packages <- c('BiocManager', 'dada2')
pacman::p_load(char = packages)

# Detect input files given the input directory passed as an argument:
fastqs_filt_F <- sort(list.files(indir, pattern = '_R1.fastq', full.names = TRUE))
fastqs_filt_R <- sort(list.files(indir, pattern = '_R2.fastq', full.names = TRUE))

# Infer Sample IDs from FASTQ file names:
sampleIDs <- sub("_R1.*", "", basename(fastqs_filt_F))

# Report:
cat("Input dir:", indir, "\n")
cat("Output file:", outfile, "\n")
cat("Number of threads/cores:", n_cores, "\n")
cat("Sample IDs:", sampleIDs, "\n")


# Dereplication ----------------------------------------------------------------

print('Dereplicating FASTQs...')
Sys.time() # Keep track of time to see how long each step takes

fastqs_derep_F <- derepFastq(fastqs_filt_F, verbose = FALSE)
fastqs_derep_R <- derepFastq(fastqs_filt_R, verbose = FALSE)

names(fastqs_derep_F) <- sampleIDs
names(fastqs_derep_R) <- sampleIDs

print('...Done!')
Sys.time()


# Error training  -------------------------------------------------------------

print('Learning errors...')

err_F <- learnErrors(fastqs_derep_F, multithread = n_cores, verbose = TRUE)
err_R <- learnErrors(fastqs_derep_R, multithread = n_cores, verbose = TRUE)

print('...Done!')
Sys.time()


# Infer ASVs  -------------------------------------------------------------

print('Inferring ASVs (running the dada algorithm)...')

dada_Fs <- dada(fastqs_derep_F, err = err_F,
                pool = FALSE, multithread = n_cores)
dada_Rs <- dada(fastqs_derep_R, err = err_R,
                pool = FALSE, multithread = n_cores)

print('...Done!')
Sys.time()


# Merge read pairs ------------------------------------------------------------

print('Merging read pairs...')

mergers <- mergePairs(dada_Fs, fastqs_derep_F,
                      dada_Rs, fastqs_derep_R,
                      verbose = TRUE)


# Construct a sequence table --------------------------------------------------

print('Constructing a sequence table...')

seqtab_all <- makeSequenceTable(mergers)


# Remove chimeras ------------------------------------------------------------

print('Remove chimeras...')

seqtab <- removeBimeraDenovo(seqtab_all,
                             method = 'consensus',
                             multithread = n_cores,
                             verbose = TRUE)

# Write to file ----------------------------------------------------------------

print('Write table to file...')

write.table(seqtab, outfile,
            sep = "\t", row.names = TRUE, quote = FALSE)

print('Done with script ASV_infer.R')


For this rule, we will tinker a bit with the resource request to SLURM. This script is more computationally intensive than the previous ones, which makes sense because it processes all files at once.

Specifically, the DADA2 functions that we run in the script take a multithread argument, specifying the number of cores/threads that can be used. Therefore, scripts/ASV_infer.R also takes such an argument, which is passed on to the DADA2 functions.

How can we make sure Snakemake allocates 8 cores/threads for this job?

Solution.

By using a threads directive in the rule:

threads: 8

And how can we make sure Snakemake passes a request of 8 cores/threads to SLURM for this job?

Solution.

By using a resources directive in the rule, with a key cpus that matches what we have specified in slurm_profile/config.yaml:

resources: cpus=8

Here is the relevant matching part of slurm_profile/config.yaml:

cluster: "sbatch --account={resources.account}
                 --time={resources.time_min}
                 --mem={resources.mem_mb}
                 --cpus-per-task={resources.cpus}
                 --output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=4000, time_min=60, account=PAS1855]

Additionally, we will request an appropriate amount of memory: each core has about 4 GB of memory, so 8 * 4 = 32 GB makes sense here (because our default resource request is 4000 MB, we would otherwise be requesting only 4 GB in total, even with 8 cores). Our full resources directive will therefore look like this:

resources: cpus=8, mem_mb=32000
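Based on the cluster template in slurm_profile/config.yaml, the ASV_infer job would then be submitted with sbatch options along these lines (a sketch only: Snakemake actually submits a job script it generates itself, shown here as a placeholder, and {wildcards} is empty for this rule):

sbatch --account=PAS1855 --time=60 --mem=32000 --cpus-per-task=8 \
    --output=log/slurm-ASV_infer_.out <Snakemake-generated job script>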

The script scripts/ASV_infer.R takes three arguments, in the following order:

  1. An input directory – i.e., the output dir of the previous rule.

  2. An output filename – you can tell from rule all what that should be.

  3. The number of cores/threads for the job – which you can pass using {threads}.

Hints

  • Because this is the final rule in the workflow, whose output is specified as the input of rule all, running Snakemake without specifying a target rule or output file will do the job.

  • Note that the previous two rules will also automatically be run for the two remaining samples.

Solutions

rule ASV_infer:
    """
    A rule to run an R script that contains DADA2 functions to infer ASVs,
    and produce a matrix with ASV counts for each sample.
    """
    threads: 8
    resources: cpus=8, mem_mb=32000
    input:
        expand(
            "{filter_dir}/{sample}_{read}.fastq",
            sample=SAMPLES,
            read=READS,
            filter_dir=FILTER_DIR,
        ),
    output:
        join(ASV_DIR, "ASV_table.tsv")
    log:
        join(LOG_DIR, "ASV_infer.log")
    shell:
        """
        module load R/4.0.2-gnu9.1
        scripts/ASV_infer.R {FILTER_DIR} {output} {threads} &> {log}
        """
$ snakemake --profile slurm_profile
$ ls -lh results/ASV_infer/
#> -rw-r--r-- 1 jelmer PAS0471 426K Apr 21 12:57 ASV_table.tsv

Let’s see an overview of all output files of the workflow:

$ tree results
#> results
#> ├── ASV_infer
#> │   └── ASV_table.tsv
#> ├── cutadapt
#> │   ├── 201_R1.fastq.gz
#> │   ├── 201_R2.fastq.gz
#> │   ├── 202_R1.fastq.gz
#> │   ├── 202_R2.fastq.gz
#> │   ├── 203_R1.fastq.gz
#> │   └── 203_R2.fastq.gz
#> ├── fastqc
#> │   ├── 201_R1_fastqc.html
#> │   ├── 201_R1_fastqc.zip
#> │   ├── 201_R2_fastqc.html
#> │   ├── 201_R2_fastqc.zip
#> │   ├── 202_R1_fastqc.html
#> │   ├── 202_R1_fastqc.zip
#> │   ├── 202_R2_fastqc.html
#> │   ├── 202_R2_fastqc.zip
#> │   ├── 203_R1_fastqc.html
#> │   ├── 203_R1_fastqc.zip
#> │   ├── 203_R2_fastqc.html
#> │   └── 203_R2_fastqc.zip
#> ├── fastq_filter
#> │   ├── 201_R1.fastq
#> │   ├── 201_R2.fastq
#> │   ├── 202_R1.fastq
#> │   ├── 202_R2.fastq
#> │   ├── 203_R1.fastq
#> │   └── 203_R2.fastq
#> └── multiqc
#>     ├── multiqc_data
#>     │   ├── multiqc_data.json
#>     │   ├── multiqc_fastqc.txt
#>     │   ├── multiqc_general_stats.txt
#>     │   ├── multiqc.log
#>     │   └── multiqc_sources.txt
#>     └── multiqc_report.html

Our final Snakefile

(You can also find the final workflow files at /fs/ess/PAS1855/exercises/week13/bonus/final_workflow.)
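If you would like to recreate a DAG figure of the full workflow like the one shown at the start of this exercise, you can have Snakemake print the graph in Graphviz format and render it with dot (assuming Graphviz is available, e.g. through Conda or a module):

snakemake --dag | dot -Tsvg > dag.svg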

"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""

# We'll use the `join` function to create paths:
from os.path import join

# Workflow configuration file:
configfile: "workflow/config.yaml"

# Get settings from config file:
IN_DIR=config["in_dir"]
OUT_DIR=config["out_dir"]
LOG_DIR=config["log_dir"]

PRIMER_F=config["primer_f"]
PRIMER_R=config["primer_r"]

# Other constants, mostly based on config file:
# Samples and reads:
SAMPLES = glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
READS = ["R1", "R2"]
# All output dirs:
FASTQC_DIR=join(OUT_DIR, "fastqc")
MULTIQC_DIR=join(OUT_DIR, "multiqc")
CUTADAPT_DIR=join(OUT_DIR, "cutadapt")
FILTER_DIR=join(OUT_DIR, "fastq_filter")
ASV_DIR=join(OUT_DIR, "ASV_infer")

# Define local rules:
localrules: all, clean  

# Rules:
rule all:
    """
    A pseudo-rule that triggers execution of the entire workflow.
    Its input files include all final output files.
    """
    input:
        join(MULTIQC_DIR, "multiqc_report.html"),
        join(ASV_DIR, "ASV_table.tsv")

rule clean:
    """
    A rule to remove all results from the workflow.
    """
    shell:
        """
        rm -rf results
        rm -rf log/*
        """

rule fastqc:
    """
    A rule to run FastQC to QC the FASTQ files.
    """
    input:
        join(IN_DIR, "{sample}_{read}.fastq.gz"),
    output:
        join(FASTQC_DIR, "{sample}_{read}_fastqc.html"),
    log:
        join(LOG_DIR, "fastqc_{sample}_{read}.log"),
    shell:
        """
        module load fastqc/0.11.8
        fastqc {input} -o {FASTQC_DIR} &> {log}
        """

rule multiqc:
    """
    A rule to run MultiQC to collate the FastQC results.
    """
    input:
        expand(
            "{fastqc_dir}/{sample}_{read}_fastqc.html",
            fastqc_dir=FASTQC_DIR,
            sample=SAMPLES,
            read=READS,
        ),
    output:
        join(MULTIQC_DIR, "multiqc_report.html"),
    log:
        join(LOG_DIR, "multiqc.log"),
    shell:
        "scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"

rule cutadapt:
    """
    A rule to run Cutadapt to remove primers from the FASTQ files.
    """
    conda:
        "envs/cutadapt.yaml"
    input:
        fwd=join(IN_DIR, "{sample}_R1.fastq.gz"),
        rev=join(IN_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    log:
        join(LOG_DIR, "cutadapt_{sample}.log")
    shell:
        "scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"

rule fastq_filter:
    """
    A rule to run an R script that contains DADA2 functions to filter and trim FASTQ files.
    """
    input:
        fwd=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
        rev=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
    output:
        fwd=join(FILTER_DIR, "{sample}_R1.fastq"),
        rev=join(FILTER_DIR, "{sample}_R2.fastq"),
    log:
        join(LOG_DIR, "fastq_filter_{sample}.log")
    shell:
        """
        module load R/4.0.2-gnu9.1
        scripts/fastq_filter.R {input.fwd} {FILTER_DIR} &> {log}
        """

rule ASV_infer:
    """
    A rule to run an R script that contains DADA2 functions to infer ASVs,
    and produce a matrix with ASV counts for each sample.
    """
    threads: 8
    resources: cpus=8, mem_mb=32000
    input:
        expand(
            "{filter_dir}/{sample}_{read}.fastq",
            sample=SAMPLES,
            read=READS,
            filter_dir=FILTER_DIR,
        ),
    output:
        join(ASV_DIR, "ASV_table.tsv")
    log:
        join(LOG_DIR, "ASV_infer.log")
    shell:
        """
        module load R/4.0.2-gnu9.1
        scripts/ASV_infer.R {FILTER_DIR} {output} {threads} &> {log}
        """




Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".