So far, we have not worked with real data in the Snakemake workflows we created. Therefore, in this bonus exercise, you’ll create a workflow that processes actual FASTQ files, which should be useful practice if you intend to use Snakemake for your final project or your own research.
We’ll use the same 6 FASTQ files that we explored in the exercises for week 5 and the second graded assignment. These 6 FASTQ files consist of 3 pairs of R1
(forward reads) and R2
(reverse reads) files for 3 biological samples that contain sequences from the V4 region of 16S rRNA, generated in a metabarcoding experiment.
In the workflow, we will perform QC with FastQC and MultiQC, remove primers with Cutadapt, and filter FASTQ files and infer Amplicon Sequence Variants (ASVs) with the R package DADA2.
You can copy the data and a project skeleton from /fs/ess/PAS1855/exercises/:
cd /fs/ess/PAS1855/users/$USER/week13
cp -r /fs/ess/PAS1855/exercises/week13/bonus/initial_workflow bonus_exercise
# Go into the exercise dir:
cd bonus_exercise
# Make sure all the scripts are executable:
chmod +x scripts/*
Take a look at the files we have:
$ tree
#> .
#> ├── data
#> │ ├── 201_R1.fastq.gz
#> │ ├── 201_R2.fastq.gz
#> │ ├── 202_R1.fastq.gz
#> │ ├── 202_R2.fastq.gz
#> │ ├── 203_R1.fastq.gz
#> │ └── 203_R2.fastq.gz
#> ├── log
#> ├── scripts
#> │ ├── cutadapt.sh
#> │ ├── fastq_filter.R
#> │ ├── infer_ASVs.R
#> │ └── multiqc.sh
#> ├── slurm_profile
#> │ └── config.yaml
#> └── workflow
#> ├── config.yaml
#> ├── envs
#> └── Snakefile
Finally, load the Conda environment in which you have Snakemake installed:
module load python/3.6-conda5.2
source activate ipy-env
# Check if Snakemake can be found and is inside your ipy-env:
which snakemake
#> ~/.conda/envs/ipy-env/bin/snakemake
Here is a DAG (Directed Acyclic Graph) of the full workflow that we will build. The initial Snakefile that we start from is shown below:
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""
# We'll use the `join` function to create paths
from os.path import join
# Settings:
="data"
IN_DIR="results"
OUT_DIR="log"
LOG_DIR
"GAGTGYCAGCMGCCGCGGTAA"
PRIMER_F: "ACGGACTACNVGGGTWTCTAAT"
PRIMER_R:
# Samples and reads:
= glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
SAMPLES = ["R1", "R2"]
READS
# All output dirs:
=join(OUT_DIR, "fastqc")
FASTQC_DIR=join(OUT_DIR, "multiqc")
MULTIQC_DIR=join(OUT_DIR, "cutadapt")
CUTADAPT_DIR=join(OUT_DIR, "fastq_filter")
FILTER_DIR=join(OUT_DIR, "ASV_infer")
ASV_DIR
# Define local rules:
all, clean
localrules:
# Rules:
all:
rule """
A pseudo-rule that triggers execution of the entire workflow.
It's input files include all final output files.
"""
input:
"multiqc_report.html"),
join(MULTIQC_DIR, "ASV_table.tsv")
join(ASV_DIR,
rule clean:"""
A rule to remove all results from the workflow.
"""
shell:"rm -rf results;"
"rm -rf log/*"
Our current Snakefile has a couple of features we haven’t seen before:
We use generic Python code directly in the Snakefile: we import the join() function from the os.path module, and further down, use this function to build file paths:
from os.path import join
FASTQC_DIR=join(OUT_DIR, "fastqc")
We have a rule clean
, which is a common convenience rule to remove all output and log files from the workflow. As you are developing and testing a workflow, you may want to do this regularly.
In the shell
directive of rule clean
, we have two lines of shell code, which works because we use Python triple-quotes.
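For example, once the profile described below is set up, resetting the workflow could look something like the following. This is just a sketch; the exact invocation may depend on your Snakemake version and setup:
# Dry-run first to confirm that only the clean rule would be run:
snakemake -n clean
# Run the clean rule (it is a local rule, so no SLURM job is submitted):
snakemake --profile slurm_profile clean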
Also of note:
In rule all
, the output files of the final workflow are already specified, so we won’t have to change this rule in the exercises below.
We will run the workflow on the cluster in the sense that each Snakemake job will be a SLURM job. Because rule all and rule clean are not computationally intensive at all, we specify them as “local rules” (i.e., rules not to be submitted as SLURM jobs) using the localrules directive. (The overhead of submitting a SLURM job and waiting for it to start would just slow things down for those rules.)
We use a “profile” file, slurm_profile/config.yaml, which is very similar to the one in a previous Snakemake exercise:
cluster: "sbatch --account={resources.account}
--time={resources.time_min}
--mem={resources.mem_mb}
--cpus-per-task={resources.cpus}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=4000, time_min=60, account=PAS1855]
latency-wait: 30
jobs: 100
use-conda: true
This file tells Snakemake how to submit jobs to the cluster, and provides some default resource values that we would be able to modify for individual rules (as we will eventually do for rule ASV_infer
).
Recall that this file simply contains options that we could also pass at the command line. We can therefore also take the opportunity to specify the maximum number of jobs (which we normally do with the -j
option) and ensure that Snakemake creates and uses any Conda environments specified in the Snakefile (use-conda: true
; corresponds to the --use-conda
option).
Then, we don’t have to specify this at the command line when we call Snakemake, and the only option we need to pass to Snakemake is --profile
. For this option, we specify the directory (not the file name, which is assumed to be config.yaml
by Snakemake) in which our YAML file is found: snakemake --profile slurm_profile
.
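For comparison, passing the same settings directly on the command line would look roughly like the sketch below. This is only an illustration, with fixed default resource values instead of the per-rule {resources.*} placeholders that the profile can use; it is not a command you need to run:
snakemake \
    --cluster "sbatch --account=PAS1855 --time=60 --mem=4000 --cpus-per-task=1" \
    --jobs 100 \
    --latency-wait 30 \
    --use-conda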
Instead of specifying settings like input and output directories as well as primer sequences in our Snakefile, it would be useful to separate such settings from the workflow itself. Snakemake supports the use of a configuration file to do so.
Let’s create a configuration file workflow/config.yaml
with the following settings:
in_dir: "data"
out_dir: "results"
log_dir: "log"
primer_f: "GAGTGYCAGCMGCCGCGGTAA"
primer_r: "ACGGACTACNVGGGTWTCTAAT"
The format of the file is YAML just like the slurm_profile/config.yaml
above and the files we have used to specify and save Conda environments.
Now, we need a way to tell Snakemake about this config file. We can do so using the configfile
directive in the Snakefile. After we do that, all key-value pairs from the config file will be available in the Snakefile in a Python dictionary called config
:
# Workflow configuration file:
"workflow/config.yaml"
configfile:
# Get settings from the config file:
=config["in_dir"]
IN_DIR=config["out_dir"]
OUT_DIR=config["log_dir"] LOG_DIR
Replace the 5 lines of code in the Snakefile under the # Settings:
line with the lines printed directly above this box.
Then, assign the primer sequences in config.yaml
to the variables PRIMER_F
and PRIMER_R
, respectively.
Assign the primer sequences to variables PRIMER_F
and PRIMER_R
:
=config["primer_f"]
PRIMER_F=config["primer_r"] PRIMER_R
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""
# We'll use the `join` function to create paths
from os.path import join
# Workflow configuration file:
"workflow/config.yaml"
configfile:
# Get settings from config file:
=config["in_dir"]
IN_DIR=config["out_dir"]
OUT_DIR=config["log_dir"]
LOG_DIR
=config["primer_f"]
PRIMER_F=config["primer_r"]
PRIMER_R
# Other constants, mostly based on config file:
# Samples and reads:
= glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
SAMPLES = ["R1", "R2"]
READS # All output dirs:
=join(OUT_DIR, "fastqc")
FASTQC_DIR=join(OUT_DIR, "multiqc")
MULTIQC_DIR=join(OUT_DIR, "cutadapt")
CUTADAPT_DIR=join(OUT_DIR, "fastq_filter")
FILTER_DIR=join(OUT_DIR, "ASV_infer")
ASV_DIR
# Define local rules:
all, clean
localrules:
# Rules:
all:
rule """
A pseudo-rule that triggers execution of the entire workflow.
It's input files include all final output files.
"""
input:
"multiqc_report.html"),
join(MULTIQC_DIR, "ASV_table.tsv")
join(ASV_DIR,
rule clean:"""
A rule to remove all results from the workflow.
"""
shell:"rm -rf results;"
"rm -rf log/*"
Now, let’s add a rule to run FastQC:
rule fastqc:"""
A rule to run FastQC to QC the FASTQ files.
"""
input:
"{sample}_{read}.fastq.gz"),
join(IN_DIR,
output:"{sample}_{read}_fastqc.html"),
join(FASTQC_DIR,
log:"fastqc/{sample}_{read}.log"),
join(LOG_DIR,
shell:"""
module load fastqc/0.11.8
fastqc {input} -o {FASTQC_DIR} &> {log}
"""
Some things to note about this rule:
Previously we have always run scripts in the shell
directive, but in rule fastqc
, we directly run the program FastQC in the shell
directive.
Because FastQC doesn’t allow us to specify the output file name, we just provide an output directory. Note that in the shell directive, we can access any Snakefile variable using {} (not just values of directives like {input} and {log}). Therefore, we can retrieve the FastQC output dir with {FASTQC_DIR}. (A related caveat about literal curly braces is shown after these notes.)
In the input
and output
directives, we are using two wildcards: {sample}
and {read}
.
To avoid excessive software installations, we load the OSC module for FastQC here. If we had wanted increased portability, we could instead have specified a Conda environment using a YAML file like workflow/envs/fastqc.yaml with the following contents:
channels:
- bioconda
dependencies:
- fastqc=0.11.8
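One caveat related to the earlier note about {}: because Snakemake treats the shell command as a format string, any literal curly braces would need to be doubled. The hypothetical snippet below (not part of our workflow) illustrates the idea:
# The doubled braces are passed to awk as literal braces,
# while {input} and {output} are filled in by Snakemake:
shell: "awk '{{print $1}}' {input} > {output}"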
When we have Snakemake submit jobs to the cluster, and we also specify log files with the log
directive, we will have two log files for every job:
The log file that we define in the Snakefile will contain standard output and/or standard error from our actual shell
command.
The SLURM log file will contain Snakemake’s standard output and standard error.
Add rule fastqc
, with the contents as specified above, to the Snakefile.
Get Snakemake to run rule fastqc
for at least one of the FASTQ files.
Check for the presence of output files and look at the log files.
Our rule all
input does not match the output of rule fastqc
, so trying to run Snakemake without specifying a rule or output file(s) won’t work (recall that by default, Snakemake will run the first rule, which is rule all
). We could add this to rule all
, but since we want to add a rule MultiQC in the next step, which uses the output of FastQC, that would only be temporary.
Moreover, rule fastqc
contains wildcards in its {input}
and {output}
directives, so trying to run it using snakemake [...] fastqc
would result in an error: Snakemake is not able to figure out what it should run (you can most conveniently try this with a dry-run using snakemake -n fastqc
).
snakemake -n fastqc
#> WorkflowError:
#> Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
To test whether the rule functions, it is therefore most convenient to just call Snakemake with an actual output file.
$ snakemake --profile slurm_profile \
results/fastqc/201_R1_fastqc.html
$ ls -lh results/fastqc/
#> total 1.9M
#> -rw-r--r-- 1 jelmer PAS0471 1007K Apr 21 09:19 201_R1_fastqc.html
#> -rw-r--r-- 1 jelmer PAS0471 847K Apr 21 09:19 201_R1_fastqc.zip
# This file will show FastQC's output:
$ cat log/fastqc_201_R1.log
#> Started analysis of 201_R1.fastq.gz
#> Approx 5% complete for 201_R1.fastq.gz
#> Approx 10% complete for 201_R1.fastq.gz
#> Approx 15% complete for 201_R1.fastq.gz
#> Approx 20% complete for 201_R1.fastq.gz
#> Approx 25% complete for 201_R1.fastq.gz
#> Approx 30% complete for 201_R1.fastq.gz
#> Approx 35% complete for 201_R1.fastq.gz
#> Approx 40% complete for 201_R1.fastq.gz
#> Approx 45% complete for 201_R1.fastq.gz
#> Approx 50% complete for 201_R1.fastq.gz
#> Approx 55% complete for 201_R1.fastq.gz
#> Approx 60% complete for 201_R1.fastq.gz
#> Approx 65% complete for 201_R1.fastq.gz
#> Approx 70% complete for 201_R1.fastq.gz
#> Approx 75% complete for 201_R1.fastq.gz
#> Approx 80% complete for 201_R1.fastq.gz
#> Approx 85% complete for 201_R1.fastq.gz
#> Approx 90% complete for 201_R1.fastq.gz
#> Approx 95% complete for 201_R1.fastq.gz
#> Analysis complete for 201_R1.fastq.gz
$ cat log/slurm-fastqc_read\=R1\,sample\=201.out
#> Building DAG of jobs...
#> Using shell: /usr/bin/bash
#> Provided cores: 1 (use --cores to define parallelism)
#> Rules claiming more threads will be scaled down.
#> Job counts:
#> count jobs
#> 1 fastqc
#> 1
#>
#> [Wed Apr 21 09:19:25 2021]
#> rule fastqc:
#> input: data/201_R1.fastq.gz
#> output: results/fastqc/201_R1_fastqc.html
#> log: log/fastqc_201_R1.log
#> jobid: 0
#> wildcards: sample=201, read=R1
#> resources: cpus=1, mem_mb=4000, time_min=60, account=PAS1855
#>
#> Activating conda environment: /fs/ess/PAS1855/data/week13/exercise_bonus/step2/.snakemake/conda/7e57eac489af13d2bb58773ba0fd9c38
#> [Wed Apr 21 09:19:36 2021]
#> Finished job 0.
#> 1 of 1 steps (100%) done
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""
# We'll use the `join` function to create paths
from os.path import join
# Workflow configuration file:
"workflow/config.yaml"
configfile:
# Get settings from config file:
=config["in_dir"]
IN_DIR=config["out_dir"]
OUT_DIR=config["log_dir"]
LOG_DIR
=config["primer_f"]
PRIMER_F=config["primer_r"]
PRIMER_R
# Other constants, mostly based on config file:
# Samples and reads:
= glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
SAMPLES = ["R1", "R2"]
READS # All output dirs:
=join(OUT_DIR, "fastqc")
FASTQC_DIR=join(OUT_DIR, "multiqc")
MULTIQC_DIR=join(OUT_DIR, "cutadapt")
CUTADAPT_DIR=join(OUT_DIR, "fastq_filter")
FILTER_DIR=join(OUT_DIR, "ASV_infer")
ASV_DIR
# Define local rules:
all, clean
localrules:
# Rules:
all:
rule """
A pseudo-rule that triggers execution of the entire workflow.
It's input files include all final output files.
"""
input:
"multiqc_report.html"),
join(MULTIQC_DIR, "ASV_table.tsv")
join(ASV_DIR,
rule clean:"""
A rule to remove all results from the workflow.
"""
shell:"rm -rf results;"
"rm -rf log/*"
rule fastqc:"""
A rule to run FastQC to QC the FASTQ files.
"""
conda:"envs/fastqc.yaml"
input:
"{sample}_{read}.fastq.gz"),
join(IN_DIR,
output:"{sample}_{read}_fastqc.html"),
join(FASTQC_DIR,
log:"fastqc_{sample}_{read}.log"),
join(LOG_DIR,
shell:"fastqc {input} -o {FASTQC_DIR} &> {log}"
We will add a rule that runs MultiQC, which summarizes multiple FastQC reports into a single HTML file.
We will run a shell script that in turn runs MultiQC, rather than running MultiQC directly in the shell
directive like we did for FastQC.
The main reason for this is an annoyance with installing MultiQC through Conda: when using a YAML file or attempting a one-step installation, Conda will fail. Instead, we need to do a two-step installation: first, we install Python, and then, we install MultiQC. The scripts/multiqc.sh
script that rule multiqc
runs will create this environment if it doesn’t already exist, before running MultiQC.
scripts/multiqc.sh
#!/bin/bash
set -e -u -o pipefail
module load python/3.6-conda5.2
# Process command-line args:
in_dir=$1
out_dir=$2
# Report:
echo "# This script will run MultiQC."
echo "# Input directory: $in_dir"
echo -e "# Output directory: $out_dir \n"
# Install (if needed) and activate the MultiQC conda environment:
if [[ $(conda env list) = *"multiqc-env"* ]]; then
echo "# Activating existing Conda environment multiqc-env"
source activate multiqc-env
else
echo "# Installing MultiQC in Conda environment multiqc-env"
conda create -y -n multiqc-env python=3.7
source activate multiqc-env
conda install -y -c bioconda -c conda-forge multiqc
fi
# Run MultiQC:
echo -e "\n# Running MultiQC..."
multiqc "$in_dir" -o "$out_dir"
echo -e "\n# Done with script."
Build a rule multiqc
to run MultiQC with the script scripts/multiqc.sh
.
Recall that MultiQC will run once with all FastQC output files as its input. The output file will be automatically called multiqc_report.html
so the output
directive should be:
output:"multiqc_report.html"), join(MULTIQC_DIR,
You’ll still need to come up with input
, log
, and shell
directives for the rule.
Use the expand() function in the input directive. Besides expanding the possible values for SAMPLES and READS, you will also need to include the FastQC output dir here, which is saved in the variable FASTQC_DIR (see the illustration after these hints).
In the shell
directive, you will call the script scripts/multiqc.sh
. Take a look at the script to see what arguments it takes, and specify those.
Like in rule fastqc
, send standard output and standard error from the script to {log}
.
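To make the expand() hint concrete, here is an illustration of what such a pattern evaluates to, using this workflow’s FastQC output dir and the three sample IDs we expect glob_wildcards() to find (the order of the resulting list may differ):
expand("{fastqc_dir}/{sample}_{read}_fastqc.html",
       fastqc_dir="results/fastqc", sample=["201", "202", "203"], read=["R1", "R2"])
# => ["results/fastqc/201_R1_fastqc.html", "results/fastqc/201_R2_fastqc.html",
#     "results/fastqc/202_R1_fastqc.html", "results/fastqc/202_R2_fastqc.html",
#     "results/fastqc/203_R1_fastqc.html", "results/fastqc/203_R2_fastqc.html"]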
rule multiqc:"""
A rule to run MultiQC to collate the FastQC results.
"""
input:
expand("{fastqc_dir}/{sample}_{read}_fastqc.html",
=FASTQC_DIR,
fastqc_dir=SAMPLES,
sample=READS,
read
),
output:"multiqc_report.html"),
join(MULTIQC_DIR,
log:"multiqc.log"),
join(LOG_DIR,
shell:"scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"
rule fastqc
will also be run for the remaining 5 FASTQ files, since the input of rule multiqc
is the output of rule fastqc
.
$ snakemake --profile slurm_profile multiqc
$ ls -lh results/multiqc/
#> total 1.2M
#> drwxr-xr-x 2 jelmer PAS0471 4.0K Apr 21 10:38 multiqc_data
#> -rw-r--r-- 1 jelmer PAS0471 1.2M Apr 21 10:38 multiqc_report.html
$ cat log/multiqc.log
#> # This script will run MultiQC.
#> # Input directory: results/fastqc
#> # Output directory: results/multiqc
#>
#> # Activating existing Conda environment multiqc-env
#>
#> # Running MultiQC...
#> [WARNING] multiqc : MultiQC Version v1.10.1 now available!
#> [INFO ] multiqc : This is MultiQC v1.10
#> [INFO ] multiqc : Template : default
#> [INFO ]         multiqc : Searching : /fs/project/PAS0471/teach/courses/pracs-sp21/week13/exercise_bonus/step3_multiqc/results/fastqc
#> Searching ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 12/12
#> [INFO ]          fastqc : Found 6 reports
#> [INFO ] multiqc : Compressing plot data
#> [INFO ] multiqc : Report : results/multiqc/multiqc_report.html
#> [INFO ] multiqc : Data : results/multiqc/multiqc_data
#> [INFO ] multiqc : MultiQC complete
#>
#> # Done with script.
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""
# We'll use the `join` function to create paths
from os.path import join
# Workflow configuration file:
"workflow/config.yaml"
configfile:
# Get settings from config file:
=config["in_dir"]
IN_DIR=config["out_dir"]
OUT_DIR=config["log_dir"]
LOG_DIR
=config["primer_f"]
PRIMER_F=config["primer_r"]
PRIMER_R
# Other constants, mostly based on config file:
# Samples and reads:
= glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
SAMPLES = ["R1", "R2"]
READS # All output dirs:
=join(OUT_DIR, "fastqc")
FASTQC_DIR=join(OUT_DIR, "multiqc")
MULTIQC_DIR=join(OUT_DIR, "cutadapt")
CUTADAPT_DIR=join(OUT_DIR, "fastq_filter")
FILTER_DIR=join(OUT_DIR, "ASV_infer")
ASV_DIR
# Define local rules:
all, clean
localrules:
# Rules:
all:
rule """
A pseudo-rule that triggers execution of the entire workflow.
It's input files include all final output files.
"""
input:
"multiqc_report.html"),
join(MULTIQC_DIR, "ASV_table.tsv")
join(ASV_DIR,
rule clean:"""
A rule to remove all results from the workflow.
"""
shell:"""
rm -rf results
rm -rf log/*
"""
rule fastqc:"""
A rule to run FastQC to QC the FASTQ files.
"""
input:
"{sample}_{read}.fastq.gz"),
join(IN_DIR,
output:"{sample}_{read}_fastqc.html"),
join(FASTQC_DIR,
log:"fastqc_{sample}_{read}.log"),
join(LOG_DIR,
shell:"""
module load fastqc/0.11.8
fastqc {input} -o {FASTQC_DIR} &> {log}
"""
rule multiqc:"""
A rule to run MultiQC to collate the FastQC results.
"""
input:
expand("{fastqc_dir}/{sample}_{read}_fastqc.html",
=FASTQC_DIR,
fastqc_dir=SAMPLES,
sample=READS,
read
),
output:"multiqc_report.html"),
join(MULTIQC_DIR,
log:"multiqc.log"),
join(LOG_DIR,
shell:"scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"
Next, we will add a rule to run Cutadapt, which will remove the primer sequences from our reads.
rule cutadapt:"""
A rule to run Cutadapt to remove primers from the FASTQ files.
"""
conda:"envs/cutadapt.yaml"
input:
=join(IN_DIR, "{sample}_R1.fastq.gz"),
fwd=join(IN_DIR, "{sample}_R2.fastq.gz"),
rev
output:=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
fwd=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
rev
log:"cutadapt_{sample}.log")
join(LOG_DIR,
shell:"scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"
Some things to note about this rule:
We are processing one sample and therefore two FASTQ files (forward and reverse) at a time.
Again, we run a script (scripts/cutadapt.sh
) rather than calling the program directly in the rule. In this case, that’s the most convenient option because we will have the script compute the reverse complements of the primer sequences.
The script takes four arguments: the path to the file with forward (R1) reads, the output directory, the forward primer sequence, and the reverse primer sequence. The output file names are automatically determined by the script.
scripts/cutadapt.sh
#!/bin/bash
set -e -u -o pipefail
# This script will run Cutadapt on a pair (R1+R2) of FASTQ files,
# removing primer sequences.
# Forward and reverse primer sequences are passed as argument to the script,
# and the script will compute the reverse complements of each.
# SETUP --------------------------------------------------------
R1_in="$1" # Path to the file with forward (R1) reads (R1 filename is inferred from this)
outdir="$2" # Output directory
primer_f="$3" # Sequence of the forward primer
primer_r="$4" # Sequence of the reverse primer
# Test if the correct number of args were passed to the script:
if [ ! "$#" = 4 ]; then
echo "Error: must provide 4 arguments; you provided $#"
echo "Usage: cutadapt_single.sh <input-FASTQ-R1> <outdir> <primer_f> <primer_r>"
exit 1
fi
# Get the reverse complements of the primer sequences:
primer_f_rc=$(echo "$primer_f" | tr ATCGYRKMBDHV TAGCRYMKVHDB | rev)
primer_r_rc=$(echo "$primer_r" | tr ATCGYRKMBDHV TAGCRYMKVHDB | rev)
# Infer R2 (reverse) filename: R1 and R2 are assumed to have the same name
# except for "_R1_" vs "_R1_":
R2_in=${R1_in/_R1/_R2}
# Check that the input dir is not the same as the output dir,
# or the input files will be overwritten:
indir=$(dirname "$R1_in")
if [ "$indir" = "$outdir" ]; then
echo "## ERROR: Input dir should not be the same as output dir" >&2
exit 1
fi
# Output files:
R1_out="$outdir"/$(basename "$R1_in" .fastq.gz).fastq.gz
R2_out="$outdir"/$(basename "$R2_in" .fastq.gz).fastq.gz
# Report:
echo -e "\n## Starting cutadapt script."
date
echo
echo "## Using the following parameters:"
echo "## Input file R1: $R1_in"
echo "## Output dir: $outdir"
echo "## Forward primer: $primer_f"
echo "## Reverse primer: $primer_r"
echo
echo "## Input file R2: $R2_in"
echo "## Output file R1: $R1_out"
echo "## Output file R2: $R2_out"
echo
echo "## Reverse complement of forward primer: $primer_f_rc"
echo "## Reverse complement of reverse primer: $primer_r_rc"
echo
# Test:
if [ ! -f "$R1_in" ] || [ ! -r "$R1_in" ]; then
echo -e "\n## $0: ERROR: Input file R1_in, $R1_in, not found\n" >&2 && exit 1
fi
if [ ! -f "$R2_in" ] || [ ! -r "$R2_in" ]; then
echo -e "\n## $0: ERROR: Input file R2_in, $R2_in, not found\n" >&2 && exit 1
fi
# Create output directory if it doesn't already exist:
mkdir -p "$outdir"
# RUN CUTADAPT --------------------------------------------------------
echo -e "\n\n## Running cutadapt..."
cutadapt -a "$primer_f"..."$primer_r_rc" -A "$primer_r"..."$primer_f_rc" \
--discard-untrimmed --pair-filter=any \
-o "$R1_out" -p "$R2_out" "$R1_in" "$R2_in"
# Options:
# "-a"/"-A": Primers for R1/R2
# "--discard-untrimmed": Remove pairs with no primer found
# "--pair-filter=any": Remove pair if one read is filtered (=Default)
# REPORT AND FINALIZE --------------------------------------------------------
echo -e "\n## Listing output files:"
ls -lh "$outdir"
echo -e "\n## Done with cutadapt script."
date
This time, we will provide Snakemake with a YAML file describing a Conda environment for Cutadapt. Here are the contents of the YAML file envs/cutadapt.yaml:
channels:
- bioconda
dependencies:
- cutadapt=3.4
Add the rule cutadapt
with the contents as specified above.
Get Snakemake to run rule cutadapt
for at least one of the FASTQ files. Note that this will take a little while the first time around, since Snakemake will create a Conda environment with Cutadapt.
Check for the presence of output files and look at the log files.
As with rule fastqc, you’ll need to specify an individual output file.
$ snakemake --profile slurm_profile results/cutadapt/201_R1.fastq.gz
#> Building DAG of jobs...
#> Creating conda environment workflow/envs/cutadapt.yaml...
#> Downloading and installing remote packages.
#> [...]
$ ls -lh results/cutadapt/
#> -rw-r--r-- 1 jelmer PAS0471 6.0M Apr 21 11:25 201_R1.fastq.gz
#> -rw-r--r-- 1 jelmer PAS0471 7.1M Apr 21 11:25 201_R2.fastq.gz
$ less log/cutadapt_201.log
#> ## Starting cutadapt script.
#> Wed Apr 21 11:25:00 EDT 2021
#>
#> ## Using the following parameters:
#> ## Input file R1: data/201_R1.fastq.gz
#> ## Output dir: results/cutadapt
#> ## Forward primer: GAGTGYCAGCMGCCGCGGTAA
#> ## Reverse primer: ACGGACTACNVGGGTWTCTAAT
#>
#> ## Input file R2: data/201_R2.fastq.gz
#> ## Output file R1: results/cutadapt/201_R1.fastq.gz
#> ## Output file R2: results/cutadapt/201_R2.fastq.gz
#>
#> ## Reverse complement of forward primer: TTACCGCGGCKGCTGRCACTC
#> ## Reverse complement of reverse primer: ATTAGAWACCCBNGTAGTCCGT
#>
#>
#> ## Running cutadapt...
#> This is cutadapt 3.4 with Python 3.9.2
#> Command line parameters: -a GAGTGYCAGCMGCCGCGGTAA...ATTAGAWACCCBNGTAGTCCGT -A ACGGACTACNVGGGTWTCTAAT...TTACCGCGGCKGCTGRCACTC --discard-untrimmed --pair-filter=any -o results/cutadapt/201_R1.fastq.gz -p results/cutadapt/201_R2.fastq.gz data/201_R1.fastq.gz data/201_R2.fastq.gz
#> Processing reads on 1 core in paired-end mode ...
#> Finished in 4.84 s (103 µs/read; 0.58 M reads/minute).
#>
#> === Summary ===
#>
#> Total read pairs processed: 47,070
#> Read 1 with adapter: 47,061 (100.0%)
#> Read 2 with adapter: 46,675 (99.2%)
#> Pairs written (passing filters): 46,671 (99.2%)
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""
# We'll use the `join` function to create paths
from os.path import join
# Workflow configuration file:
"workflow/config.yaml"
configfile:
# Get settings from config file:
=config["in_dir"]
IN_DIR=config["out_dir"]
OUT_DIR=config["log_dir"]
LOG_DIR
=config["primer_f"]
PRIMER_F=config["primer_r"]
PRIMER_R
# Other constants, mostly based on config file:
# Samples and reads:
= glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
SAMPLES = ["R1", "R2"]
READS # All output dirs:
=join(OUT_DIR, "fastqc")
FASTQC_DIR=join(OUT_DIR, "multiqc")
MULTIQC_DIR=join(OUT_DIR, "cutadapt")
CUTADAPT_DIR=join(OUT_DIR, "fastq_filter")
FILTER_DIR=join(OUT_DIR, "ASV_infer")
ASV_DIR
# Define local rules:
all, clean
localrules:
# Rules:
all:
rule """
A pseudo-rule that triggers execution of the entire workflow.
It's input files include all final output files.
"""
input:
"multiqc_report.html"),
join(MULTIQC_DIR, "ASV_table.tsv")
join(ASV_DIR,
rule clean:"""
A rule to remove all results from the workflow.
"""
shell:"""
rm -rf results
rm -rf log/*
"""
rule fastqc:"""
A rule to run FastQC to QC the FASTQ files.
"""
input:
"{sample}_{read}.fastq.gz"),
join(IN_DIR,
output:"{sample}_{read}_fastqc.html"),
join(FASTQC_DIR,
log:"fastqc_{sample}_{read}.log"),
join(LOG_DIR,
shell:"""
module load fastqc/0.11.8
fastqc {input} -o {FASTQC_DIR} &> {log}
"""
rule multiqc:"""
A rule to run MultiQC to collate the FastQC results.
"""
input:
expand("{fastqc_dir}/{sample}_{read}_fastqc.html",
=FASTQC_DIR,
fastqc_dir=SAMPLES,
sample=READS,
read
),
output:"multiqc_report.html"),
join(MULTIQC_DIR,
log:"multiqc.log"),
join(LOG_DIR,
shell:"scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"
rule cutadapt:"""
A rule to run Cutadapt to remove primers from the FASTQ files.
"""
conda:"envs/cutadapt.yaml"
input:
=join(IN_DIR, "{sample}_R1.fastq.gz"),
fwd=join(IN_DIR, "{sample}_R2.fastq.gz"),
rev
output:=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
fwd=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
rev
log:"cutadapt_{sample}.log")
join(LOG_DIR,
shell:"scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"
We will run the script scripts/fastq_filter.R
to filter and trim our Cutadapt-processed FASTQ files (this will remove poor-quality reads and bases).
This script takes two arguments:
The name of the file with forward reads (in our case, as output by rule cutadapt
).
The output directory.
The script processes one pair of FASTQ files, like the Cutadapt rule. It outputs a pair of unzipped FASTQ files with the same names as the input files, just in a different directory (FILTER_DIR
, in our case) and with a .fastq
instead of a .fastq.gz
extension.
scripts/fastq_filter.R
#!/usr/bin/env Rscript
# This script will run the DADA2 function "filterAndTrim" to remove poor-quality
# reads and bases from FASTQ files by filtering and trimming them.
# The script has as input a pair (F and R reads for one sample) of,
# optionally gzipped, FASTQ files, and as output a pair of not-zipped FASTQ files.
# Set-up -----------------------------------------------------------------------
# Install and/or load packages:
if(!require(pacman)) {
  install.packages("pacman", repos = "http://cran.us.r-project.org")
  library(pacman)
}
packages <- c('BiocManager', 'dada2')
pacman::p_load(char = packages)

# Get command-line args:
args <- commandArgs(trailingOnly = TRUE)
R1_in <- as.character(args[1])
outdir <- as.character(args[2])

# Determine R2 file name - assumed to be same as R1 but with "R2" instead:
R2_in <- sub("_R1", "_R2", R1_in)

# Determine sample ID - file name up until "_R1":
sampleID <- sub("_R1.*", "", basename(R1_in))

# Set output file paths:
R1_out <- file.path(outdir, paste0(sampleID, '_R1.fastq'))
R2_out <- file.path(outdir, paste0(sampleID, '_R2.fastq'))
# Report:
cat("Output dir:", outdir, "\n\n")
cat("R1_in:", R1_in, "\n")
cat("R2_in:", R2_in, "\n")
cat("R1_out:", R1_out, "\n")
cat("R2_out:", R2_out, "\n")
# Filtering and Quality Trimming -----------------------------------------------
print('Filtering and Trimming...')
Sys.time() # Print the time to keep track of running time for individual steps
# Run the DADA2 function to filter and trim FASTQ files
filter_results <-
  filterAndTrim(
    R1_in, R1_out,          # Forward (F) reads input/output
    R2_in, R2_out,          # Reverse (R) reads input/output
    truncLen = c(250, 210), # Truncate F at 250 and R at 210 bp (remove last bases)
    trimLeft = 10,          # Remove first 10 bases
    maxN = 0,               # Don't allow for any Ns
    maxEE = c(2, 2),        # Max. expected nr of errors of 2 in F and 2 in R
    truncQ = 2,             # Truncate when base quality score is 2 or less
    rm.phix = FALSE,        # Don't remove Phi-X
    multithread = 1,        # No multithreading
    compress = FALSE,       # Don't compress output files
    verbose = TRUE          # Verbose runtime info
  )
print('...Done!')
Sys.time()
print('Done with script fastq_filter.R')
Write rule fastq_filter
.
Run rule fastq_filter
for one sample and check the output.
rule fastq_filter:"""
A rule to run an R script that contains DADA2 functions to filter and trim FASTQ files.
"""
input:
=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
fwd=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
rev
output:=join(FILTER_DIR, "{sample}_R1.fastq"),
fwd=join(FILTER_DIR, "{sample}_R2.fastq"),
rev
log:"fastq_filter_{sample}.log")
join(LOG_DIR,
shell:"""
module load R/4.0.2-gnu9.1
scripts/fastq_filter.R {input.fwd} {FILTER_DIR} &> {log}
"""
$ snakemake --profile slurm_profile \
results/fastq_filter/201_R1.fastq
$ ls -lh results/fastq_filter/
#> total 42M
#> -rw-r--r-- 1 jelmer PAS0471 23M Apr 21 11:47 201_R1.fastq
#> -rw-r--r-- 1 jelmer PAS0471 19M Apr 21 11:47 201_R2.fastq
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""
# We'll use the `join` function to create paths:
from os.path import join
# Workflow configuration file:
"workflow/config.yaml"
configfile:
# Get settings from config file:
=config["in_dir"]
IN_DIR=config["out_dir"]
OUT_DIR=config["log_dir"]
LOG_DIR
=config["primer_f"]
PRIMER_F=config["primer_r"]
PRIMER_R
# Other constants, mostly based on config file:
# Samples and reads:
= glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
SAMPLES = ["R1", "R2"]
READS # All output dirs:
=join(OUT_DIR, "fastqc")
FASTQC_DIR=join(OUT_DIR, "multiqc")
MULTIQC_DIR=join(OUT_DIR, "cutadapt")
CUTADAPT_DIR=join(OUT_DIR, "fastq_filter")
FILTER_DIR=join(OUT_DIR, "ASV_infer")
ASV_DIR
# Define local rules:
all, clean
localrules:
# Rules:
all:
rule """
A pseudo-rule that triggers execution of the entire workflow.
It's input files include all final output files.
"""
input:
"multiqc_report.html"),
join(MULTIQC_DIR, "ASV_table.tsv")
join(ASV_DIR,
rule clean:"""
A rule to remove all results from the workflow.
"""
shell:"""
rm -rf results
rm -rf log/*
"""
rule fastqc:"""
A rule to run FastQC to QC the FASTQ files.
"""
input:
"{sample}_{read}.fastq.gz"),
join(IN_DIR,
output:"{sample}_{read}_fastqc.html"),
join(FASTQC_DIR,
log:"fastqc_{sample}_{read}.log"),
join(LOG_DIR,
shell:"""
module load fastqc/0.11.8
fastqc {input} -o {FASTQC_DIR} &> {log}
"""
rule multiqc:"""
A rule to run MultiQC to collate the FastQC results.
"""
input:
expand("{fastqc_dir}/{sample}_{read}_fastqc.html",
=FASTQC_DIR,
fastqc_dir=SAMPLES,
sample=READS,
read
),
output:"multiqc_report.html"),
join(MULTIQC_DIR,
log:"multiqc.log"),
join(LOG_DIR,
shell:"scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"
rule cutadapt:"""
A rule to run Cutadapt to remove primers from the FASTQ files.
"""
conda:"envs/cutadapt.yaml"
input:
=join(IN_DIR, "{sample}_R1.fastq.gz"),
fwd=join(IN_DIR, "{sample}_R2.fastq.gz"),
rev
output:=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
fwd=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
rev
log:"cutadapt_{sample}.log")
join(LOG_DIR,
shell:"scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"
rule fastq_filter:"""
A rule to run an R script that contains DADA2 functions to filter and trim FASTQ files.
"""
input:
=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
fwd=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
rev
output:=join(FILTER_DIR, "{sample}_R1.fastq"),
fwd=join(FILTER_DIR, "{sample}_R2.fastq"),
rev
log:"fastq_filter_{sample}.log")
join(LOG_DIR,
shell:"""
module load R/4.0.2-gnu9.1
scripts/fastq_filter.R {input.fwd} {FILTER_DIR} &> {log}
"""
Finally, we will create a rule that will run the script scripts/ASV_infer.R
, which takes as its input all filtered FASTQ files at once, and outputs a file containing a matrix of counts of ASVs (Amplicon Sequence Variants) for each sample.
scripts/ASV_infer.R
#!/usr/bin/env Rscript
# This script will run several steps of the DADA2 pipeline to infer
# Amplicon Sequence Variants (ASVs) and their abundance in a set of samples.
# The input is a set of FASTQ files from multiple samples
# (as automatically detected given an input dir, which is one of the arguments),
# and the output is a single table with ASV counts per sample.
# Set-up ----------------------------------------------------------------------
args <- commandArgs(trailingOnly = TRUE)
indir <- as.character(args[1])    # Input dir. Input files will be detected.
outfile <- as.character(args[2])  # Output file: An ASV table.
n_cores <- as.integer(args[3])    # Number of cores to use.

# Install and/or load packages:
if(!require(pacman)) {
  install.packages("pacman", repos = "http://cran.us.r-project.org")
  library(pacman)
}
packages <- c('BiocManager', 'dada2')
pacman::p_load(char = packages)

# Detect input files given the input directory passed as an argument:
fastqs_filt_F <- sort(list.files(indir, pattern = '_R1.fastq', full.names = TRUE))
fastqs_filt_R <- sort(list.files(indir, pattern = '_R2.fastq', full.names = TRUE))

# Infer Sample IDs from FASTQ file names:
sampleIDs <- sub("_R1.*", "", basename(fastqs_filt_F))
# Report:
cat("Input dir:", indir, "\n")
cat("Output file:", outfile, "\n")
cat("Number of threads/cores:", n_cores, "\n")
cat("Sample IDs:", sampleIDs, "\n")
# Dereplication ----------------------------------------------------------------
print('Dereplicating FASTQs...')
Sys.time() # Keep track of time to see how long each step takes

fastqs_derep_F <- derepFastq(fastqs_filt_F, verbose = FALSE)
fastqs_derep_R <- derepFastq(fastqs_filt_R, verbose = FALSE)

names(fastqs_derep_F) <- sampleIDs
names(fastqs_derep_R) <- sampleIDs
print('...Done!')
Sys.time()
# Error training -------------------------------------------------------------
print('Learning errors...')
err_F <- learnErrors(fastqs_derep_F, multithread = n_cores, verbose = TRUE)
err_R <- learnErrors(fastqs_derep_R, multithread = n_cores, verbose = TRUE)
print('...Done!')
Sys.time()
# Infer ASVs -------------------------------------------------------------
print('Inferring ASVs (running the dada algorithm)...')
dada_Fs <- dada(fastqs_derep_F, err = err_F,
                pool = FALSE, multithread = n_cores)
dada_Rs <- dada(fastqs_derep_R, err = err_R,
                pool = FALSE, multithread = n_cores)
print('...Done!')
Sys.time()
# Merge read pairs ------------------------------------------------------------
print('Merging read pairs...')
mergers <- mergePairs(dada_Fs, fastqs_derep_F,
                      dada_Rs, fastqs_derep_R,
                      verbose = TRUE)
# Construct a sequence table --------------------------------------------------
print('Constructing a sequence table...')
seqtab_all <- makeSequenceTable(mergers)

# Remove chimeras ------------------------------------------------------------
print('Remove chimeras...')
seqtab <- removeBimeraDenovo(seqtab_all,
                             method = 'consensus',
                             multithread = n_cores,
                             verbose = TRUE)
# Write to file ----------------------------------------------------------------
print('Write table to file...')
write.table(seqtab, outfile,
            sep = "\t", row.names = TRUE, quote = FALSE)
print('Done with script ASV_infer.R')
For this rule, we will tinker a bit with the resource request to SLURM. This script is more computationally intensive than the previous ones, which makes sense because it processes all files at once.
Specifically, the DADA2 functions that we run in the script can take an argument multithread, specifying the number of cores/threads that can be used. Therefore, scripts/ASV_infer.R also takes such an argument, which is passed on to the DADA2 functions.
How can we make sure Snakemake allocates 8 cores/threads for this job?
By using a threads directive in the rule:
threads: 8
And how can we make sure Snakemake passes a request of 8 cores/threads to SLURM for this job?
By using a resources
directive in the rule, with a key cpus
that matches what we have specified in slurm_profile/config.yaml
:
resources: cpus=8
Here is the relevant matching part of slurm_profile/config.yaml
:
cluster: "sbatch --account={resources.account}
--time={resources.time_min}
--mem={resources.mem_mb}
--cpus-per-task={resources.cpus}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=4000, time_min=60, account=PAS1855]
Additionally, we will request an appropriate amount of memory: each core has about 4 GB of memory, so 8 × 4 = 32 GB makes sense here (our default request of 4,000 MB would otherwise apply, even with 8 cores). Our full resources directive will therefore look like this:
resources: cpus=8, mem_mb=32000
The script scripts/ASV_infer.R
takes three arguments in total in the following order: 1. An input directory – i.e., the output dir of the previous rule. 2. An output filename – you can tell from rule all
what that should be. 3. A number of cores/threads for the job – which you can pass using {threads}
.
Write rule ASV_infer
.
Run the rule.
Because this is the final rule in the workflow, whose output is specified as the input of rule all
, running Snakemake without arguments will do the job.
Note that the previous two rules will also automatically be run for the two remaining samples.
rule ASV_infer:"""
A rule to run an R script that contains DADA2 functions to infer ASVs,
and produce a matrix with ASV counts for each sample.
"""
8
threads: =8, mem_mb=32000
resources: cpusinput:
expand("{filter_dir}/{sample}_{read}.fastq",
=SAMPLES,
sample=READS,
read=FILTER_DIR,
filter_dir
),
output:"ASV_table.tsv")
join(ASV_DIR,
log:"ASV_infer.log")
join(LOG_DIR,
shell:"""
module load R/4.0.2-gnu9.1
scripts/ASV_infer.R {FILTER_DIR} {output} {threads} &> {log}
"""
$ snakemake --profile slurm_profile
$ ls -lh results/ASV_infer/
#> -rw-r--r-- 1 jelmer PAS0471 426K Apr 21 12:57 ASV_table.tsv
Let’s see an overview of all output files of the workflow:
$ tree results
#> results
#> ├── ASV_infer
#> │ └── ASV_table.tsv
#> ├── cutadapt
#> │ ├── 201_R1.fastq.gz
#> │ ├── 201_R2.fastq.gz
#> │ ├── 202_R1.fastq.gz
#> │ ├── 202_R2.fastq.gz
#> │ ├── 203_R1.fastq.gz
#> │ └── 203_R2.fastq.gz
#> ├── fastqc
#> │ ├── 201_R1_fastqc.html
#> │ ├── 201_R1_fastqc.zip
#> │ ├── 201_R2_fastqc.html
#> │ ├── 201_R2_fastqc.zip
#> │ ├── 202_R1_fastqc.html
#> │ ├── 202_R1_fastqc.zip
#> │ ├── 202_R2_fastqc.html
#> │ ├── 202_R2_fastqc.zip
#> │ ├── 203_R1_fastqc.html
#> │ ├── 203_R1_fastqc.zip
#> │ ├── 203_R2_fastqc.html
#> │ └── 203_R2_fastqc.zip
#> ├── fastq_filter
#> │ ├── 201_R1.fastq
#> │ ├── 201_R2.fastq
#> │ ├── 202_R1.fastq
#> │ ├── 202_R2.fastq
#> │ ├── 203_R1.fastq
#> │ └── 203_R2.fastq
#> └── multiqc
#> ├── multiqc_data
#> │ ├── multiqc_data.json
#> │ ├── multiqc_fastqc.txt
#> │ ├── multiqc_general_stats.txt
#> │ ├── multiqc.log
#> │ └── multiqc_sources.txt
#> └── multiqc_report.html
(You can also find the final workflow files at /fs/ess/PAS1855/exercises/week13/bonus/final_workflow
.)
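If you would like to recreate the DAG figure of the finished workflow (like the one referenced at the start of this exercise), Snakemake can print the graph in Graphviz format; assuming the dot program is available in your environment, something along these lines should work:
# Write the workflow DAG to an SVG file (requires Graphviz's `dot`):
snakemake --dag | dot -Tsvg > workflow_dag.svg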
"""
A workflow to perform QC and basic analyses on a set of FASTQ files from
metagenomic amplicon sequencing.
The following steps are performed:
1. FastQC on the raw FASTQ files
2. MultiQC on the output of FastQC
3. Cutadapt to remove primers on the raw FASTQ files
4. DADA2 FASTQ filtering
5. DADA2 ASV inference, producing a single ASV output table
"""
# We'll use the `join` function to create paths:
from os.path import join
# Workflow configuration file:
"workflow/config.yaml"
configfile:
# Get settings from config file:
=config["in_dir"]
IN_DIR=config["out_dir"]
OUT_DIR=config["log_dir"]
LOG_DIR
=config["primer_f"]
PRIMER_F=config["primer_r"]
PRIMER_R
# Other constants, mostly based on config file:
# Samples and reads:
= glob_wildcards(join(IN_DIR, "{sample}_R1.fastq.gz")).sample
SAMPLES = ["R1", "R2"]
READS # All output dirs:
=join(OUT_DIR, "fastqc")
FASTQC_DIR=join(OUT_DIR, "multiqc")
MULTIQC_DIR=join(OUT_DIR, "cutadapt")
CUTADAPT_DIR=join(OUT_DIR, "fastq_filter")
FILTER_DIR=join(OUT_DIR, "ASV_infer")
ASV_DIR
# Define local rules:
all, clean
localrules:
# Rules:
all:
rule """
A pseudo-rule that triggers execution of the entire workflow.
It's input files include all final output files.
"""
input:
"multiqc_report.html"),
join(MULTIQC_DIR, "ASV_table.tsv")
join(ASV_DIR,
rule clean:"""
A rule to remove all results from the workflow.
"""
shell:"""
rm -rf results
rm -rf log/*
"""
rule fastqc:"""
A rule to run FastQC to QC the FASTQ files.
"""
input:
"{sample}_{read}.fastq.gz"),
join(IN_DIR,
output:"{sample}_{read}_fastqc.html"),
join(FASTQC_DIR,
log:"fastqc_{sample}_{read}.log"),
join(LOG_DIR,
shell:"""
module load fastqc/0.11.8
fastqc {input} -o {FASTQC_DIR} &> {log}
"""
rule multiqc:"""
A rule to run MultiQC to collate the FastQC results.
"""
input:
expand("{fastqc_dir}/{sample}_{read}_fastqc.html",
=FASTQC_DIR,
fastqc_dir=SAMPLES,
sample=READS,
read
),
output:"multiqc_report.html"),
join(MULTIQC_DIR,
log:"multiqc.log"),
join(LOG_DIR,
shell:"scripts/multiqc.sh {FASTQC_DIR} {MULTIQC_DIR} &> {log}"
rule cutadapt:"""
A rule to run Cutadapt to remove primers from the FASTQ files.
"""
conda:"envs/cutadapt.yaml"
input:
=join(IN_DIR, "{sample}_R1.fastq.gz"),
fwd=join(IN_DIR, "{sample}_R2.fastq.gz"),
rev
output:=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
fwd=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
rev
log:"cutadapt_{sample}.log")
join(LOG_DIR,
shell:"scripts/cutadapt.sh {input.fwd} {CUTADAPT_DIR} {PRIMER_F} {PRIMER_R} &> {log}"
rule fastq_filter:"""
A rule to run an R script that contains DADA2 functions to filter and trim FASTQ files.
"""
input:
=join(CUTADAPT_DIR, "{sample}_R1.fastq.gz"),
fwd=join(CUTADAPT_DIR, "{sample}_R2.fastq.gz"),
rev
output:=join(FILTER_DIR, "{sample}_R1.fastq"),
fwd=join(FILTER_DIR, "{sample}_R2.fastq"),
rev
log:"fastq_filter_{sample}.log")
join(LOG_DIR,
shell:"""
module load R/4.0.2-gnu9.1
scripts/fastq_filter.R {input.fwd} {FILTER_DIR} &> {log}
"""
rule ASV_infer:"""
A rule to run an R script that contains DADA2 functions to infer ASVs,
and produce a matrix with ASV counts for each sample.
"""
8
threads: =8, mem_mb=32000
resources: cpusinput:
expand("{filter_dir}/{sample}_{read}.fastq",
=SAMPLES,
sample=READS,
read=FILTER_DIR,
filter_dir
),
output:"ASV_table.tsv")
join(ASV_DIR,
log:"ASV_infer.log")
join(LOG_DIR,
shell:"""
module load R/4.0.2-gnu9.1
scripts/ASV_infer.R {FILTER_DIR} {output} {threads} &> {log}
"""
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".