Exercises: Week 13 – Snakemake


In these exercises, you’ll take up where we left off in class, using a dummy RNA-seq workflow.


1: Expanding and improving the workflow

1. Set-up

Load the Conda environment in which you have Snakemake installed:

$ module load python/3.6-conda5.2
$ source activate ipy-env

# Check if Snakemake can be found and is inside your ipy-env:
$ which snakemake
#> ~/.conda/envs/ipy-env/bin/snakemake

We will recreate a very similar set of files and the final Snakefile from class – copy the files from /fs/ess/PAS1855/exercises/week13/ to do so:

$ cd /fs/ess/PAS1855/users/$USER/week13
$ cp -r /fs/ess/PAS1855/exercises/week13/exercise1 .

$ cd exercise1
$ tree
#> .
#> ├── data
#> │   ├── smpA.fastq
#> │   ├── smpC.fastq
#> │   └── smpG.fastq
#> ├── res
#> ├── scripts
#> │   ├── count.sh
#> │   ├── map.sh
#> │   └── trim.sh
#> └── workflow
#>     ├── DAGs
#>     └── Snakefile

Currently, our Snakefile (workflow/Snakefile) contains the following:

"""
A dummy RNAseq workflow with FASTQ files as input and a count table as output.  
"""

SAMPLES = glob_wildcards("data/{smp}.fastq").smp

rule all:
    """
    This is a "pseudo-rule" meant to trigger the execution of the full workflow. 
    """
    input:
        "res/count_table.txt",

rule trim:
    """
    Trim the FASTQ files.
    """
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    shell:
        "scripts/trim.sh {input} > {output}"

rule map:
    """
    Map the trimmed FASTQ files to a reference genome.
    """
    input:
        "res/{smp}_trim.fastq",
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input} > {output}"

rule count:
    """
    From the BAM files, create a gene count table.
    """
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    shell:
        "scripts/count.sh {input} > {output}"

Note that some formatting details in the Snakefile above are slightly different from what was shown in class. I used the Snakefmt VS Code extension, which lets a little program of the same name format Snakefiles for you.1

rule count:
    """
    From the BAM files, create a gene count table.
    """
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    shell:
        "scripts/count.sh {input} > {output}"

Snakefmt detects syntax errors and ensures consistent, clear formatting, as shown for rule count above.

Furthermore, I have added some Python multi-line (triple-quote) comments explaining what the Snakefile and each individual rule does.


2. First DAG and Snakemake run

Solution

snakemake --dag | dot -T svg > workflow/DAGs/v1.svg 

Solution

Below, Snakemake is being run with the -p (print shell commands) and -q (quiet) options for concise output. To better understand what Snakemake is doing, you may want to see more output yourself, e.g. with the -pr combination of options (-r: reason for execution of each job).

Also, recall that you can always do a “dry run” with the -n option to see what Snakemake wants to execute.
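For example, a dry run that also prints the shell commands that would be executed could look like this:

snakemake -np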

snakemake -j1 -pq
#> Job counts:
#>         count   jobs
#>         1       all
#>         1       count
#>         3       map
#>         3       trim
#>         8
#> scripts/trim.sh data/smpG.fastq > res/smpG_trim.fastq
#> scripts/map.sh res/smpG_trim.fastq > res/smpG.bam
#> scripts/trim.sh data/smpC.fastq > res/smpC_trim.fastq
#> scripts/trim.sh data/smpA.fastq > res/smpA_trim.fastq
#> scripts/map.sh res/smpA_trim.fastq > res/smpA.bam
#> scripts/map.sh res/smpC_trim.fastq > res/smpC.bam
#> scripts/count.sh res/smpG.bam res/smpC.bam res/smpA.bam > res/count_table.txt


3. Add an input file to the map rule

Currently, our map and count rules don’t include any genomic reference files: mapping should happen against a reference genome (FASTA file), and counting needs an annotation for the reference genome (GTF/GFF file).

For now, we will fix this oversight just for the mapping rule. Let’s create a dummy genome FASTA file:

mkdir metadata
echo "AAAAAAACCCCCCCGGGGGGG" > metadata/ref.fa

We’ll add a line to the map.sh script, such that the second argument passed to it is now the FASTA file with the reference genome:

echo 'echo "Using ref genome: $2" && cat $2' >> scripts/map.sh
Hints

To do so, you’ll need to do the following in the map rule: add the reference FASTA file as a second, named input, and pass both inputs to the script in the shell directive, in the right order.

Moreover, it is good practice to define hard-coded input files like this one clearly at the top of your script/Snakefile (or use a configuration file).

Solution
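A sketch of what the updated rule map could look like, with the reference FASTA defined near the top of the Snakefile and passed as a second, named input (the same structure appears in the full Snakefile later in this exercise):

# Hard-coded input file, defined near the top of the Snakefile:
REF_FA = "metadata/ref.fa"

rule map:
    """
    Map the trimmed FASTQ files to a reference genome.
    """
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} > {output}"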

Solution


Solution


snakemake -j1 -pq

#> Job counts:
#>         count   jobs
#>         1       all
#>         1       count
#>         3       map
#>         5
#> scripts/map.sh res/smpA_trim.fastq metadata/ref.fa > res/smpA.bam
#> scripts/map.sh res/smpG_trim.fastq metadata/ref.fa > res/smpG.bam
#> scripts/map.sh res/smpC_trim.fastq metadata/ref.fa > res/smpC.bam
#> scripts/count.sh res/smpC.bam res/smpA.bam res/smpG.bam > res/count_table.txt


4. Add a FastQC rule

As you may recall, FastQC performs quality control of FASTQ files (see these slides from week 6). FastQC should be run on the raw and/or trimmed FASTQ files. If you want, you can implement both, but the solution below will do so just for the raw FASTQ files.

#!/bin/bash
set -e -u -o pipefail

infile=$1
outdir=$2

outfile="$outdir"/$(basename "$infile" .fastq).fastqc.html

echo "Running FastQC for file: $infile , to output dir: $outdir"

echo "FastQC results for file $infile" > "$outfile"
cat "$infile" > "$outfile"
Solution

chmod u+x scripts/fastqc.sh

Hints

Solution

rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    shell:
        "scripts/fastqc.sh {input} res"


5. Run the FastQC rule

Solution

It will not run anything:

snakemake -j1 -pr
#> Building DAG of jobs...
#> Nothing to be done.

This is because the FastQC output files are not being used by any other rule, nor are they listed in rule all.


Solution

You will get a WorkflowError:

snakemake -j1 -pr fastqc
#> WorkflowError:
#> Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.

The target rule referenced in the error is fastqc, which we targeted in the command-line call.

The problem is that rule fastqc contains wildcards and Snakemake can’t resolve what their values should be. This may be surprising but recall that Snakemake works backwards in the sense that it looks at output files first (rather than first inferring input files from existing files), and here, it does not know which output files to produce.


Solution

You could specify a specific output file for one of the samples, e.g. res/smpA.fastqc.html:

snakemake -j1 -pq res/smpA.fastqc.html
#> Job counts:
#>         count   jobs
#>         1       fastqc
#>         1
#> scripts/fastqc.sh data/smpA.fastq res
#> Running FastQC for file: data/smpA.fastq , to output dir: res

Note above that the line the script printed to standard output (Running FastQC for file: $infile , to output dir: $outdir) is indeed simply printed to standard out among the other output of Snakemake.


Hints

You should include the output files of the FastQC script in the input of rule all, using the expand() function.

Solution
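A sketch of what rule all could then look like, with the FastQC reports added via expand() (this matches the full Snakefile shown in the next part):

rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES),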


6. When we get errors

Solution


The actual error produced by the bash script is what you would likely be interested in if troubleshooting:

scripts/map.sh: line 6: $2: unbound variable.

However, this was buried among all the other Snakemake output. Earlier, we saw that a line of standard output was similarly printed to screen among the Snakemake output. And the same would be true for logging information and errors produced by programs like the actual FastQC.

This is not optimal when errors do occur – in the next section, we will therefore use log files.

Solution

Nope! These were removed by Snakemake because of the errors. You may have noticed this in the error messages above, where it said, for example:

#> Removing output files of failed job map since they might be corrupted:
#> res/smpG.bam


7. The log directive

It would be better to save logging and error output to “log” files for each rule – or each job within a rule, more precisely.

For example, for a script or program that writes its logging and error messages to standard error, we may be inclined to simply redirect that output to a file ourselves (e.g. with 2> somefile.log).

But should we also tell Snakemake explicitly about such a file, like we do for input and output files? Yes, but it’s best to use a separate directive for this: log.

This is convenient because Snakemake treats log files differently than regular output files: if a job fails, Snakemake will not delete the log files. As we saw above, Snakemake does delete regular output files by jobs that produced errors, so if we were to designate logging output as regular output, we would not be able to examine our errors.

Example usage of a log key – note that we can use any wildcards that are also used in {output}:

rule myscript:
    #[...]
    output: "res/{sample}.out"
    log: "log/myscript_{sample}.log"
    shell: "myscript.py -i {input} > {output} 2> {log}"


Solution

SAMPLES = glob_wildcards("data/{smp}.fastq").smp

REF_FA = "metadata/ref.fa"

rule all:
    input:
        "res/count_table.txt",
         expand("res/{smp}.fastqc.html", smp=SAMPLES)


rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"


rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"


rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"
rm res/*

snakemake -j1 -pq

#> Job counts:
#>         count   jobs
#>         1       all
#>         1       count
#>         3       fastqc
#>         3       map
#>         3       trim
#>         11
#> scripts/fastqc.sh data/smpG.fastq res &>log/fastqc/smpG.log
#> scripts/fastqc.sh data/smpA.fastq res &>log/fastqc/smpA.log
#> scripts/trim.sh data/smpG.fastq >res/smpG_trim.fastq 2>log/trim/smpG.log
#> scripts/trim.sh data/smpC.fastq >res/smpC_trim.fastq 2>log/trim/smpC.log
#> scripts/trim.sh data/smpA.fastq >res/smpA_trim.fastq 2>log/trim/smpA.log
#> scripts/fastqc.sh data/smpC.fastq res &>log/fastqc/smpC.log
#> scripts/map.sh res/smpA_trim.fastq metadata/ref.fa >res/smpA.bam 2>log/map/smpA.log
#> scripts/map.sh res/smpG_trim.fastq metadata/ref.fa >res/smpG.bam 2>log/map/smpG.log
#> scripts/map.sh res/smpC_trim.fastq metadata/ref.fa >res/smpC.bam 2>log/map/smpC.log
#> scripts/count.sh res/smpG.bam res/smpC.bam res/smpA.bam >res/count_table.txt 2>log/count/count.log

tree log

#> log
#> ├── count
#> │   └── count.log
#> ├── fastqc
#> │   ├── smpA.log
#> │   ├── smpC.log
#> │   └── smpG.log
#> ├── map
#> │   ├── smpA.log
#> │   ├── smpC.log
#> │   └── smpG.log
#> └── trim
#>     ├── smpA.log
#>     ├── smpC.log
#>     └── smpG.log
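With these log files in place, troubleshooting becomes much easier: if, say, a map job for sample smpA fails, you could inspect its dedicated log file directly, for example:

cat log/map/smpA.log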


2: Running the workflow with SLURM jobs

1. Introduction

Since we are running dummy scripts, it has been fine to run Snakemake on a login node at OSC. For any real workflow, that would not be appropriate.

You have two main options: (1) submit your entire Snakemake run as a single SLURM batch job, or (2) let Snakemake submit a separate SLURM job for each of its own jobs. We'll try both below.

2. Submit Snakemake as a single SLURM job

Solution

#!/bin/bash

#SBATCH --account=PAS1855
#SBATCH --time=60
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1

set -e -u -o pipefail

snakemake -j1 -p


So far, we have been using -j1 (i.e. --jobs 1), such that Snakemake will only run one process at a time. Several of our rules run the same script separately for each sample, so those jobs could be run at the same time (in parallel).

Hints

For the last question: you should use a SLURM environment variable. The environment variable in question is $SLURM_NTASKS.

Solution
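A sketch of how the submission script could be adjusted, assuming we request multiple SLURM tasks (the value 3 below is just an example) and pass the $SLURM_NTASKS environment variable on to Snakemake's -j option:

#!/bin/bash

#SBATCH --account=PAS1855
#SBATCH --time=60
#SBATCH --ntasks=3
#SBATCH --cpus-per-task=1

set -e -u -o pipefail

# Run as many Snakemake jobs in parallel as we have SLURM tasks:
snakemake -j "$SLURM_NTASKS" -p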


Solution

rm -r res log/*

sbatch workflow/snakemake_submit.sh


3. Let Snakemake submit jobs for us

At its most basic, all you need to do is add --cluster sbatch to the Snakemake command, and Snakemake will automatically submit a SLURM job for each of its jobs. But recall that at OSC, we also always need to specify an account.

So, we could use the following to have Snakemake submit at most 6 jobs at a time:

snakemake -j6 --cluster "sbatch --account=PAS1855"


4. Use a “profile”

For simple and quick workflows, providing SLURM options on the command line like we did above is reasonable, but if we have several non-default options, this will start to become a painfully long command.

For instance, you may note that the SLURM log files have currently all ended up in your project’s root dir, where you probably don’t want them.

We can also specify these options in a configuration file using a “profile”. This will additionally make it easier to customize SLURM options on a per-rule basis.

To use a profile, we should first create a directory. The name of this directory is arbitrary, but since we are using it in the context of providing settings for SLURM jobs, something like slurm_profile would make sense:

mkdir slurm_profile

Inside this directory, Snakemake will expect a file called config.yaml:

touch slurm_profile/config.yaml

In config.yaml, we can provide a string for the cluster key, much like we did previously with the --cluster argument on the command-line:

cluster: "sbatch --account={resources.account}
                 --mem={resources.mem_mb}
                 --output=log/slurm-{rule}_{wildcards}.out"

default-resources: [mem_mb=4000, account=PAS1855]

Above, we use {resources.<resource-name>} instead of actual values in the cluster key. Then, the values for each <resource-name> are specified for the default-resources key. This setup is convenient because it allows us to also refer to the same resources in the Snakefile to set rule-specific values, as we’ll see a little later.

Solution

Contents of slurm_profile/config.yaml:

cluster: "sbatch --account={resources.account}
                 --time={resources.time_min}
                 --mem={resources.mem_mb}
                 --cpus-per-task={resources.cpus}
                 --output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]


To run Snakemake with the profile, we need to use the option --profile, and specify our profile dir: --profile <profile_dir>.

Solution

rm -r res log/*

snakemake -j6 --profile slurm_profile


Above, we also specified an output dir and names for our SLURM log files. If this directory does not already exist, make sure you create it manually, because Snakemake will not create it, and mysterious failures will occur!

In other words, Snakemake will create directories specified in the Snakefile as output, but won’t create directories that are only mentioned in this config file.
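For instance, to make sure the log/ directory referenced in the cluster string above exists before you run Snakemake with this profile, you could run:

mkdir -p log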


5. Add options to the profile config file

config.yaml can contain not just cluster settings, but anything that can be set with command-line options.

We can take that opportunity to also increase the maximum number of jobs that Snakemake may run concurrently, and increase the time Snakemake waits for output files to appear (useful on networked filesystems with some latency).

Hints

Note that in YAML, key-value pairs are specified as key: value, without the leading -- used for the corresponding command-line options.

Solution

jobs: 100
latency-wait: 30

The full slurm_profile/config.yaml should now contain the following:

cluster: "sbatch --account={resources.account}
                 --time={resources.time_min}
                 --mem={resources.mem_mb}
                 --cpus-per-task={resources.cpus}
                 --output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]
jobs: 100
latency-wait: 30


Pay attention to the job submission pattern of Snakemake. You can also open up another terminal and monitor your jobs:

squeue -u $USER -l
Solution

rm -r res log/*

snakemake --profile slurm_profile


6. Local rules

In practice, you may not want to submit a cluster job for every rule.

For instance, rule all is a Snakemake job, but it doesn’t actually run anything the way we have set it up. Additionally, you may have some very lightweight cleaning/logging rules.

To tell Snakemake that certain rules should not be submitted to the cluster, include a comma-separated list of rules near the top of your Snakefile with the localrules key:

localrules: all, clean

Solution

Your full Snakefile should now contain the following:

SAMPLES = glob_wildcards("data/{smp}.fastq").smp

REF_FA = "metadata/ref.fa"

localrules: all

rule all:
    input:
        "res/count_table.txt",
         expand("res/{smp}.fastqc.html", smp=SAMPLES)


rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"


rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"


rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"
rm -r res log/*

snakemake --profile slurm_profile


7. Rule-specific resources: threads

Because the number of threads is a parameter that commonly differs between rules, Snakemake has a threads directive that we can add to any rule. This will tell Snakemake how many threads it should use for any single job that the rule in question generates.

Hints

It is either --cpus-per-task or --ntasks.

Solution

The threads directive in a Snakefile corresponds to the --cpus-per-task SLURM directive. (Recall that SLURM uses the terms CPU, cores and threads roughly equivalently.)

When specifying multiple --cpus-per-task (SLURM) / threads (Snakemake), you are allowing a single process (task) to perform multithreading.


When using the threads directive, we will generally also tell the program itself about the number of threads – see the example below, where we are calling the STAR aligner directly in a Snakemake rule:

rule STAR:
    threads: 8
    shell: "STAR --runThreadN {threads} ..."

For now, we’ll keep it simple and merely use the threads directive:

Solution

Rule map should now contain the following:

rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    threads: 4
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


8. Rule-specific resources: other

We can add a resources directive to any rule to specify (mostly) arbitrary key-value pairs:

rule heavy_stuff:
    # [...]
    resources: mem_mb=50000
    # [...]

These are arbitrary in the sense that mem_mb will not directly set the actual maximum memory usage in the way that the threads directive set the number of threads. Instead, these keys only take effect because we refer to them in our profile’s config.yaml.

For example, to actually request 50,000 MB of memory for rule heavy_stuff, mem_mb should correspond to the term used in the SLURM resource request in config.yaml:

cluster: "sbatch --mem={resources.mem_mb} ..."
default-resources: [mem_mb=4000, ...]

Setting resources: mem_mb=50000 for rule heavy_stuff will then override the default value of 4000 and pass that on to the SLURM job request!

Solution

Assuming time is referred to as time_min in slurm_profile/config.yaml (this has to match!), rule map should now contain the following:

rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    threads: 8
    resources: time_min=120
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


Solution

There are a few different options:


  1. Note that it doesn’t seem to be possible to install this extension in VS Code Server at OSC, but you can install it locally.↩︎

  2. This may look a little strange (but think 1-element tuple!), yet is somewhat useful: when every line has a trailing comma, you can always add and remove lines without paying attention to whether a new item follows.↩︎
