Snakemake

Command-line interface

Option Meaning Example
-j / --jobs / --cores Mandatory option: the maximum number of jobs to run in parallel. At OSC, this is the maximum number of SLURM jobs that will be submitted; when running locally, it should not exceed the number of available cores.

snakemake -j1

snakemake -j100

-n / --dryrun Don’t run anything, just report what would be run.
-p / --printshellcmds Print commands from shell directives that will be executed.
-r / --reason Print the reason for execution of every job.
-q / --quiet Print less output to screen; can be useful in combination with -n just to get an overview of the jobs that will be run.
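For example, to preview a workflow by combining a dry run with printed shell commands and reasons:
snakemake -n -p -r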
-s / --snakefile Name of / path to the Snakefile. Run the Snakefile rules.smk:
snakemake -s rules.smk
--use-conda In combination with conda directive(s) in the Snakefile, will run jobs in a Conda environment.
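For example, assuming the rules in the Snakefile have conda directives:
snakemake -j1 --use-conda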
--cluster

Essentially a prefix that is added to each job's command in order to submit it to a cluster.

At OSC, this will need to be at a minimum sbatch --account=PAS1855, and any number of SLURM options can be added.

With additional non-default SLURM options, it becomes more practical to use a profile (see below).

Have at most 50 jobs in the queue:
snakemake -j50 --cluster \
    "sbatch --account=PAS1855"

Every SLURM job should have a time limit of 10 minutes:
snakemake -j50 --cluster \
    "sbatch --account=PAS1855 --time=10"

--lint Run the Snakemake “linter” on the Snakefile: check the syntax, some best practices, and so on.
-f / --force <target> Force creation of a specified target file or, if none is specified, the first rule.
Force-run whatever is needed to produce smpA.bam:
snakemake -f smpA.bam
Force-run the first rule:
snakemake -f
-F / --forceall <rule> Force running a specified rule and all dependencies of that rule or, if nothing is specified, the first rule.
Force-run “rule map”:
snakemake -F map
Force-run the workflow if first rule is “rule all”:
snakemake -F
-R / --forcerun

Force creation of a list of target files.

Useful in combination with the --list-code-changes option, which will output a list of output files affected by changes in the Snakefile.

The command substitution will pass all relevant output files to the -R option:
snakemake -j1 -R \
    $(snakemake --list-code-changes)
--report Create an HTML report with runtime statistics and an overview of the workflow (run this after the workflow has finished).
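For example, after the workflow has finished running:
snakemake --report report.html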
--dag Create a “Directed Acyclic Graph” (DAG) of all jobs in the workflow.
snakemake --dag | \
    dot -T svg > jobs.svg
--rulegraph Create a “Directed Acyclic Graph” (DAG) of all rules in the workflow (better for larger workflows).
snakemake --rulegraph | \
    dot -T svg > rules.svg

Rules

Rule directives

Directive Expected values Examples
input

Input file(s) for the rule.

When using a wildcard {}, multiple jobs are run for the rule, each with one of the possible wildcard values.

If you need to run all wildcard values (samples/files) at once, use expand().

input: "ref.fa"
With wildcard:
input: "{smp}.fq"
Inputs can be named:
input: fq="my.fq", ref="ref.fa"
Use the expand() function:
input: expand("{smp}.fq", smp=SAMPLES)
output Like input, but specifies output files: files produced by the rule. If using wildcards, output should have the same wildcard(s) as input.
output: "{smp}.bam"
log

Like output but meant for logging and error output: log files are not deleted if jobs fail.

Recall that 2> redirects standard error whereas &> redirects standard out and standard error.

Wildcards can be used:
log: "log/map_{sample}.log"

Using log files in actions:
map.sh {input.fq} {input.ref} >{output} 2>{log}
fastqc.sh {input} res &>{log}

shell

Run any arbitrary shell command: e.g. calling an external program or script.

Other “action directives” that we did not use in this course include run for Python code entered directly in the Snakefile.

shell: "map.sh {input.ref} {input.fq} > {output}"

shell: "cat {smp}.txt | tr a-z A-Z > {smp}.out"

shell: "fastqc {input}"

params Can be used to clearly and separately indicate certain variables/parameters that are used in the action.

params: minqual="30", minlen="100"

params: indir="res"
shell: "my.sh {params.indir} {wildcards.smp}"

threads The number of CPUs/cores/threads to be used by a single job; corresponds to SLURM’s --cpus-per-task option.
rule STAR:
    ...
    threads: 8
    shell: "STAR --runThreadN {threads} ..."
resources Mostly arbitrary key-value pairs passed with resources; the keys should match those specified in a profile’s config.yaml file (see below), so that the appropriate resources are requested for the SLURM job.
resources: mem_mb=50000
conda

Used to specify a YAML file with a Conda environment description. Snakemake will perform the one-time Conda installation and use the resulting Conda environment when running the rule.

Note: this also requires the --use-conda option to be specified when running Snakemake!

conda: "envs/fastqc.yaml"

An example YAML file:

channels:
  - bioconda
dependencies:
  - fastqc=0.11.8


Placeholders and wildcards

Note that wildcards operate entirely within a single rule and not across rules! That is, even though it often makes sense to use the exact same wildcard across multiple rules, Snakemake will resolve them separately for each rule.
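For instance, in the sketch below (mirroring the worked examples at the bottom of this page), the {smp} wildcard in rule trim and the one in rule map are matched independently, even though they end up taking the same values:

rule trim:
    input: "data/{smp}.fastq"
    output: "res/{smp}_trim.fastq"
    shell: "scripts/trim.sh {input} > {output}"

rule map:
    input: "res/{smp}_trim.fastq"
    output: "res/{smp}.bam"
    shell: "scripts/map.sh {input} > {output}"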

Placeholder Explanation Examples
{input} Refers to the file(s) specified in the input directive – to be used in “action directives” such as shell. (If input has a wildcard, one file is passed for each job, i.e. each iteration of the rule.)

shell: "trim.sh {input} > {output}"

Named input:
shell: "map.sh {input.ref} {input.fq} > {output}"

{output} Refers to the file(s) specified in the output directive – to be used in “action directives” such as shell.
shell: "trim.sh {input} > {output}"
{…}

A wildcard, which can be given any name.

Used as-is in input, output, and log directives, for which Snakemake resolves the values of the wildcards by looking for requested output files it can produce.

To use a wildcard in a shell directive, assuming the wildcard is called sample, use: {wildcards.sample}.

input: "{sample}.fq"
output: "{sample}.bam"

To use a wildcard in an action like a shell directive:
shell: my.sh -i {wildcards.sample} > {output}


Functions

Snakemake provides a few convenience functions, most notably expand() and glob_wildcards(). Note that you can interactively test these functions in Python after importing them using:

# Just for interactive testing -- does not need to be done in a Snakefile:
from snakemake.io import expand, glob_wildcards

expand()

expand() is a more succinct alternative to a list comprehension, which will replace one or more placeholders {} with all possible values from a list. If multiple lists are provided, all combinations of these lists will be generated – see the second example below:

# Example with a single list "SAMPLES":
SAMPLES = ["sampleA", "sampleB", "sampleC"]
expand("res/{sample}.bam", sample=SAMPLES)
#> ['res/sampleA.bam', 'res/sampleB.bam', 'res/sampleC.bam']

# Example with two lists, "SAMPLES" and "READS":
SAMPLES = ["sampleA", "sampleB", "sampleC"]
READS = ["R1", "R2"]
expand("{sample}_{read}.fastq.gz", sample=SAMPLES, read=READS)
#> ['sampleA_R1.fastq.gz', 'sampleA_R2.fastq.gz',
#>  'sampleB_R1.fastq.gz', 'sampleB_R2.fastq.gz',
#>  'sampleC_R1.fastq.gz', 'sampleC_R2.fastq.gz']
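For comparison, the first example written as an equivalent list comprehension:

[f"res/{sample}.bam" for sample in SAMPLES]
#> ['res/sampleA.bam', 'res/sampleB.bam', 'res/sampleC.bam']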

glob_wildcards()

glob_wildcards() will perform shell-like globbing, i.e. search for existing files, and store one or more sets of values (usually sample names) that are extracted from the file names (akin to regex backreferences):

# Typical usage at the top of a Snakefile:
SAMPLES = glob_wildcards("data/{sample}.fastq").sample

# Or equivalently, with a trailing comma rather than trailing `.<wildcard-name>`:
SAMPLES, = glob_wildcards("data/{sample}.fastq")

# With two wildcards:
SAMPLES, READS = glob_wildcards("data/{sample}_{read}.fastq")

# Checking how it works in IPython -- with one wildcard:
!ls data/
#> sampleA.fastq sampleB.fastq sampleC.fastq

glob_wildcards("data/{sample}.fastq").sample
#> ['sampleA', 'sampleB', 'sampleC']

# Checking how it works in IPython -- with two wildcards:
!ls data/
#> A_R1.fastq.gz A_R2.fastq.gz B_R1.fastq.gz B_R2.fastq.gz

glob_wildcards("data/{sample}_{read}.fastq.gz").sample
#> ['A', 'A', 'B', 'B']
glob_wildcards("data/{sample}_{read}.fastq.gz").read
#> ['R1', 'R2', 'R1', 'R2']
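Note that values are returned once per matching file, so with multiple wildcards, sample names can appear several times. To get unique sample names, wrap the result in set(), e.g.:

sorted(set(glob_wildcards("data/{sample}_{read}.fastq.gz").sample))
#> ['A', 'B']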


Misc. tips & tricks

Configuration file

To avoid hardcoding certain run-specific variables in the Snakefile (sample names, output directories, parameters for software, and so on), you can use a YAML- or JSON-formatted configuration file and include a configfile directive near the top of the Snakefile:

# Include this line in the Snakefile to read the file "config.yml":
configfile: "config.yml"

# Now, the contents of "config.yml" are available in a dictionary named "config":
OUT_DIR = config["output_dir"]
# Say "config.yml" just contains the following line:
output_dir: path/to/output/
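Such a config value can then be used anywhere in the Snakefile; a minimal sketch with a hypothetical rule:

# Direct a rule's output to the configured output dir:
rule example:
    input: "in.txt"
    output: OUT_DIR + "result.txt"
    shell: "my_command {input} > {output}"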

Profile dir with SLURM and other settings

A configuration file (or set of files) can also be used to pass options to a Snakemake call. This is particularly handy with SLURM options, since a long command like snakemake -j100 --cluster "sbatch --account=PAS1855 --time=12:00:00 --mem=12G" is not very practical to type every time you run Snakemake.

  1. Add a file called config.yaml in a directory with an arbitrary name; something like slurm_profile makes sense.

  2. In this file, specify options that can also be passed to snakemake on the command line, e.g.:

    # Just beware that options are followed by a colon ":" in a YAML file:
    
    cluster: "sbatch --account={resources.account}
                     --time={resources.time_min}
                     --mem={resources.mem_mb}
                     --cpus-per-task={resources.cpus}
                     --output=log/slurm-{rule}_{wildcards}.out"
    default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]
    
    # You can also specify other options than cluster-specific ones:
    jobs: 100
    latency-wait: 30
    use-conda: true     # "--use-conda" at the command line becomes this
  3. Then, use the --profile option and specify the name of the directory containing the config.yaml file:

    # (With the above config.yaml, we no longer need to add the -j option:)
    snakemake --profile slurm_profile
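Note that --profile takes the name of the directory, not of the config file itself, so the layout in this example is:

slurm_profile/
└── config.yaml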

When using Snakemake with SLURM, it often makes sense to designate some rules as “local rules”, meaning that they will not be submitted as SLURM jobs. This can be done with the localrules directive, which can be specified somewhere near the top of the Snakefile:

localrules: all, clean

Also, be aware that the main Snakemake process will keep running for as long as any job in the workflow is running, which could be days in some cases. Even though this process takes almost no resources, OSC would still kill it once it runs for more than 20 minutes. Therefore, it often makes sense to run Snakemake itself as a SLURM job. Below, we create a script called snakemake_script.sh and then submit it:

#!/bin/bash

#SBATCH --account=PAS1855
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

set -e -u -o pipefail

snakemake -j1 -p

# Then, submit the script:
sbatch snakemake_script.sh

“Token” output file

Use a “token” output file if a rule has no unique output, for example because it only modifies an existing file. Then, touch the token file in the rule’s action:

rule token_example:
    input:  'file.txt'
    output: 'token_file'   # The file name is arbitrary, but it must actually be created
    shell: "my_command {input} && touch {output}"


A few worked examples

A basic workflow that trims, maps, and counts a set of FASTQ samples:

SAMPLES = glob_wildcards("data/{smp}.fastq").smp

rule all:
    input: "res/count_table.txt",

rule trim:
    input: "data/{smp}.fastq",
    output: "res/{smp}_trim.fastq",
    shell: "scripts/trim.sh {input} > {output}"

rule map:
    input: "res/{smp}_trim.fastq",
    output: "res/{smp}.bam",
    shell: "scripts/map.sh {input} > {output}"

rule count:
    input: expand("res/{smp}.bam", smp=SAMPLES),
    output: "res/count_table.txt",
    shell: "scripts/count.sh {input} > {output}"
A more elaborate version of the same workflow, with log files, a named input for the reference genome, a localrules directive, and an added FastQC rule:

SAMPLES = glob_wildcards("data/{smp}.fastq").smp

REF_FA = "metadata/ref.fa"

localrules: all

rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES)


rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"


rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"


rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".