pracs-sp21: Snakemake

Relevant course modules

Week 13

Command-line interface

Rules or output files can optionally be provided as arguments:
- snakemake -j1 – run the first rule and all its dependencies.
- snakemake -j1 my_rule – run the rule my_rule and all its dependencies.
- snakemake -j1 my_output_file – run whatever rules are needed to be run to produce the output file my_output_file. In the Snakefile, my_output_file can either be literally specified as an output file, or can be inferred by Snakemake from an output directive given possible wildcard values.
If a Snakefile is called either of the following (relative to the dir from which snakemake is called), it will be automatically detected: Snakefile / snakefile / workflow/Snakefile / workflow/snakefile. Otherwise, use the -s option to specify the Snakefile to run.

Option	Meaning	Example
`-j` / `--jobs` / `--cores`	Mandatory option: maximum number of jobs to be run in parallel. At OSC, this will be the max. nr. of SLURM jobs to be submitted; when running locally, this should not exceed the number of cores.	`snakemake -j1` `snakemake -j100`
`-n` / `--dryrun`	Don’t run anything, just report what would be run.
`-p` / `--printshellcmds`	Print commands from `shell` directives that will be executed.
`-r` / `--reason`	Give reason of execution for every job.
`-q` / `--quiet`	Snakemake will print less output to screen – can be useful with `-n` just to get an overview of jobs that will be run.
`-s` / `--snakefile`	Name of / path to the Snakefile.	Run the Snakefile `rules.smk`: `snakemake -s rules.smk`
`--use-conda`	In combination with `conda` directive(s) in the Snakefile, will run jobs in a Conda environment.
`--cluster`	Basically a prefix that should be added to any command in order to submit to a cluster. At OSC, this will need to be at a minimum `sbatch --account=PAS1855`, and any number of SLURM options can be added. With additional non-default SLURM options, it becomes more practical to use a profile (see below).	Have at most 50 jobs in the queue: `snakemake -j50 --cluster \` `"sbatch --account=PAS1855"` Every SLURM job should have a time limit of 10 minutes: `snakemake -j50 --cluster \` `"sbatch --account=PAS1855 --time=10"`
`---lint`	Run the Snakemake “linter” on the Snakefile – check syntax, some best practices, and so on.
`-f` / `--force <target>`	Force creation of a specified target file, or if nothing is specified, the first rule.	Force-run whatever is needed to produce smpA.bam: `snakemake -f smpA.bam` Force-run the first rule: `snakemake -f`
`-F` / `--forceall <rule>`	Force running a specified rule and all dependencies of that rule, or if nothing is specified, the first rule.	Force-run “rule map:” `snakemake -F map` Force-run the workflow if first rule is “rule all”: `snakemake -F`
`-R` / `--forcerun`	Force creation of a list of target files. Useful in combination with the `--list-code-changes` option which will output a list of output files affected by changes in the Snakefile.	The command substitution will pass all relevant output files to the -R option: `snakemake -j1 -R \` `$(snakemake --list-code-changes)`
`--report`	Create an HTML report with runtime statistics, workflow
`--dag`	Create a “Directed Acyclic Graph” (DAG) of all jobs in the workflow.	`snakemake --dag \| \` `dot -T svg > jobs.svg`
`--rulegraph`	Create a “Directed Acyclic Graph” (DAG) of all rules in the workflow (better for larger workflows). ddddddddddddddddddddddddddddddddddddddd	`snakemake --rulegraph \| \` `dot -T svg > rules.svg` ddddddddddddddddddddddddddddddddddddddddddd

Rules

It is common to have a first rule in the Snakefile called rule all that only has an input directive listing all final output files. (Recall that “final” here means any output files that are not used as input by other rules, e.g. MultiQC output.)
Relationships/dependencies between rules are implicit: Snakemake infers these from input and output directives.
The order of rules in the Snakefile is not important except that the first rule is run by default.

Rule directives

Directive	Expected values	Examples
`input`	Input file(s) for the rule. When using a wildcard `{}`, multiple jobs are run for the rule, each with one of the possible wildcard values. If you need to run all wildcard values (samples/files) at once, use `expand()`.	`input: "ref.fa"` With wildcard: `input: "{smp}.fq"` Inputs can be named: `input: fq="my.fq", ref="ref.fa"` Use the expand() function: `input: expand()`
`output`	Like `input` but specifies output files: files produced by the rule. If using wildcards, `output` should have the same wildcard(s) as `input`.	`output: {smp}.bam`
`log`	Like `output` but meant for logging and error output: `log` files are not deleted if jobs fail. Recall that `2>` redirects standard error whereas `&>` redirects standard out and standard error.	Wildcards can be used: `log: log/map_{sample}.log` Using log files in actions: `map.sh {input.fq} {input.ref} >{output} 2>{log}` `fastqc.sh {input} res &>{log}`
`shell`	Run any arbitrary shell command: e.g. calling an external program or script. Other “action directives” that we did not use in this course include `run` for Python code entered directly in the Snakefile.	`shell: "map.sh {input.ref} {input.fq} > {output}"` `shell: "cat {smp}.txt \| tr a-z A-Z > {smp}.out"` `shell: "fastqc {input}"`
`params`	Can be used to clearly and separately indicate certain variables/parameters that are used in the action.	`params: minqual="30", minlen="100"` `params: indir="res"` `shell: "my.sh {params.indir} {wildcards.smp}"`
`threads`	The number of CPUs/cores/threads to be used by a single job – corresponds to SLURM’s `--cpus-per-task` option.	`rule STAR: ... threads: 8 shell: "STAR --runThreadN {threads} ..."`
`resources`	Mostly arbitrary key-value pairs with resources: the keys should match those specified in a `config.yaml` profile config file (see below), so the appropriate resources are requested for the SLURM job.	`resources: mem_mb=50000`
`conda`	Used to specify a YAML file with a Conda environment description. Snakemake will perform the one-time Conda installation and use the resulting Conda environment when running the rule. Note: this also requires the `--use-conda` option to be specified when running Snakemake! ddddddddddddddddddddddddddddddddddddd	`conda: "envs/fastqc.yaml"` An example YAML file: `channels: - bioconda dependencies: - fastqc=0.11.8`

Placeholders and wildcards

Note that wildcards operate entirely within a single rule and not across rules! That is, even though it often makes sense to use the exact same wildcard across multiple rules, Snakemake will resolve them separately for each rule.

Placeholder Explanation Examples

Placeholder	Explanation	Examples
`{input}`	Refers to the file(s) specified in the `input` directive – to be used in “action directives” such as `shell`. (If `input` has a wildcard, one file is passed for each job i.e. iteration of the rule.)	`shell: "trim.sh {input} > {output}"` Named input: `shell: map.sh {input.ref} {input.fq} > {output}`
`{output}`	Refers to the file(s) specified in the `output` directive – to be used in “action directives” such as `shell`.	`shell: "trim.sh {input} > {output}"`
`{…}`	A wildcard, which can be given any name. Used as-is in `input`, `output`, and `log` directives, for which Snakemake resolves the values of the wildcards by looking for requested output files it can produce. To use a wildcard in a `shell` directive, assuming the wildcard is called `sample`, use: `{wildcards.sample}`. ddddddddddddddddddddddddddddddddddddd	`input: {sample}.fq` `output: {sample}.bam` To use a wildcard in an action like a shell directive: `shell: my.sh -i {wildcards.sample} > {output}`

{input}

Refers to the file(s) specified in the input directive – to be used in “action directives” such as shell. (If input has a wildcard, one file is passed for each job i.e. iteration of the rule.)

shell: "trim.sh {input} > {output}"

Named input:
shell: map.sh {input.ref} {input.fq} > {output}

{output} Refers to the file(s) specified in the output directive – to be used in “action directives” such as shell. shell: "trim.sh {input} > {output}"

{…}

A wildcard, which can be given any name.

Used as-is in input, output, and log directives, for which Snakemake resolves the values of the wildcards by looking for requested output files it can produce.

To use a wildcard in a shell directive, assuming the wildcard is called sample, use: {wildcards.sample}.
ddddddddddddddddddddddddddddddddddddd

input: {sample}.fq
output: {sample}.bam

To use a wildcard in an action like a shell directive:
shell: my.sh -i {wildcards.sample} > {output}

Functions

Snakemake provides a few convenience functions, most notably expand() and glob_wildcards(). Note that you can interactively test these functions in Python after importing them using:

# Just for interactive testing -- does not need to be done in a Snakefile:
from snakemake.io import expand, glob_wildcards

`expand()`

expand() is a more succinct alternative to a list comprehension, which will replace one or more placeholders {} with all possible values from a list. If multiple lists are provided, all combinations of these lists will be generated – see the second example below:

# Example with a single list "SAMPLES":
SAMPLES=["sampleA", "sampleB", "sampleC"] 
expand("res/{sample}.bam", sample=SAMPLES)
#> ['res/smpA.bam', 'res/smpB.bam', 'res/smpC.bam']

# Example with two lists, "SAMPLES" and "READS":
SAMPLES = ["sampleA", "sampleB", "sampleC"]
READS = ["R1", "R2"]
expand("{sample}_{read}.fastq.gz", sample=SAMPLES, read=READS)
#> ['sampleA_R1.fastq.gz', 'sampleA_R2.fastq.gz',
#>  'sampleB_R1.fastq.gz', 'sampleB_R2.fastq.gz',
#>  'sampleC_R1.fastq.gz', 'sampleC_R2.fastq.gz']

`glob_wildcards()`

glob_wildcards() will perform shell globbing, i.e. search for existing files, and store one or more sets of values –usually sample names– that are extracted from the file names (akin to regex backreferences):

# Typical usage at the top of a Snakefile:
SAMPLES = glob_wildcards("data/{sample}.fastq").sample

# Or equivalently, with a trailing comma rather than trailing `.<wildcard-name>`:
SAMPLES, = glob_wildcards("data/{sample}.fastq")

# With two wildcards:
SAMPLES,READS = glob_wildcards("data/{sample}_{read}.fastq")

# Checking how it works in IPython -- with one wildcard:

!ls data/
#> data/sampleA.fastq data/sampleB.fastq data/sampleC.fastq

glob_wildcards("data/{sample}.fastq").sample
#> ['sampleA', 'sampleB', 'sampleC']

# Checking how it works in IPython -- with two wildcards:

!ls data/
#> A_R1.fastq.gz A_R2.fastq.gz B_R1.fastq.gz B_R2.fastq.gz

glob_wildcards("data/{sample}_{read}.fastq.gz").sample
#> ['A', 'A', 'B', 'B']
glob_wildcards("data/{sample}_{read}.fastq.gz").read
#> ['R1', 'R2', 'R1', 'R2']

Others

temp() will mark files as temporary – to be deleted if the workflow finishes:
```
output: temp("mapped/{sample}.bam")
```

protected() will mark files as protected (no write permissions):
```
output: protected("sorted_reads/{sample}.bam")
```

Misc. tips & tricks

Configuration file

To avoid hardcoding certain run-specific variables in the Snakemake (sample names, output directories, parameters for software, and so on), you can use a YAML or JSON-formatted configuration file and include a configfile directive somewhere at the top of the Snakefile:

# Include this line in the Snakefile to read the file "config.yml":
configfile: "config.yml"

# Now, the contents of "config.yml" is available in a dictionary:
OUT_DIR=config["output_dir"]

# Say "config.yaml" just contains the following line:
output_dir: path/to/output/

Profile dir with SLURM and other settings

A configuration file (or set of files) can also be used to pass options to a Snakemake call. This is particularly handy with SLURM options, since a long command like snakemake -j100 --cluster "--account=PAS1855 --time=12:00:00 --mem=12G" is not very practical to type whenever running Snakemake.

Add a file called config.yaml in a directory with an arbitrary name – but something like slurm_profile makes sense.

In this file, specify options that can also be passed to snakemake on the command line, e.g.:

# Just beware that options are followed by a colon ":" in a YAML file:

cluster: "sbatch --account={resources.account}
                 --time={resources.time_min}
                 --mem={resources.mem_mb}
                 --cpus-per-task={resources.cpus}
                 --output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]

# You can also specify other options than cluster-specific ones:
jobs: 100
latency-wait: 30
use-conda: true     # "--use-conda" at the command line becomes this

Then, use the --profile option and specify the name of the directory containing the config.yaml file:

# (With the above config.yaml, we now no longer need to add the -j option:)   
snakemake --profile slurm_profile

When using Snakemake with SLURM, it often makes sense to designate come rules as “local rules” meaning that they will not be submitted as SLURM jobs. This can be done with the localrules directive that can be specified somewhere near the top of the Snakefile:

localrules: all, clean

Also, be aware that the main Snakemake process will run as long any job in the workflow is running, which could be days in some cases. Even though this process takes almost no resources, it would still be killed by OSC as soon as it runs over 20 minutes. Therefore, it often makes sense to run Snakemake as a SLURM job itself. Below, we create a script called snakemake_script.sh and then submit it:

#!/bin/bash

#SBATCH --account=PAS1855
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4

set -e -u -o pipefail

snakemake -j1 -p

sbatch snakemake_script.sh

“Token” output file

Use a “token” output file if a rule has no unique output, for example because it only modifies an existing file. Then, touch the token file in the rule’s action:

rule token_example:
    input:  'file.txt'
    output: 'token_file'   # The name of the file is arbitrary - but it will be created
    shell: "my_command {input} && touch {output}"

A few worked examples

SAMPLES = glob_wildcards("data/{smp}.fastq").smp

rule all:
    input: "res/count_table.txt",

rule trim:
    input: "data/{smp}.fastq",
    output: "res/{smp}_trim.fastq",
    shell: "scripts/trim.sh {input} > {output}"

rule map:
    input: "res/{smp}_trim.fastq",
    output: "res/{smp}.bam",
    shell: "scripts/map.sh {input} > {output}"

rule count:
    input: expand("res/{smp}.bam", smp=SAMPLES),
    output: "res/count_table.txt",
    shell: "scripts/count.sh {input} > {output}"

SAMPLES = glob_wildcards("data/{smp}.fastq").smp

REF_FA = "metadata/ref.fa"

localrules: all

rule all:
    input:
        "res/count_table.txt",
         expand("res/{smp}.fastqc.html", smp=SAMPLES)


rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"


rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"


rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"

Snakemake