Topic overview
Rules or output files can optionally be provided as arguments:
snakemake -j1 – run the first rule and all its dependencies.
snakemake -j1 my_rule – run the rule my_rule and all its dependencies.
snakemake -j1 my_output_file – run whatever rules are needed to produce the output file my_output_file. In the Snakefile, my_output_file can either be specified literally as an output file, or be inferred by Snakemake from an output directive given possible wildcard values.
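To make the last case more concrete, the sketch below shows one way a requested file name could be matched against an output pattern that contains a wildcard. It is a simplified illustration rather than Snakemake's actual implementation: the function name match_output is hypothetical, and each wildcard is assumed to match any non-empty string.

```python
import re

def match_output(pattern, target):
    """Return the wildcard values if `target` fits `pattern`, else None."""
    # Split into literal text and wildcard names: literal, name, literal, ...
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        if i % 2 == 0:
            regex += re.escape(part)       # literal text must match exactly
        else:
            regex += f"(?P<{part}>.+)"     # a wildcard matches any non-empty string
    match = re.fullmatch(regex, target)
    return match.groupdict() if match else None

print(match_output("res/{smp}.bam", "res/sampleA.bam"))      # {'smp': 'sampleA'}
print(match_output("res/{smp}.bam", "res/count_table.txt"))  # None
```

In this picture, a rule whose output pattern returns a dictionary is a candidate for producing the requested file, with the wildcard values filled in accordingly.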
If the Snakefile has one of the following names (relative to the directory from which snakemake is called), it will be detected automatically: Snakefile, snakefile, workflow/Snakefile, or workflow/snakefile. Otherwise, use the -s option to specify the Snakefile to run.
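The lookup described above can be pictured with a few lines of Python. This is only a sketch of the idea, not Snakemake's own code: the function name find_snakefile is hypothetical, and the order in which the four names are tried is an assumption.

```python
from pathlib import Path

# The default names listed above; the order used here is an assumption
DEFAULT_SNAKEFILES = [
    "Snakefile",
    "snakefile",
    "workflow/Snakefile",
    "workflow/snakefile",
]

def find_snakefile(base_dir="."):
    """Return the first default Snakefile found under base_dir, or None."""
    for candidate in DEFAULT_SNAKEFILES:
        path = Path(base_dir) / candidate
        if path.is_file():
            return path
    return None
```

If a call like find_snakefile() returns None, the Snakefile has a non-default name or location, which is the situation where -s is needed.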
Option | Meaning | Example |
---|---|---|
-j / --jobs / --cores | Mandatory option: the maximum number of jobs to be run in parallel. At OSC, this will be the maximum number of SLURM jobs to be submitted; when running locally, this should not exceed the number of cores. | |
-n / --dryrun | Don’t run anything, just report what would be run. | |
-p / --printshellcmds | Print the commands from shell directives that will be executed. | |
-r / --reason | Give the reason for execution of every job. | |
-q / --quiet | Snakemake will print less output to screen; this can be useful with -n just to get an overview of the jobs that will be run. | |
-s / --snakefile | Name of / path to the Snakefile. | Run the Snakefile rules.smk: snakemake -s rules.smk |
--use-conda | In combination with conda directive(s) in the Snakefile, will run jobs in a Conda environment. | |
--cluster | Basically a prefix that should be added to any command in order to submit it to a cluster. At OSC, this will need to be at a minimum … With additional non-default SLURM options, it becomes more practical to use a profile (see below). | Have at most 50 jobs in the queue: … Every SLURM job should have a time limit of 10 minutes: … |
--lint | Run the Snakemake “linter” on the Snakefile: check syntax, some best practices, and so on. | |
-f / --force <target> | Force creation of a specified target file, or if nothing is specified, the first rule. | Force-run whatever is needed to produce smpA.bam: snakemake -f smpA.bam. Force-run the first rule: snakemake -f |
-F / --forceall <rule> | Force running a specified rule and all dependencies of that rule, or if nothing is specified, the first rule. | Force-run “rule map”: snakemake -F map. Force-run the workflow if the first rule is “rule all”: snakemake -F |
-R / --forcerun | Force creation of a list of target files. Useful in combination with an option such as --list-code-changes, as in the example. | The command substitution will pass all relevant output files to the -R option: snakemake -j1 -R $(snakemake --list-code-changes) |
--report | Create an HTML report with runtime statistics, workflow topology, and results. | |
--dag | Create a “Directed Acyclic Graph” (DAG) of all jobs in the workflow. | snakemake --dag \| dot -T svg > jobs.svg |
--rulegraph | Create a “Directed Acyclic Graph” (DAG) of all rules in the workflow (better for larger workflows). | snakemake --rulegraph \| dot -T svg > rules.svg |
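As a rough picture of what a rule graph is, the sketch below derives edges between rules by checking whether one rule's output pattern appears among another rule's inputs, and writes the result in the DOT format that the dot program reads. This is a simplified illustration under stated assumptions: the RULES dictionary and the function rulegraph_dot are hypothetical, and patterns are compared literally, whereas Snakemake matches actual file names.

```python
# Hypothetical, simplified description of a workflow: rule name -> (inputs, outputs)
RULES = {
    "trim":  (["data/{smp}.fastq"], ["res/{smp}_trim.fastq"]),
    "map":   (["res/{smp}_trim.fastq"], ["res/{smp}.bam"]),
    "count": (["res/{smp}.bam"], ["res/count_table.txt"]),
}

def rulegraph_dot(rules):
    """Return DOT text with an edge A -> B when an output of A is an input of B."""
    lines = ["digraph rulegraph {"]
    for upstream, (_, outputs) in rules.items():
        for downstream, (inputs, _) in rules.items():
            if set(outputs) & set(inputs):
                lines.append(f'    "{upstream}" -> "{downstream}";')
    lines.append("}")
    return "\n".join(lines)

print(rulegraph_dot(RULES))
```

For the three rules above, this prints a graph with the edges trim -> map and map -> count, which is the kind of text that would be piped to dot to draw the figure.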
rule all that only has an input directive listing all final output files. (Recall that “final” here means any output files that are not used as input by other rules, e.g. MultiQC output.)
input and output directives.
Directive | Expected values | Examples |
---|---|---|
input | Input file(s) for the rule. When using a wildcard … If you need to run all wildcard values (samples/files) at once, use the expand() function. | input: "ref.fa". With wildcard: input: "{smp}.fq". Inputs can be named: input: fq="my.fq", ref="ref.fa". Use the expand() function: input: expand() |
output | Like input, but specifies output files: files produced by the rule. If using wildcards, output should have the same wildcard(s) as input. | output: "{smp}.bam" |
log | Like … Recall that … | Wildcards can be used: … Using log files in actions: … |
shell | Run any arbitrary shell command, e.g. calling an external program or script. Other “action directives” that we did not use in this course include … | |
params | Can be used to clearly and separately indicate certain variables/parameters that are used in the action. | |
threads | The number of CPUs/cores/threads to be used by a single job; corresponds to SLURM’s --cpus-per-task option. | |
resources | Mostly arbitrary key-value pairs with resources: the keys should match those specified in the profile's config.yaml file (see below), so that the appropriate resources are requested for the SLURM job. | resources: mem_mb=50000 |
conda | Used to specify a YAML file with a Conda environment description. Snakemake will perform the one-time Conda installation and use the resulting Conda environment when running the rule. Note: this also requires the --use-conda option. | An example YAML file: … |
Note that wildcards operate entirely within a single rule and not across rules! That is, even though it often makes sense to use the exact same wildcard across multiple rules, Snakemake will resolve them separately for each rule.
Placeholder | Explanation | Examples |
---|---|---|
{input} | Refers to the file(s) specified in the input directive; to be used in “action directives” such as shell. (If input has a wildcard, one file is passed for each job, i.e. each iteration of the rule.) | Named input: … |
{output} | Refers to the file(s) specified in the output directive; to be used in “action directives” such as shell. | shell: "trim.sh {input} > {output}" |
{…} | A wildcard, which can be given any name. Used as-is in … To use a wildcard in a … | To use a wildcard in an action like a shell directive: … |
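The way placeholders are filled in for a single job can be imitated with Python's own string formatting, which also supports the dotted form used for named inputs. The sketch below is an illustration only: the values are hypothetical ones for one job of a mapping rule with a wildcard named smp, and the {wildcards.smp} form is how a wildcard value is assumed to be referenced inside a shell command.

```python
from types import SimpleNamespace

# Hypothetical values for one job (wildcard value smp = "sampleA")
wildcards = SimpleNamespace(smp="sampleA")
input_files = SimpleNamespace(fastq="res/sampleA_trim.fastq", ref="metadata/ref.fa")
output_file = "res/sampleA.bam"

# The command template as it might appear in a shell directive
template = (
    "scripts/map.sh {input.fastq} {input.ref} "
    "> {output} 2> log/map/{wildcards.smp}.log"
)

command = template.format(input=input_files, output=output_file, wildcards=wildcards)
print(command)
# scripts/map.sh res/sampleA_trim.fastq metadata/ref.fa > res/sampleA.bam 2> log/map/sampleA.log
```

The same template yields a different command for every wildcard value, which is why a single rule can be run once per sample.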
Snakemake provides a few convenience functions, most notably expand() and glob_wildcards(). Note that you can interactively test these functions in Python after importing them using:
# Just for interactive testing -- does not need to be done in a Snakefile:
from snakemake.io import expand, glob_wildcards
expand()
expand() is a more succinct alternative to a list comprehension, which will replace one or more placeholders {} with all possible values from a list. If multiple lists are provided, all combinations of these lists will be generated (see the second example below):
# Example with a single list "SAMPLES":
SAMPLES = ["sampleA", "sampleB", "sampleC"]
expand("res/{sample}.bam", sample=SAMPLES)
#> ['res/sampleA.bam', 'res/sampleB.bam', 'res/sampleC.bam']

# Example with two lists, "SAMPLES" and "READS":
SAMPLES = ["sampleA", "sampleB", "sampleC"]
READS = ["R1", "R2"]
expand("{sample}_{read}.fastq.gz", sample=SAMPLES, read=READS)
#> ['sampleA_R1.fastq.gz', 'sampleA_R2.fastq.gz',
#>  'sampleB_R1.fastq.gz', 'sampleB_R2.fastq.gz',
#>  'sampleC_R1.fastq.gz', 'sampleC_R2.fastq.gz']
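For comparison, the same two lists can be built in plain Python without Snakemake. The sketch below uses a list comprehension for the single-list case and itertools.product for the all-combinations case; the variable names bams and fastqs are just for illustration.

```python
from itertools import product

SAMPLES = ["sampleA", "sampleB", "sampleC"]
READS = ["R1", "R2"]

# One list: a plain list comprehension
bams = [f"res/{sample}.bam" for sample in SAMPLES]

# Two lists: every combination, with the first list varying slowest
fastqs = [
    "{sample}_{read}.fastq.gz".format(sample=sample, read=read)
    for sample, read in product(SAMPLES, READS)
]

print(bams)
print(fastqs)
```

Both results are identical to the expand() outputs shown above, which illustrates why expand() is described as a shorthand for a list comprehension.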
glob_wildcards()
glob_wildcards() will perform shell globbing, i.e. search for existing files, and store one or more sets of values (usually sample names) that are extracted from the file names (akin to regex backreferences):
# Typical usage at the top of a Snakefile:
SAMPLES = glob_wildcards("data/{sample}.fastq").sample

# Or equivalently, with a trailing comma rather than trailing `.<wildcard-name>`:
SAMPLES, = glob_wildcards("data/{sample}.fastq")

# With two wildcards:
SAMPLES, READS = glob_wildcards("data/{sample}_{read}.fastq")

# Checking how it works in IPython -- with one wildcard:
!ls data/
#> sampleA.fastq sampleB.fastq sampleC.fastq
glob_wildcards("data/{sample}.fastq").sample
#> ['sampleA', 'sampleB', 'sampleC']

# Checking how it works in IPython -- with two wildcards:
!ls data/
#> A_R1.fastq.gz A_R2.fastq.gz B_R1.fastq.gz B_R2.fastq.gz
glob_wildcards("data/{sample}_{read}.fastq.gz").sample
#> ['A', 'A', 'B', 'B']
glob_wildcards("data/{sample}_{read}.fastq.gz").read
#> ['R1', 'R2', 'R1', 'R2']
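The idea behind this function can be sketched with the standard library alone: turn the pattern into a regular expression with one named group per wildcard, then collect the group values for every existing file that matches. This is a simplified stand-in and not Snakemake's implementation; the function name glob_values is hypothetical, results are sorted here for reproducibility, and the files are created in a temporary directory purely for the demonstration.

```python
import re
import tempfile
from pathlib import Path

def glob_values(pattern, root="."):
    """Return {wildcard_name: [values]} for existing files under root matching pattern."""
    parts = re.split(r"\{(\w+)\}", pattern)   # literal, name, literal, name, ...
    names = parts[1::2]
    regex = "".join(
        re.escape(part) if i % 2 == 0 else f"(?P<{part}>.+)"
        for i, part in enumerate(parts)
    )
    values = {name: [] for name in names}
    for path in sorted(Path(root).rglob("*")):
        match = re.fullmatch(regex, path.relative_to(root).as_posix())
        if path.is_file() and match:
            for name in names:
                values[name].append(match.group(name))
    return values

# Demonstration with the four file names from the example above
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data").mkdir()
    for file_name in ["A_R1.fastq.gz", "A_R2.fastq.gz", "B_R1.fastq.gz", "B_R2.fastq.gz"]:
        (Path(tmp) / "data" / file_name).touch()
    found = glob_values("data/{sample}_{read}.fastq.gz", root=tmp)

print(found["sample"])  # ['A', 'A', 'B', 'B']
print(found["read"])    # ['R1', 'R2', 'R1', 'R2']
```

Note that the values come back once per matching file, so a sample name is repeated for each of its read files, just as in the IPython output above.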
temp() will mark files as temporary, to be deleted once all jobs that use them as input have finished:
output: temp("mapped/{sample}.bam")
protected() will mark files as protected (no write permissions):
output: protected("sorted_reads/{sample}.bam")
To avoid hardcoding certain run-specific variables in the Snakefile (sample names, output directories, parameters for software, and so on), you can use a YAML- or JSON-formatted configuration file and include a configfile directive somewhere at the top of the Snakefile:
# Include this line in the Snakefile to read the file "config.yml":
configfile: "config.yml"

# Now, the contents of "config.yml" are available in a dictionary:
OUT_DIR = config["output_dir"]

# Say "config.yml" just contains the following line:
output_dir: path/to/output/
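Outside of Snakemake, the effect of reading a configuration file into a dictionary can be tried with the standard library. The sketch below uses the JSON variant because Python can parse it without extra packages; the samples key and the file name config.json are hypothetical additions for illustration.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical JSON configuration; "samples" is an illustrative extra key
config_text = '{"output_dir": "path/to/output/", "samples": ["sampleA", "sampleB"]}'

with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "config.json"
    config_path.write_text(config_text)
    # This mimics what becomes available as the "config" dictionary
    config = json.loads(config_path.read_text())

OUT_DIR = config["output_dir"]
bam_files = [f"{OUT_DIR}{sample}.bam" for sample in config["samples"]]
print(bam_files)  # ['path/to/output/sampleA.bam', 'path/to/output/sampleB.bam']
```

Keeping such values in the configuration file means the workflow code itself does not need to change from one run to the next.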
A configuration file (or set of files) can also be used to pass options to a Snakemake call. This is particularly handy with SLURM options, since a long command like snakemake -j100 --cluster "sbatch --account=PAS1855 --time=12:00:00 --mem=12G" is not very practical to type whenever running Snakemake.
Add a file called config.yaml in a directory with an arbitrary name, though something like slurm_profile makes sense.
In this file, specify options that can also be passed to snakemake on the command line, e.g.:
# Just beware that options are followed by a colon ":" in a YAML file:
cluster: "sbatch --account={resources.account}
                 --time={resources.time_min}
                 --mem={resources.mem_mb}
                 --cpus-per-task={resources.cpus}
                 --output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]

# You can also specify other options than cluster-specific ones:
jobs: 100
latency-wait: 30
use-conda: true # "--use-conda" at the command line becomes this
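To see how the placeholders in the cluster string combine with the default resources, the sketch below fills in the same template for one job using Python string formatting. It is an illustration under assumptions, not Snakemake's code: the function submit_command is hypothetical, the wildcards are passed as a ready-made string, and a rule-specific mem_mb value is assumed to override the default.

```python
from types import SimpleNamespace

# The cluster template from the profile above, written on one line
cluster_template = (
    "sbatch --account={resources.account} --time={resources.time_min} "
    "--mem={resources.mem_mb} --cpus-per-task={resources.cpus} "
    "--output=log/slurm-{rule}_{wildcards}.out"
)

# Values from the default-resources line
defaults = dict(cpus=1, mem_mb=1000, time_min=5, account="PAS1855")

def submit_command(rule, wildcards, **rule_resources):
    """Fill in the template; resources set by the rule override the defaults."""
    resources = SimpleNamespace(**{**defaults, **rule_resources})
    return cluster_template.format(rule=rule, wildcards=wildcards, resources=resources)

print(submit_command("map", "smp=sampleA", mem_mb=50000))
# sbatch --account=PAS1855 --time=5 --mem=50000 --cpus-per-task=1 --output=log/slurm-map_smp=sampleA.out
```

This is why the keys in a rule's resources directive need to match the names used in the profile: a name that appears in the template but is defined nowhere cannot be filled in.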
Then, use the --profile option and specify the name of the directory containing the config.yaml file:
# (With the above config.yaml, we now no longer need to add the -j option:)
snakemake --profile slurm_profile
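The correspondence between profile entries and command-line options can be pictured as a simple translation from key-value pairs to flags. The sketch below is a hypothetical illustration of that mapping, not how Snakemake reads profiles internally; the function name profile_to_args is an assumption.

```python
def profile_to_args(profile):
    """Translate profile entries into the equivalent command-line arguments."""
    args = []
    for key, value in profile.items():
        flag = f"--{key}"
        if value is True:
            args.append(flag)                    # boolean switch, e.g. --use-conda
        elif value is False:
            continue                             # switched off: no flag at all
        elif isinstance(value, list):
            args.append(flag)
            args.extend(str(item) for item in value)
        else:
            args.extend([flag, str(value)])
    return args

profile = {
    "jobs": 100,
    "latency-wait": 30,
    "use-conda": True,
    "default-resources": ["cpus=1", "mem_mb=1000"],
}

print(" ".join(["snakemake"] + profile_to_args(profile)))
# snakemake --jobs 100 --latency-wait 30 --use-conda --default-resources cpus=1 mem_mb=1000
```

Seen this way, a profile is a saved set of command-line options, which is why -j no longer has to be typed once jobs is set in config.yaml.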
When using Snakemake with SLURM, it often makes sense to designate some rules as “local rules”, meaning that they will not be submitted as SLURM jobs. This can be done with the localrules directive, which can be specified somewhere near the top of the Snakefile:
localrules: all, clean
Also, be aware that the main Snakemake process will keep running as long as any job in the workflow is running, which could be days in some cases. Even though this process takes almost no resources, it would still be killed by OSC as soon as it runs over 20 minutes. Therefore, it often makes sense to run Snakemake as a SLURM job itself. Below, we create a script called snakemake_script.sh and then submit it:
#!/bin/bash
#SBATCH --account=PAS1855
#SBATCH --time=24:00:00
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
set -e -u -o pipefail
snakemake -j1 -p
sbatch snakemake_script.sh
Use a “token” output file if a rule has no unique output, for example because it only modifies an existing file. Then, touch the token file in the rule’s action:
rule token_example:
    input: 'file.txt'
    output: 'token_file' # The name of the file is arbitrary - but it will be created
    shell: "my_command {input} && touch {output}"
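The logic of the && in that shell command, creating the token only when the real work succeeded, can be mirrored in a few lines of Python. This is a minimal sketch: the function run_with_token is hypothetical and the lambda merely stands in for my_command.

```python
import tempfile
from pathlib import Path

def run_with_token(action, token_path):
    """Run action(); create the token file only if it finished without raising."""
    action()
    Path(token_path).touch()

with tempfile.TemporaryDirectory() as tmp:
    token = Path(tmp) / "token_file"
    run_with_token(lambda: None, token)   # the lambda stands in for my_command
    token_created = token.exists()

print(token_created)  # True
```

Because the token only appears after a successful run, its presence tells the workflow that the step is done, even though no other new file was produced.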
SAMPLES = glob_wildcards("data/{smp}.fastq").smp

rule all:
    input: "res/count_table.txt",

rule trim:
    input: "data/{smp}.fastq",
    output: "res/{smp}_trim.fastq",
    shell: "scripts/trim.sh {input} > {output}"

rule map:
    input: "res/{smp}_trim.fastq",
    output: "res/{smp}.bam",
    shell: "scripts/map.sh {input} > {output}"

rule count:
    input: expand("res/{smp}.bam", smp=SAMPLES),
    output: "res/count_table.txt",
    shell: "scripts/count.sh {input} > {output}"
SAMPLES = glob_wildcards("data/{smp}.fastq").smp
REF_FA = "metadata/ref.fa"

localrules: all

rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES)

rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"

rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"

rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"

rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".