In these exercises, you’ll take up where we left off in class, using a dummy RNA-seq workflow.
Load the Conda environment in which you have Snakemake installed:
$ module load python/3.6-conda5.2
$ source activate ipy-env
# Check if Snakemake can be found and is inside your ipy-env:
$ which snakemake
#> ~/.conda/envs/ipy-env/bin/snakemake
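If you like, you can also check which Snakemake version is installed (any reasonably recent version should be fine for these exercises):
$ snakemake --version   # prints the installed version number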
We will recreate a very similar set of files and the final Snakefile from class – copy the files from /fs/ess/PAS1855/exercises/week13/ to do so:
$ cd /fs/ess/PAS1855/users/$USER/week13
$ cp -r /fs/ess/PAS1855/exercises/week13/exercise1 .
$ cd exercise1
$ tree
#> .
#> ├── data
#> │   ├── smpA.fastq
#> │   ├── smpC.fastq
#> │   └── smpG.fastq
#> ├── res
#> ├── scripts
#> │   ├── count.sh
#> │   ├── map.sh
#> │   └── trim.sh
#> └── workflow
#>     ├── DAGs
#>     └── Snakefile
Create the dir structure and the dummy FASTQ files:
mkdir -p data scripts res workflow/DAGs
echo "AAAAAAA" > data/smpA.fastq
echo "CCCCCCC" > data/smpC.fastq
echo "GGGGGGG" > data/smpG.fastq
Create a script scripts/trim.sh
containing:
#!/bin/bash
set -e -u -o pipefail
echo "FASTQ file: $1 after trimming"
cat "$1"
Create a script scripts/map.sh
containing:
#!/bin/bash
set -e -u -o pipefail
echo "BAM from FASTQ file: $1"
cat "$1"
Create a script scripts/count.sh
containing:
#!/bin/bash
set -e -u -o pipefail
echo "Counts for $# BAM files:"
cat "$@"
Make the scripts executable:
chmod +x scripts/*
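To see what kind of dummy output these scripts produce, you could try one by hand, e.g.:
$ scripts/trim.sh data/smpA.fastq
#> FASTQ file: data/smpA.fastq after trimming
#> AAAAAAA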
Create a file workflow/Snakefile
and paste into it:
"""
A dummy RNAseq workflow with FASTQ files as input and a count table as output.
"""
SAMPLES=glob_wildcards("data/{smp}.fastq").smp
rule all:
"""
This is a "pseudo-rule" meant to trigger the execution of the full workflow.
"""
input:
"res/count_table.txt",
rule trim:
"""
Trim the FASTQ files.
"""
input:
"data/{smp}.fastq",
output:
"res/{smp}_trim.fastq",
shell:
"scripts/trim.sh {input} > {output}"
rule map:
"""
Map the trimmed FASTQ files to a reference genome.
"""
input:
"res/{smp}_trim.fastq",
output:
"res/{smp}.bam",
shell:
"scripts/map.sh {input} > {output}"
rule count:
"""
From the BAM files, create a gene count table.
"""
input:
expand("res/{smp}.bam", smp=SAMPLES),
output:
"res/count_table.txt",
shell:
"scripts/count.sh {input} > {output}"
Currently, our Snakefile workflow/Snakefile
contains the following:
"""
A dummy RNAseq workflow with FASTQ files as input and a count table as output.
"""
SAMPLES = glob_wildcards("data/{smp}.fastq").smp
rule all:
"""
This is a "pseudo-rule" meant to trigger the execution of the full workflow.
"""
input:
"res/count_table.txt",
rule trim:
"""
Trim the FASTQ files.
"""
input:
"data/{smp}.fastq",
output:
"res/{smp}_trim.fastq",
shell:
"scripts/trim.sh {input} > {output}"
rule map:
"""
Map the trimmed FASTQ files to a reference genome.
"""
input:
"res/{smp}_trim.fastq",
output:
"res/{smp}.bam",
shell:
"scripts/map.sh {input} > {output}"
rule count:
"""
From the BAM files, create a gene count table.
"""
input:
expand("res/{smp}.bam", smp=SAMPLES),
output:
"res/count_table.txt",
shell:
"scripts/count.sh {input} > {output}"
Note that some formatting details in the Snakefile above are slightly different from what was shown in class. I used the Snakefmt
VS Code extension, which lets a little program of the same name format Snakefiles for you.1
rule count:
    """
    From the BAM files, create a gene count table.
    """
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    shell:
        "scripts/count.sh {input} > {output}"
Snakefmt
will detect syntax errors and ensure consistent and clear formatting, as shown for rule count
above:
The value for every directive is on a new, indented line.
Every item in input
and output
is trailed by a comma.2
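If you have the snakefmt command-line program installed (optional, e.g. via pip or conda), you could also format the Snakefile from the shell; these commands are just a sketch and the exercises don’t depend on them:
# Preview the changes snakefmt would make, without applying them:
snakefmt --diff workflow/Snakefile
# Reformat the file in place:
snakefmt workflow/Snakefile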
Furthermore, I have added some Python multi-line (triple-quote) comments explaining what the Snakefile and each individual rule does.
Create a visualization of the workflow DAG and store it in workflow/DAGs:
snakemake --dag | dot -T svg > workflow/DAGs/v1.svg
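If the per-sample DAG gets large, a condensed per-rule graph can be easier to read; Snakemake’s --rulegraph option produces one (a sketch, with a hypothetical output filename):
snakemake --rulegraph | dot -T svg > workflow/DAGs/rulegraph.svg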
Below, Snakemake is being run with the -p
(print shell commands) and -q
(quiet) options for concise output. To better understand what Snakemake is doing, you may want to see more output yourself, e.g. with the -pr
combination of options (-r
: reason for execution of each job).
Also, recall that you can always do a “dry run” with the -n
option to see what Snakemake wants to execute.
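For example, a dry run that also prints the shell commands and the reason for each job (you can combine the flags however you like):
snakemake -j1 -npr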
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 1 count
#> 3 map
#> 3 trim
#> 8
#> scripts/trim.sh data/smpG.fastq > res/smpG_trim.fastq
#> scripts/map.sh res/smpG_trim.fastq > res/smpG.bam
#> scripts/trim.sh data/smpC.fastq > res/smpC_trim.fastq
#> scripts/trim.sh data/smpA.fastq > res/smpA_trim.fastq
#> scripts/map.sh res/smpA_trim.fastq > res/smpA.bam
#> scripts/map.sh res/smpC_trim.fastq > res/smpC.bam
#> scripts/count.sh res/smpG.bam res/smpC.bam res/smpA.bam > res/count_table.txt
Currently, our map
and count
rules don’t include any genomic reference files: mapping should happen against a reference genome (FASTA file), and counting needs an annotation for the reference genome (GTF/GFF file).
For now, we will fix this oversight just for the mapping rule. Let’s create a dummy genome FASTA file:
mkdir metadata
echo "AAAAAAACCCCCCCGGGGGGG" > metadata/ref.fa
We’ll add a line to the map.sh
script, such that the second argument passed to it is now the FASTA file with the reference genome:
echo 'echo "Using ref genome: $2" && cat $2' >> scripts/map.sh
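After appending that line, scripts/map.sh should contain:
#!/bin/bash
set -e -u -o pipefail
echo "BAM from FASTQ file: $1"
cat "$1"
echo "Using ref genome: $2" && cat $2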
To make the workflow pass this second argument, you’ll need to do the following in the map rule:
Pass the reference FASTA file to the map.sh
script in the shell
directive.
While not strictly necessary to avoid errors, you should include the reference FASTA file in the input
directive.
Because you will now have multiple items for input
, it will be good to name them – the syntax for that is: input: myname=myfile
, and you can then recall this using {input.myname}
in the shell
directive.
Moreover, it is good practice to define hard-coded input files like this one clearly at the top of your script/Snakefile (or use a configuration file).
Assign the filename somewhere near the top:
REF_FA = "metadata/ref.fa"
Adjust rule map
:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} > {output}"
Snakemake will indeed rerun the rule map
and rule count
.
It will rerun rule map
because one of the input files, the reference FASTA, is newer than the output files (we just created it!).
Then, it will rerun rule count
because that depends on the output files of rule map
, which have just changed.
If we hadn’t just created the reference FASTA file, Snakemake would not have rerun: it will not rerun based on changes in the Snakefile code.
(However, it will keep track of code changes and you can force a rerun based on files affected by such changes using:
snakemake -j1 -R $(snakemake --list-code-changes)
)
To test this without actually rerunning anything, we can use the -n
flag to do a dry-run. Moreover, if we also use the -r
flag, Snakemake will tell us the reason for the execution of every job.
snakemake -j1 -nr
# [...]
[Mon Apr 5 21:14:35 2021]
rule map:
input: res/smpC_trim.fastq, metadata/ref.fa
output: res/smpC.bam
jobid: 2
reason: Updated input files: metadata/ref.fa # "REASON" GIVEN HERE
wildcards: smp=smpC
# [...]
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 1 count
#> 3 map
#> 5
#> scripts/map.sh res/smpA_trim.fastq metadata/ref.fa > res/smpA.bam
#> scripts/map.sh res/smpG_trim.fastq metadata/ref.fa > res/smpG.bam
#> scripts/map.sh res/smpC_trim.fastq metadata/ref.fa > res/smpC.bam
#> scripts/count.sh res/smpC.bam res/smpA.bam res/smpG.bam > res/count_table.txt
As you may recall, FastQC performs quality control of FASTQ files (see these slides from week 6). FastQC should be run on the raw and/or trimmed FASTQ files. If you want, you can implement both, but the solution below will do so just for the raw FASTQ files.
Create a script scripts/fastqc.sh containing:
#!/bin/bash
set -e -u -o pipefail
infile=$1
outdir=$2
outfile="$outdir"/$(basename "$infile" .fastq).fastqc.html
echo "Running FastQC for file: $infile , to output dir: $outdir"
echo "FastQC results for file $infile" > "$outfile"
cat "$infile" > "$outfile"
Make the script scripts/fastqc.sh executable:
chmod u+x scripts/fastqc.sh
Figure out what the script does and create your rule fastqc
accordingly.
As the output dir, you can just use the res
(results) dir we have been using, or optionally a subdir of that.
The script will write to an output file <output-dir>/<sample>.fastqc.html
. That is, you don’t specify the output file to the script, only the output directory. This is like the behavior of the actual FastQC program.
The script will also print a line to standard out (i.e., not to the output file).
In the Snakefile rule, you do want an output
directive, but you don’t reference {output}
in the shell
directive. Instead, you just provide the output dir as the second argument to the script.
rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    shell:
        "scripts/fastqc.sh {input} res"
What will happen now if you run Snakemake?
(Without providing it a rule or file to produce.)
Think about what the answer should be before you try.
It will not run anything:
snakemake -j1 -pr
#> Building DAG of jobs...
#> Nothing to be done.
This is because the FastQC output files are not being used by any other rule, nor are they listed in rule all
.
What will happen if you ask Snakemake to specifically run the FastQC rule?
Think about what the answer should be before you try.
You will get a WorkflowError
:
snakemake -j1 -pr fastqc
#> WorkflowError:
#> Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
The target rule
referenced in the error is fastqc
, which we targeted in the command-line call.
The problem is that rule fastqc
contains wildcards and Snakemake can’t resolve what their values should be. This may be surprising but recall that Snakemake works backwards in the sense that it looks at output files first (rather than first inferring input files from existing files), and here, it does not know which output files to produce.
You could specify a specific output file for one of the samples, e.g. res/smpA.fastqc.html
:
snakemake -j1 -pq res/smpA.fastqc.html
#> Job counts:
#> count jobs
#> 1 fastqc
#> 1
#> scripts/fastqc.sh data/smpA.fastq res
#> Running FastQC for file: data/smpA.fastq , to output dir: res
Note above that the line the script printed to standard output (Running FastQC for file: $infile , to output dir: $outdir
) is indeed simply printed to standard out among the other output of Snakemake.
You should include the output files of the FastQC script in the input
of rule all
, using the expand()
function.
Add a line to rule all
so that it becomes:
rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES),
Rerun Snakemake:
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 2 fastqc
#> 3
#> scripts/fastqc.sh data/smpC.fastq res
#> Running FastQC for file: data/smpC.fastq , to output dir: res
#> scripts/fastqc.sh data/smpG.fastq res
#> Running FastQC for file: data/smpG.fastq , to output dir: res
Trigger an error in one of the shell scripts: break rule map by removing {input.ref} from the shell directive.
After you’ve run the workflow and inspected the error, undo the change to rule map you just made.
Remove the output files:
rm res/*
Rule map
should look as follows:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input.fastq} > {output}"
Run the workflow:
snakemake -j1 -pr
#> Error in rule map:
#> jobid: 2
#> output: res/smpG.bam
#> shell:
#> scripts/map.sh res/smpG_trim.fastq > res/smpG.bam
#> (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
#>
#> Removing output files of failed job map since they might be corrupted:
#> res/smpG.bam
#> scripts/map.sh: line 6: $2: unbound variable
#> [Fri Apr 2 18:33:26 2021]
#> Error in rule map:
#> jobid: 4
#> output: res/smpC.bam
#> shell:
#> scripts/map.sh res/smpC_trim.fastq > res/smpC.bam
#> (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
#>
#> Removing output files of failed job map since they might be corrupted:
#> res/smpC.bam
#> Shutting down, this might take some time.
#> Exiting because a job execution failed. Look above for error message
Undo the change:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} > {output}"
The actual error produced by the bash script is what you would likely be interested in if troubleshooting:
scripts/map.sh: line 6: $2: unbound variable.
However, this was buried among all the other Snakemake output. Earlier, we saw that a line of standard output was similarly printed to screen among the Snakemake output. And the same would be true for logging information and errors produced by programs like the actual FastQC.
This is not optimal when errors do occur – in the next section, we will therefore use log files.
Do you still have the output files from rule map?
Nope! These were removed by Snakemake because of the errors. You may have noticed this in the error messages above, where it said, for example:
#> Removing output files of failed job map since they might be corrupted:
#> res/smpG.bam
The log directive
It would be better to save logging and error output to “log” files for each rule – or each job within a rule, more precisely.
For example, we may be inclined to do the following for a script/program that:
Saves output data to file, and prints logging/errors to screen (standard out and/or standard error – recall that &>
redirects both):
"myscript.py -i {input} -o {output} &> log/myscript.log" shell:
Prints output data to standard out, and logging/errors to standard error (recall that 2>
redirects standard error):
"myscript.py -i {input} > {output} 2> log/myscript.log" shell:
But should we also tell Snakemake explicitly about such a file, like we do for input
and output
files? Yes, but it’s best to use a separate directive for this: log
.
This is convenient because Snakemake treats log
files differently than regular output files: if a job fails, Snakemake will not delete the log files. As we saw above, Snakemake does delete the regular output files of jobs that produced errors, so if we were to designate logging output as regular output, we would not be able to examine our errors.
Example usage of a log
key – note that we can use any wildcards that are also used in {output}
:
rule myscript:
    # [...]
    output: "res/{sample}.out"
    log: "log/myscript_{sample}.log"
    shell: "myscript.py -i {input} > {output} 2> {log}"
Add a log
directive to each rule and rerun Snakemake.
You’ll also have to redirect standard out and/or standard error (this will depend on the rule!) to the log file in each rule.
Use a different dir for logging output than the res
dir, e.g. log
, perhaps best with subdirs for each rule.
After you have run Snakemake, check that you now have a log file for each job – though most will be empty in the absence of errors; only the logs for the fastqc
rule should contain something.
SAMPLES = glob_wildcards("data/{smp}.fastq").smp
REF_FA = "metadata/ref.fa"


rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES),


rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"


rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"


rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"
rm res/*
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 1 count
#> 3 fastqc
#> 3 map
#> 3 trim
#> 11
#> scripts/fastqc.sh data/smpG.fastq res &>log/fastqc/smpG.log
#> scripts/fastqc.sh data/smpA.fastq res &>log/fastqc/smpA.log
#> scripts/trim.sh data/smpG.fastq >res/smpG_trim.fastq 2>log/trim/smpG.log
#> scripts/trim.sh data/smpC.fastq >res/smpC_trim.fastq 2>log/trim/smpC.log
#> scripts/trim.sh data/smpA.fastq >res/smpA_trim.fastq 2>log/trim/smpA.log
#> scripts/fastqc.sh data/smpC.fastq res &>log/fastqc/smpC.log
#> scripts/map.sh res/smpA_trim.fastq metadata/ref.fa >res/smpA.bam 2>log/map/smpA.log
#> scripts/map.sh res/smpG_trim.fastq metadata/ref.fa >res/smpG.bam 2>log/map/smpG.log
#> scripts/map.sh res/smpC_trim.fastq metadata/ref.fa >res/smpC.bam 2>log/map/smpC.log
#> scripts/count.sh res/smpG.bam res/smpC.bam res/smpA.bam >res/count_table.txt 2>log/count/count.log
tree log
#> log
#> ├── count
#> │   └── count.log
#> ├── fastqc
#> │   ├── smpA.log
#> │   ├── smpC.log
#> │   └── smpG.log
#> ├── map
#> │   ├── smpA.log
#> │   ├── smpC.log
#> │   └── smpG.log
#> └── trim
#>     ├── smpA.log
#>     ├── smpC.log
#>     └── smpG.log
Only the fastqc log files should have nonzero file sizes, which you can quickly check by running ls -lhR log.
Since we are running dummy scripts, it has been fine to run Snakemake on a login node at OSC. For any real workflow, this would not work.
You have two main options:
For small workflows that can be run on one node or less than that, you can run Snakemake as a single SLURM job, either on an interactive node (especially if the workflow is particularly small) or by submitting a script with sbatch
.
Snakemake can orchestrate SLURM job submissions for you, in which case every Snakemake job (every instance of every rule) will be submitted as a separate job.
Jobs that depend on others will only be allowed to run when those other jobs have successfully finished, so you can submit a very complex and potentially long-running workflow with a single Snakemake call!
Because the main Snakemake process will keep running as long as one of its constituent jobs is, you will still need to run Snakemake itself as a SLURM job too. Even though this process may not take substantial computing resources, it may have to run for long time – and recall that at OSC, any process that runs for more than 20 minutes on a login node is killed.
Write a Bash script with SLURM instructions to run Snakemake.
The job should request 1 hour, 1 task, 1 core, should be billed to the classroom project PAS1855
, and it should run our current Snakefile.
Save the script as workflow/snakemake_submit.sh
.
#!/bin/bash
#SBATCH --account=PAS1855
#SBATCH --time=60
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
set -e -u -o pipefail
snakemake -j1 -p
So far, we have been using -j1
(i.e. --jobs 1
), such that Snakemake will only run one process at a time. Several of our rules run the same script separately for each sample, so those jobs could be run at the same time (in parallel).
Modify the bash script to parallelize the workflow.
How many simultaneous jobs (processes) would it make sense to run at the same time?
Which SLURM directive corresponds to Snakemake’s -j
flag: --ntasks
or --cpus-per-task
?
How could you make sure there is no mismatch between the resources requested from SLURM and the number of jobs provided to Snakemake?
For the last question: you should use a SLURM environment variable. The environment variable in question is $SLURM_NTASKS.
It would make sense to run 6 jobs at the same time:
Our workflow currently runs 3 samples, with several rules therefore creating 3 separate jobs that could be run in parallel.
Because rule fastqc and rule trim both operate on the raw FASTQ files independently, their jobs can also be run in parallel with each other.
Note that you could easily forego this assessment at OSC and just set a maximum that you think is fair, like 100 jobs – but it is a useful exercise to think about this all the same.
The --ntasks
directive effectively corresponds to Snakemake’s jobs (-j
):
each SLURM “task” is a separate process.
If we instead increased --cpus-per-task, we would just allocate more CPUs/cores/threads to each individual process, but we want to run multiple processes in parallel.
#!/bin/bash
#SBATCH --account=PAS1855
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=1
set -e -u -o pipefail
snakemake -j"$SLURM_NTASKS" -p
Submit the script and check the SLURM log file (slurm-<job-id>.out) to see if it worked:
rm -r res log/*
sbatch workflow/snakemake_submit.sh
At its most basic, all you need to do is add --cluster sbatch
to the Snakemake command, and Snakemake will automatically submit a SLURM job for each of its jobs. But recall that at OSC, we also always need to specify an account.
So, we could use the following to have Snakemake submit at most 6 jobs at a time:
snakemake -j6 --cluster "sbatch --account=PAS1855"
For simple and quick workflows, providing SLURM options on the command line like we did above is reasonable, but if we have several non-default options, this will start to become a painfully long command.
For instance, you may note that the SLURM log files have currently all ended up in your project’s root dir, where you probably don’t want them.
We can also specify these options in a configuration file using a “profile”. This will additionally make it easier to customize SLURM options on a per-rule basis.
To use a profile, we should first create a directory. The name of this directory is arbitrary, but since we are using it in the context of providing settings for SLURM jobs, something like slurm_profile
would make sense:
mkdir slurm_profile
Inside this directory, Snakemake will expect a file called config.yaml
:
touch slurm_profile/config.yaml
In config.yaml
, we can provide a string for the cluster
key, much like we did previously with the --cluster
argument on the command-line:
cluster: "sbatch --account={resources.account}
--mem={resources.mem_mb}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [mem_mb=4000, account=PAS1855]
Above, we use {resources.<resource-name>}
instead of actual values in the cluster
key. Then, the values for each <resource-name>
are specified for the default-resources
key. This setup is convenient because it allows us to also refer to the same resources in the Snakefile to set rule-specific values, as we’ll see a little later.
Create the profile directory and config file as shown above, and additionally specify requests for time (in minutes) and number of CPUs.
Note that these resource requests are now on a per-job basis, not for the whole workflow combined.
Contents of slurm_profile/config.yaml
:
cluster: "sbatch --account={resources.account}
--time={resources.time_min}
--mem={resources.mem_mb}
--cpus-per-task={resources.cpus}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]
To run Snakemake with the profile, we need to use the option --profile
, and specify our profile dir: --profile <profile_dir>
.
rm -r res log/*
snakemake -j6 --profile slurm_profile
Above, we also specified an output dir and names for our SLURM log files. If this directory does not already exist, make sure you create it manually, because Snakemake will not create it, and mysterious failures will occur!
So, Snakemake will create directories specified in the Snakefile as output, but won’t create directories that are only mentioned in this config file.
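For example, with the profile above you could create the dir for the SLURM log files up front:
# Create the dir used for --output in the profile, if it doesn’t exist yet:
mkdir -p log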
config.yaml
can contain not just cluster settings, but anything that can be set with command-line options.
We can take that opportunity to also:
Specify the number of jobs (--jobs
option).
Make Snakemake wait longer for output files to appear using the latency-wait
option: units are in seconds and we want 30 seconds.
(I’ve had errors with shorter latency times.)
Add key-value pairs for the number of jobs and the latency to the config file.
Note that in YAML, key-value pairs are specified as key:value
, not --key value
.
jobs: 100
latency-wait: 30
The full slurm_profile/config.yaml
should now contain the following:
cluster: "sbatch --account={resources.account}
--time={resources.time_min}
--mem={resources.mem_mb}
--cpus-per-task={resources.cpus}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]
jobs: 100
latency-wait: 30
Run Snakemake with the updated profile.
Recall that now, you don’t have to specify -j
/ --jobs
at the command line.
Pay attention to the job submission pattern of Snakemake. You can also open up another terminal and monitor your jobs:
squeue -u $USER -l
rm -r res log/*
snakemake --profile slurm_profile
In practice, you may not want to submit a cluster job for every rule.
For instance, rule all
is a Snakemake job, but it doesn’t actually run anything the way we have set it up. Additionally, you may have some very lightweight cleaning/logging rules.
To tell Snakemake that certain rules should not be submitted to the cluster, include a comma-separated list of rules near the top of your Snakefile with the localrules
key:
localrules: all, clean
Add a localrules directive for rule all to your Snakefile, and run Snakemake again.
Your full Snakefile should now contain the following:
SAMPLES = glob_wildcards("data/{smp}.fastq").smp
REF_FA = "metadata/ref.fa"

localrules: all


rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES),


rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"


rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"


rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"
rm -r res log/*
snakemake --profile slurm_profile
Because the number of threads is a parameter that commonly differs between rules, Snakemake has a threads
directive that we can add to any rule. This will tell Snakemake how many threads it should use for any single job that the rule in question generates.
Which SLURM directive do you think this corresponds to? It is either --cpus-per-task or --ntasks.
The threads
directive in a Snakefile corresponds to the --cpus-per-task
SLURM directive. (Recall that SLURM uses the terms CPU, cores and threads roughly equivalently.)
When specifying multiple --cpus-per-task
(SLURM) / threads
(Snakemake), you are allowing a single process (task) to perform multithreading.
When using the threads directive, we will generally also tell the program itself about the number of threads – see the example below, where we are calling the STAR aligner directly in a Snakemake rule:
rule STAR:
    threads: 8
    shell:
        "STAR --runThreadN {threads} ..."
For now, we’ll keep it simple and merely use the threads directive:
Add a threads directive to rule map.
Rule map should now contain the following:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    threads: 4
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"
We can use a resources
key for any rule to specify (mostly) arbitrary resource key-value pairs:
rule heavy_stuff:
    # [...]
    resources: mem_mb=50000
    # [...]
These are arbitrary in the sense that mem_mb
will not directly set the actual maximum memory usage in the way that the threads
directive did set the number of threads. Instead, they refer to the keys we use in our config.yaml
.
For example, to actually request 50,000 MB of memory for rule heavy_stuff
, mem_mb
should correspond to the term used in the SLURM resource request in config.yaml
:
cluster: "sbatch --mem={resources.mem_mb} ..."
default-resources: [mem_mb=4000, ...]
Setting resources: mem_mb=50000
for rule heavy_stuff
will then override the default value of 4000
, and pass that on to the SLURM job request!
Set the time limit for rule map to 2 hours.
Assuming time is referred to as time_min
in slurm_profile/config.yaml
(this has to match!), rule map
should now contain the following:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    threads: 8
    resources: time_min=120
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"
To check whether these resource requests were actually applied to your SLURM jobs, there are a few different options:
You can look up SLURM’s “input environment variables” (see the sbatch documentation) and echo them in your script. For the time limit specifically, this variable is $SBATCH_TIMELIMIT, so you could include the following in your scripts/map.sh:
echo "Time limit: $SBATCH_TIMELIMIT"
You can print an overview of resources and statistics for your job using scontrol show job <job-ID>. You can do that manually in the shell for a specific SLURM job, or include the following in your script to have this information included in the SLURM log file:
scontrol show job $SLURM_JOB_ID
The allotted time specifically is also shown when you issue squeue
with the -l
(long) option:
squeue -u $USER -l
Note that it doesn’t seem to be possible to install this extension in VS Code Server at OSC, but you can install it locally.↩︎
This may look a little strange (but think 1-element tuple!), yet is somewhat useful: when every line has a trailing comma, you can always add and remove lines without paying attention to whether a new item follows.↩︎