In these exercises, you’ll take up where we left off in class, using a dummy RNA-seq workflow.
Load the Conda environment in which you have Snakemake installed:
$ module load python/3.6-conda5.2
$ source activate ipy-env
# Check if Snakemake can be found and is inside your ipy-env:
$ which snakemake
#> ~/.conda/envs/ipy-env/bin/snakemake
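Optionally, you can also check the Snakemake version – any reasonably recent version should work for these exercises (this is just a sanity check; the exact version printed will differ per installation):
$ snakemake --version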
We will recreate a very similar set of files and the final Snakefile from class – copy the files from /fs/ess/PAS1855/exercises/week13/ to do so:
$ cd /fs/ess/PAS1855/users/$USER/week13
$ cp -r /fs/ess/PAS1855/exercises/week13/exercise1 .
$ cd exercise1
$ tree
#> .
#> ├── data
#> │   ├── smpA.fastq
#> │   ├── smpC.fastq
#> │   └── smpG.fastq
#> ├── res
#> ├── scripts
#> │   ├── count.sh
#> │   ├── map.sh
#> │   └── trim.sh
#> └── workflow
#>     ├── DAGs
#>     └── Snakefile
Create the dir structure and the dummy FASTQ files:
mkdir -p data scripts res workflow/DAGs
echo "AAAAAAA" > data/smpA.fastq
echo "CCCCCCC" > data/smpC.fastq
echo "GGGGGGG" > data/smpG.fastqCreate a script scripts/trim.sh containing:
#!/bin/bash
set -e -u -o pipefail
echo "FASTQ file: $1 after trimming"
cat "$1"Create a script scripts/map.sh containing:
#!/bin/bash
set -e -u -o pipefail
echo "BAM from FASTQ file: $1"
cat "$1"Create a script scripts/count.sh containing:
#!/bin/bash
set -e -u -o pipefail
echo "Counts for $# BAM files:"
cat $@
Make the scripts executable:
chmod +x scripts/*
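If you like, you can test one of the scripts by hand before wiring it into the workflow – for example, with the dummy smpA.fastq file created above, trim.sh should produce:
scripts/trim.sh data/smpA.fastq
#> FASTQ file: data/smpA.fastq after trimming
#> AAAAAAA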
"""
A dummy RNAseq workflow with FASTQ files as input and a count table as output.
"""
SAMPLES=glob_wildcards("data/{smp}.fastq").smp
rule all:
"""
This is a "pseudo-rule" meant to trigger the execution of the full workflow.
"""
input:
"res/count_table.txt",
rule trim:
"""
Trim the FASTQ files.
"""
input:
"data/{smp}.fastq",
output:
"res/{smp}_trim.fastq",
shell:
"scripts/trim.sh {input} > {output}"
rule map:
"""
Map the trimmed FASTQ files to a reference genome.
"""
input:
"res/{smp}_trim.fastq",
output:
"res/{smp}.bam",
shell:
"scripts/map.sh {input} > {output}"
rule count:
"""
From the BAM files, create a gene count table.
"""
input:
expand("res/{smp}.bam", smp=SAMPLES),
output:
"res/count_table.txt",
shell:
"scripts/count.sh {input} > {output}"Currently, our Snakefile workflow/Snakefile contains the following:
"""
A dummy RNAseq workflow with FASTQ files as input and a count table as output.
"""
SAMPLES = glob_wildcards("data/{smp}.fastq").smp
rule all:
"""
This is a "pseudo-rule" meant to trigger the execution of the full workflow.
"""
input:
"res/count_table.txt",
rule trim:
"""
Trim the FASTQ files.
"""
input:
"data/{smp}.fastq",
output:
"res/{smp}_trim.fastq",
shell:
"scripts/trim.sh {input} > {output}"
rule map:
"""
Map the trimmed FASTQ files to a reference genome.
"""
input:
"res/{smp}_trim.fastq",
output:
"res/{smp}.bam",
shell:
"scripts/map.sh {input} > {output}"
rule count:
"""
From the BAM files, create a gene count table.
"""
input:
expand("res/{smp}.bam", smp=SAMPLES),
output:
"res/count_table.txt",
shell:
"scripts/count.sh {input} > {output}"Note that some formatting details in the Snakefile above are slightly different from what was shown in class. I used the Snakefmt VS Code extension, which lets a little program of the same name format Snakefiles for you.1
rule count:
    """
    From the BAM files, create a gene count table.
    """
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    shell:
        "scripts/count.sh {input} > {output}"

Snakefmt will detect syntax errors and ensure consistent and clear formatting, as shown for rule count above:
The value for every rule is on a new, indented line.
Every item in input and output is trailed by a comma.2
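If you have the snakefmt program installed (it is a separate Python package – installing it is an assumption here, not something we did above), you can also run it from the command line:
# Show what snakefmt would change, without modifying the file:
snakefmt --diff workflow/Snakefile
# Reformat the file in place:
snakefmt workflow/Snakefile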
Furthermore, I have added some Python multi-line (triple-quote) comments explaining what the Snakefile and each individual rule does.
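A quick aside on the glob_wildcards() line at the top of the Snakefile: it globs for files matching the given pattern and returns the values of the wildcard, so SAMPLES becomes a list of sample IDs. You can check this for yourself with a one-liner, run from the exercise1 dir with the ipy-env environment active (the order of the list may differ, since it depends on the filesystem):
python -c 'from snakemake.io import glob_wildcards; print(glob_wildcards("data/{smp}.fastq").smp)'
#> ['smpA', 'smpC', 'smpG']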
To visualize the workflow as a DAG (directed acyclic graph):
snakemake --dag | dot -T svg > workflow/DAGs/v1.svg
Below, Snakemake is being run with the -p (print shell commands) and -q (quiet) options for concise output. To better understand what Snakemake is doing, you may want to see more output yourself, e.g. with the -pr combination of options (-r: reason for execution of each job).
Also, recall that you can always do a “dry run” with the -n option to see what Snakemake wants to execute.
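For example, a dry run that also prints the shell commands (simply combining the -n and -p flags mentioned above):
snakemake -j1 -np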
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 1 count
#> 3 map
#> 3 trim
#> 8
#> scripts/trim.sh data/smpG.fastq > res/smpG_trim.fastq
#> scripts/map.sh res/smpG_trim.fastq > res/smpG.bam
#> scripts/trim.sh data/smpC.fastq > res/smpC_trim.fastq
#> scripts/trim.sh data/smpA.fastq > res/smpA_trim.fastq
#> scripts/map.sh res/smpA_trim.fastq > res/smpA.bam
#> scripts/map.sh res/smpC_trim.fastq > res/smpC.bam
#> scripts/count.sh res/smpG.bam res/smpC.bam res/smpA.bam > res/count_table.txt
Currently, our map and count rules don’t include any genomic reference files: mapping should happen against a reference genome (FASTA file), and counting needs an annotation for the reference genome (GTF/GFF file).
For now, we will fix this oversight just for the mapping rule. Let’s create a dummy genome FASTA file:
mkdir metadata
echo "AAAAAAACCCCCCCGGGGGGG" > metadata/ref.faWe’ll add a line to the map.sh script, such that the second argument passed to it is now the FASTA file with the reference genome:
echo 'echo "Using ref genome: $2" && cat $2' >> scripts/map.sh
To do so, you’ll need to do the following in the map rule:
Pass the reference FASTA file to the map.sh script in the shell directive.
While not strictly necessary to avoid errors, you should include the reference FASTA file in the input directive.
Because you will now have multiple items for input, it will be good to name them – the syntax for that is: input: myname=myfile, and you can then recall this using {input.myname} in the shell directive.
Moreover, it is good practice to define hard-coded input files like this one clearly at the top of your script/Snakefile (or use a configuration file).
Assign the filename somewhere near the top:
REF_FA = "metadata/ref.fa"Adjust rule map:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} > {output}"

Snakemake will indeed rerun the rule map and rule count.
It will rerun rule map because one of the input files, the reference FASTA, is newer than the output files (we just created it!).
Then, it will rerun rule count because that depends on the output files of rule map, which have just changed.
If we hadn’t just created the reference FASTA file, Snakemake would not have rerun anything: it will not rerun based on changes in the Snakefile code.
(However, it will keep track of code changes and you can force a rerun based on files affected by such changes using:
snakemake -j1 -R $(snakemake --list-code-changes))
To test this without actually rerunning anything, we can use the -n flag to do a dry-run. Moreover, if we also use the -r flag, Snakemake will tell us the reason for the execution of every job.
snakemake -j1 -nr
# [...]
[Mon Apr 5 21:14:35 2021]
rule map:
input: res/smpC_trim.fastq, metadata/ref.fa
output: res/smpC.bam
jobid: 2
reason: Updated input files: metadata/ref.fa # "REASON" GIVEN HERE
wildcards: smp=smpC
# [...]
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 1 count
#> 3 map
#> 5
#> scripts/map.sh res/smpA_trim.fastq metadata/ref.fa > res/smpA.bam
#> scripts/map.sh res/smpG_trim.fastq metadata/ref.fa > res/smpG.bam
#> scripts/map.sh res/smpC_trim.fastq metadata/ref.fa > res/smpC.bam
#> scripts/count.sh res/smpC.bam res/smpA.bam res/smpG.bam > res/count_table.txt
As you may recall, FastQC performs quality control of FASTQ files (see these slides from week 6). FastQC should be run on the raw and/or trimmed FASTQ files. If you want, you can implement both, but the solution below will do so just for the raw FASTQ files.
scripts/fastqc.sh:
#!/bin/bash
set -e -u -o pipefail
infile=$1
outdir=$2
outfile="$outdir"/$(basename "$infile" .fastq).fastqc.html
echo "Running FastQC for file: $infile , to output dir: $outdir"
echo "FastQC results for file $infile" > "$outfile"
cat "$infile" > "$outfile"scripts/fastqc.sh executable.
chmod u+x scripts/fastqc.sh
Figure out what the script does and create your rule fastqc accordingly.
As the output dir, you can just use the res (results) dir we have been using, or optionally a subdir of that.
The script will write to an output file <output-dir>/<sample>.fastqc.html. That is, you don’t specify the output file to the script, only the output directory. This is like the behavior of the actual FastQC program.
The script will also print a line to standard out (i.e., not to the output file).
In the Snakefile rule, you do want an output directive, but you don’t reference {output} in the shell directive. Instead, you just provide the output dir as the second argument to the script.
rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    shell:
        "scripts/fastqc.sh {input} res"

What will happen now if you run Snakemake?
(Without providing it a rule or file to produce.)
Think about what the answer should be before you try.
It will not run anything:
snakemake -j1 -pr
#> Building DAG of jobs...
#> Nothing to be done.
This is because the FastQC output files are not being used by any other rule, nor are they listed in rule all.
What will happen if you ask Snakemake to specifically run the FastQC rule?
Think about what the answer should be before you try.
You will get a WorkflowError:
snakemake -j1 -pr fastqc
#> WorkflowError:
#> Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
The target rule referenced in the error is fastqc, which we targeted in the command-line call.
The problem is that rule fastqc contains wildcards and Snakemake can’t resolve what their values should be. This may be surprising but recall that Snakemake works backwards in the sense that it looks at output files first (rather than first inferring input files from existing files), and here, it does not know which output files to produce.
You could specify a specific output file for one of the samples, e.g. res/smpA.fastqc.html:
snakemake -j1 -pq res/smpA.fastqc.html
#> Job counts:
#> count jobs
#> 1 fastqc
#> 1
#> scripts/fastqc.sh data/smpA.fastq res
#> Running FastQC for file: data/smpA.fastq , to output dir: res
Note above that the line the script printed to standard output (Running FastQC for file: $infile , to output dir: $outdir) is indeed simply printed to standard out among the other output of Snakemake.
You should include the output files of the FastQC script in the input of rule all, using the expand() function.
Add a line to rule all so that it becomes:
rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES)

Rerun Snakemake:
snakemake -pq
#> Job counts:
#> count jobs
#> 1 all
#> 2 fastqc
#> 3
#> scripts/fastqc.sh data/smpC.fastq res
#> Running FastQC for file: data/smpC.fastq , to output dir: res
#> scripts/fastqc.sh data/smpG.fastq res
#> Running FastQC for file: data/smpG.fastq , to output dir: res
Trigger an error in one of the shell scripts: break the rule map you just made by removing {input.ref} from its shell directive.
Remove the output files:
rm res/*
Rule map should look as follows:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input.fastq} > {output}"

Run the workflow:
snakemake -j1 -pr
#> Error in rule map:
#> jobid: 2
#> output: res/smpG.bam
#> shell:
#> scripts/map.sh res/smpG_trim.fastq > res/smpG.bam
#> (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
#>
#> Removing output files of failed job map since they might be corrupted:
#> res/smpG.bam
#> scripts/map.sh: line 6: $2: unbound variable
#> [Fri Apr 2 18:33:26 2021]
#> Error in rule map:
#> jobid: 4
#> output: res/smpC.bam
#> shell:
#> scripts/map.sh res/smpC_trim.fastq > res/smpC.bam
#> (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
#>
#> Removing output files of failed job map since they might be corrupted:
#> res/smpC.bam
#> Shutting down, this might take some time.
#> Exiting because a job execution failed. Look above for error message
Undo the change:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} > {output}"

The actual error produced by the bash script is what you would likely be interested in if troubleshooting:
scripts/map.sh: line 6: $2: unbound variable
However, this was buried among all the other Snakemake output. Earlier, we saw that a line of standard output was similarly printed to screen among the Snakemake output. And the same would be true for logging information and errors produced by programs like the actual FastQC.
This is not optimal when errors do occur – in the next section, we will therefore use log files.
Do you still have the output (BAM) files from rule map?
Nope! These were removed by Snakemake because of the errors. You may have noticed this in the error messages above, where it said, for example:
#> Removing output files of failed job map since they might be corrupted:
#> res/smpG.bam
The log directive
It would be better to save logging and error output to “log” files for each rule – or each job within a rule, more precisely.
For example, we may be inclined to do the following for a script/program that:
Saves output data to file, and prints logging/errors to screen (standard out and/or standard error – recall that &> redirects both):
shell: "myscript.py -i {input} -o {output} &> log/myscript.log"Prints output data to standard out, and logging/errors to standard error (recall that 2> redirects standard error):
shell: "myscript.py -i {input} > {output} 2> log/myscript.log"But should we also tell Snakemake explicitly about such a file, like we do for input and output files? Yes, but it’s best to use a separate directive for this: log.
This is convenient because Snakemake treats log files differently than regular output files: if a job fails, Snakemake will not delete the log files. As we saw above, Snakemake does delete regular output files of jobs that produced errors, so if we were to designate logging output as regular output, we would not be able to examine our errors.
Example usage of a log key – note that we can use any wildcards that are also used in {output}:
rule myscript:
    # [...]
    output: "res/{sample}.out"
    log: "log/myscript_{sample}.log"
    shell: "myscript.py -i {input} > {output} 2> {log}"

Add a log directive to each rule and rerun Snakemake.
You’ll also have to redirect standard out and/or standard error (this will depend on the rule!) to the log file in each rule.
Use a different dir for logging output than the res dir, e.g. log, perhaps best with subdirs for each rule.
After you have run Snakemake, check that you now have a log file for each job – though most will be empty in the absence of errors; only the logs for the fastqc rule should contain something.
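One quick way to list only the non-empty log files (just one possible approach, using standard find options):
find log -type f ! -empty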
SAMPLES = glob_wildcards("data/{smp}.fastq").smp
REF_FA = "metadata/ref.fa"


rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES)


rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"


rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"


rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"

rm res/*
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 1 count
#> 3 fastqc
#> 3 map
#> 3 trim
#> 11
#> scripts/fastqc.sh data/smpG.fastq res &>log/fastqc/smpG.log
#> scripts/fastqc.sh data/smpA.fastq res &>log/fastqc/smpA.log
#> scripts/trim.sh data/smpG.fastq >res/smpG_trim.fastq 2>log/trim/smpG.log
#> scripts/trim.sh data/smpC.fastq >res/smpC_trim.fastq 2>log/trim/smpC.log
#> scripts/trim.sh data/smpA.fastq >res/smpA_trim.fastq 2>log/trim/smpA.log
#> scripts/fastqc.sh data/smpC.fastq res &>log/fastqc/smpC.log
#> scripts/map.sh res/smpA_trim.fastq metadata/ref.fa >res/smpA.bam 2>log/map/smpA.log
#> scripts/map.sh res/smpG_trim.fastq metadata/ref.fa >res/smpG.bam 2>log/map/smpG.log
#> scripts/map.sh res/smpC_trim.fastq metadata/ref.fa >res/smpC.bam 2>log/map/smpC.log
#> scripts/count.sh res/smpG.bam res/smpC.bam res/smpA.bam >res/count_table.txt 2>log/count/count.log
tree log
#> log
#> ├── count
#> │   └── count.log
#> ├── fastqc
#> │   ├── smpA.log
#> │   ├── smpC.log
#> │   └── smpG.log
#> ├── map
#> │   ├── smpA.log
#> │   ├── smpC.log
#> │   └── smpG.log
#> └── trim
#>     ├── smpA.log
#>     ├── smpC.log
#>     └── smpG.log
Only the fastqc log files should have nonzero file sizes, which you can quickly check by running ls -lhR log.
Since we are running dummy scripts, it has been fine to run Snakemake on a login node at OSC. For any real workflow, this would not work.
You have two main options:
For small workflows that need at most a single node, you can run Snakemake as a single SLURM job, either on an interactive node (especially if the workflow is particularly small) or by submitting a script with sbatch.
Snakemake can orchestrate SLURM job submissions for you, in which case every Snakemake job (every instance of every rule) will be submitted as a separate job.
Jobs that depend on others will only be allowed to run when those other jobs have successfully finished, so you can submit a very complex and potentially long-running workflow with a single Snakemake call!
Because the main Snakemake process will keep running as long as one of its constituent jobs is, you will still need to run Snakemake itself as a SLURM job too. Even though this process may not take substantial computing resources, it may have to run for a long time – and recall that at OSC, any process that runs for more than 20 minutes on a login node is killed.
Write a Bash script with SLURM instructions to run Snakemake.
The job should request 1 hour, 1 task, 1 core, should be billed to the classroom project PAS1855, and it should run our current Snakefile.
Save the script as workflow/snakemake_submit.sh.
#!/bin/bash
#SBATCH --account=PAS1855
#SBATCH --time=60
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
set -e -u -o pipefail
snakemake -j1 -p
So far, we have been using -j1 (i.e. --jobs 1), such that Snakemake will only run one process at a time. Several of our rules run the same script separately for each sample, so those jobs could be run at the same time (in parallel).
Modify the Bash script to parallelize the workflow.
How many simultaneous jobs (processes) would it make sense to run at the same time?
Which SLURM directive corresponds to Snakemake’s -j flag: --ntasks or --cpus-per-task?
How could you make sure there is no mismatch between the resources requested from SLURM and the number of jobs provided to Snakemake?
For the last question: you should use a SLURM environment variable. The environment variable in question is $SLURM_NTASKS.
It would make sense to run 6 jobs at the same time:
Our workflow currently runs 3 samples, with several rules therefore creating 3 separate jobs that could be run in parallel.
Because rule fastqc and rule trim both operate directly and independently on the raw FASTQ files, they can also be run in parallel.
Note that you could easily forego this assessment at OSC and just set a maximum that you think is fair, like 100 jobs – but it is a useful exercise to think about this all the same.
The --ntasks directive effectively corresponds to Snakemake’s jobs (-j):
each SLURM “task” is a separate process.
If we instead increased --cpus-per-task, we would just allocate more CPUs/cores/threads to each individual process, but we want to run multiple processes in parallel.
#!/bin/bash
#SBATCH --account=PAS1855
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=1
set -e -u -o pipefail
snakemake -j"$SLURM_NTASKS" -p slurm-<job-id>.out) to see if it worked.
rm -r res log/*
sbatch workflow/snakemake_submit.sh
At its most basic, all you need to do is add --cluster sbatch to the Snakemake command, and Snakemake will automatically submit a SLURM job for each of its jobs. But recall that at OSC, we also always need to specify an account.
So, we could use the following to have Snakemake submit at most 6 jobs at a time:
snakemake -j6 --cluster "sbatch --account=PAS1855"
For simple and quick workflows, providing SLURM options on the command line like we did above is reasonable, but if we have several non-default options, this will start to become a painfully long command.
For instance, you may note that the SLURM log files have currently all ended up in your project’s root dir, where you probably don’t want them.
We can also specify these options in a configuration file using a “profile”. This will additionally make it easier to customize SLURM options on a per-rule basis.
To use a profile, we should first create a directory. The name of this directory is arbitrary, but since we are using it in the context of providing settings for SLURM jobs, something like slurm_profile would make sense:
mkdir slurm_profile
Inside this directory, Snakemake will expect a file called config.yaml:
touch slurm_profile/config.yaml
In config.yaml, we can provide a string for the cluster key, much like we did previously with the --cluster argument on the command-line:
cluster: "sbatch --account={resources.account}
--mem={resources.mem_mb}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [mem_mb=4000, account=PAS1855]Above, we use {resources.<resource-name>} instead of actual values in the cluster key. Then, the values for each <resource-name> are specified for the default-resources key. This setup is convenient because it allows us to also refer to the same resources in the Snakefile to set rule-specific values, as we’ll see a little later.
Create the profile directory and config file as shown above, and additionally specify requests for time (in minutes) and number of CPUs.
Note that these resource requests are now on a per-job basis, not for the whole workflow combined.
Contents of slurm_profile/config.yaml:
cluster: "sbatch --account={resources.account}
--time={resources.time_min}
--mem={resources.mem_mb}
--cpus-per-task={resources.cpus}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]To run Snakemake with the profile, we need to use the option --profile, and specify our profile dir: --profile <profile_dir>.
rm -r res log/*
snakemake -j6 --profile slurm_profile
Above, we also specified an output dir and names for our SLURM log files. If this directory does not already exist, make sure you create it manually, because Snakemake will not create it, and mysterious failures will occur!
So, Snakemake will create directories specified in the Snakefile as output, but won’t create directories that are only mentioned in this config file.
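For our profile, this means the log dir used in the --output path above must exist before Snakemake submits any jobs; creating it defensively does no harm:
mkdir -p log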
config.yaml can contain not just cluster settings, but anything that can be set with command-line options.
We can take that opportunity to also:
Specify the number of jobs (--jobs option).
Make Snakemake wait longer for output files to appear using the latency-wait option: units are in seconds and we want 30 seconds.
(I’ve had errors with shorter latency times.)
Add key-value pairs for the number of jobs and the latency to the config file.
Note that in YAML, key-value pairs are specified as key:value, not --key value.
jobs: 100
latency-wait: 30
The full slurm_profile/config.yaml should now contain the following:
cluster: "sbatch --account={resources.account}
--time={resources.time_min}
--mem={resources.mem_mb}
--cpus-per-task={resources.cpus}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]
jobs: 100
latency-wait: 30Run Snakemake with the updated profile.
Recall that now, you don’t have to specify -j / --jobs at the command line.
Pay attention to the job submission pattern of Snakemake. You can also open up another terminal and monitor your jobs:
squeue -u $USER -l
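If you want that overview to refresh automatically, you can wrap it in watch (assuming watch is available on the login node, as it typically is on Linux systems):
watch -n 30 squeue -u $USER -l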
rm -r res log/*
snakemake --profile slurm_profile
In practice, you may not want to submit a cluster job for every rule.
For instance, rule all is a Snakemake job, but it doesn’t actually run anything the way we have set it up. Additionally, you may have some very lightweight cleaning/logging rules.
To tell Snakemake that certain rules should not be submitted to the cluster, include a comma-separated list of rules near the top of your Snakefile with the localrules key:
localrules: all, clean
Add a localrules line for rule all to your Snakefile, and run Snakemake again.
Your full Snakefile should now contain the following:
SAMPLES = glob_wildcards("data/{smp}.fastq").smp
REF_FA = "metadata/ref.fa"

localrules: all


rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES)


rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"


rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"


rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"

rm -r res log/*
snakemake --profile slurm_profile
Because the number of threads is a parameter that commonly differs between rules, Snakemake has a threads directive that we can add to any rule. This will tell Snakemake how many threads it should use for any single job that the rule in question generates.
Which SLURM option does the threads directive correspond to: --cpus-per-task or --ntasks?
The threads directive in a Snakefile corresponds to the --cpus-per-task SLURM directive. (Recall that SLURM uses the terms CPU, cores and threads roughly equivalently.)
When specifying multiple --cpus-per-task (SLURM) / threads (Snakemake), you are allowing a single process (task) to perform multithreading.
When using the threads directive, we will generally also tell the program itself about the number of threads – see the example below, where we are calling the STAR aligner directly in a Snakemake rule:
rule STAR:
    threads: 8
    shell: "STAR --runThreadN {threads} ..."

For now, we’ll keep it simple and merely use the threads directive:
Add a threads directive to rule map.
Rule map should now contain the following:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    threads: 4
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"

We can use a resources key for any rule to specify (mostly) arbitrary key-value pairs with resources:
rule heavy_stuff:
    # [...]
    resources: mem_mb=50000
    # [...]

These are arbitrary in the sense that mem_mb will not directly set the actual maximum memory usage in the way that the threads directive did set the number of threads. Instead, they refer to the keys we use in our config.yaml.
For example, to actually request 50,000 MB of memory for rule heavy_stuff, mem_mb should correspond to the term used in the SLURM resource request in config.yaml:
cluster: "sbatch --mem={resources.mem_mb} ..."
default-resources: [mem_mb=4000, ...]

Setting resources: mem_mb=50000 for rule heavy_stuff will then override the default value of 4000, and pass that on to the SLURM job request!
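With the profile we created earlier, the sbatch call that Snakemake constructs for a heavy_stuff job would then look roughly like this (a sketch for illustration only – Snakemake assembles the actual command itself, and the other values come from the profile’s default-resources):
sbatch --account=PAS1855 --time=5 --mem=50000 --cpus-per-task=1 ... <Snakemake job script>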
Set the time request for rule map to 2 hours.
Assuming time is referred to as time_min in slurm_profile/config.yaml (this has to match!), rule map should now contain the following:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    threads: 8
    resources: time_min=120
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"
How can you check whether these resource requests were actually applied to the SLURM jobs? There are a few different options:
You can look up SLURM environment variables (see the “input environment variables” section of the sbatch documentation) and echo them in the script. For the time limit specifically, this variable is $SBATCH_TIMELIMIT, so you could include the following in your scripts/map.sh:
echo "Time limit: $SBATCH_TIMELIMIT"You can print an overview of resources and statistics for your job using scontrol show job <job-ID>. You can do that manually in the shell for a specific SLURM job, or include the following in your script, to have this information include in the SLURM log file:
scontrol show job $SLURM_JOB_ID
The allotted time specifically is also shown when you issue squeue with the -l (long) option:
squeue -u $USER -l

1. Note that it doesn’t seem to be possible to install this extension in VS Code Server at OSC, but you can install it locally.
2. This may look a little strange (but think 1-element tuple!), yet is somewhat useful: when every line has a trailing comma, you can always add and remove lines without paying attention to whether a new item follows.