In these exercises, you’ll take up where we left off in class, using a dummy RNA-seq workflow.
Load the Conda environment in which you have Snakemake installed:
$ module load python/3.6-conda5.2
$ source activate ipy-env
# Check if Snakemake can be found and is inside your ipy-env:
$ which snakemake
#> ~/.conda/envs/ipy-env/bin/snakemake
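If you like, you can also check which Snakemake version is installed (any reasonably recent version should be fine for these exercises):
$ snakemake --version   # prints the installed version number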
We will recreate a very similar set of files and the final Snakefile from class – copy the files from /fs/ess/PAS1855/exercises/week13/ to do so:
$ cd /fs/ess/PAS1855/users/$USER/week13
$ cp -r /fs/ess/PAS1855/exercises/week13/exercise1 .
$ cd exercise1
$ tree
#> .
#> ├── data
#> │   ├── smpA.fastq
#> │   ├── smpC.fastq
#> │   └── smpG.fastq
#> ├── res
#> ├── scripts
#> │   ├── count.sh
#> │   ├── map.sh
#> │   └── trim.sh
#> └── workflow
#>     ├── DAGs
#>     └── Snakefile
Create the dir structure and the dummy FASTQ files:
mkdir -p data scripts res workflow/DAGs
echo "AAAAAAA" > data/smpA.fastq
echo "CCCCCCC" > data/smpC.fastq
echo "GGGGGGG" > data/smpG.fastq
Create a script scripts/trim.sh
containing:
#!/bin/bash
set -e -u -o pipefail
echo "FASTQ file: $1 after trimming"
cat "$1"
Create a script scripts/map.sh
containing:
#!/bin/bash
set -e -u -o pipefail
echo "BAM from FASTQ file: $1"
cat "$1"
Create a script scripts/count.sh
containing:
#!/bin/bash
set -e -u -o pipefail
echo "Counts for $# BAM files:"
cat "$@"
Make the scripts executable:
chmod +x scripts/*
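To see what kind of dummy output these scripts produce, you could try one by hand, e.g.:
$ scripts/trim.sh data/smpA.fastq
#> FASTQ file: data/smpA.fastq after trimming
#> AAAAAAA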
Create a file workflow/Snakefile
and paste into it:
"""
A dummy RNAseq workflow with FASTQ files as input and a count table as output.
"""
SAMPLES=glob_wildcards("data/{smp}.fastq").smp
rule all:
"""
This is a "pseudo-rule" meant to trigger the execution of the full workflow.
"""
input:
"res/count_table.txt",
rule trim:
"""
Trim the FASTQ files.
"""
input:
"data/{smp}.fastq",
output:
"res/{smp}_trim.fastq",
shell:
"scripts/trim.sh {input} > {output}"
rule map:
"""
Map the trimmed FASTQ files to a reference genome.
"""
input:
"res/{smp}_trim.fastq",
output:
"res/{smp}.bam",
shell:
"scripts/map.sh {input} > {output}"
rule count:
"""
From the BAM files, create a gene count table.
"""
input:
expand("res/{smp}.bam", smp=SAMPLES),
output:
"res/count_table.txt",
shell:
"scripts/count.sh {input} > {output}"
Currently, our Snakefile workflow/Snakefile
contains the following:
"""
A dummy RNAseq workflow with FASTQ files as input and a count table as output.
"""
SAMPLES = glob_wildcards("data/{smp}.fastq").smp
rule all:
"""
This is a "pseudo-rule" meant to trigger the execution of the full workflow.
"""
input:
"res/count_table.txt",
rule trim:
"""
Trim the FASTQ files.
"""
input:
"data/{smp}.fastq",
output:
"res/{smp}_trim.fastq",
shell:
"scripts/trim.sh {input} > {output}"
rule map:
"""
Map the trimmed FASTQ files to a reference genome.
"""
input:
"res/{smp}_trim.fastq",
output:
"res/{smp}.bam",
shell:
"scripts/map.sh {input} > {output}"
rule count:
"""
From the BAM files, create a gene count table.
"""
input:
expand("res/{smp}.bam", smp=SAMPLES),
output:
"res/count_table.txt",
shell:
"scripts/count.sh {input} > {output}"
Note that some formatting details in the Snakefile above are slightly different from what was shown in class. I used the Snakefmt
VS Code extension, which lets a little program of the same name format Snakefiles for you.1
rule count:
    """
    From the BAM files, create a gene count table.
    """
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    shell:
        "scripts/count.sh {input} > {output}"
Snakefmt
will detect syntax errors and ensure consistent and clear formatting, as shown for rule count
above:
The value for every directive is on a new, indented line.
Every item in input
and output
is trailed by a comma.2
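If you have the snakefmt command-line program installed (optional, e.g. via pip or conda), you could also format the Snakefile from the shell; these commands are just a sketch and the exercises don’t depend on them:
# Preview the changes snakefmt would make, without applying them:
snakefmt --diff workflow/Snakefile
# Reformat the file in place:
snakefmt workflow/Snakefile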
Furthermore, I have added some Python multi-line (triple-quote) comments explaining what the Snakefile and each individual rule does.
Create a visualization of the workflow DAG and store it in workflow/DAGs:
snakemake --dag | dot -T svg > workflow/DAGs/v1.svg
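If the per-sample DAG gets large, a condensed per-rule graph can be easier to read; Snakemake’s --rulegraph option produces one (a sketch, with a hypothetical output filename):
snakemake --rulegraph | dot -T svg > workflow/DAGs/rulegraph.svg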
Below, Snakemake is being run with the -p
(print shell commands) and -q
(quiet) options for concise output. To better understand what Snakemake is doing, you may want to see more output yourself, e.g. with the -pr
combination of options (-r
: reason for execution of each job).
Also, recall that you can always do a “dry run” with the -n
option to see what Snakemake wants to execute.
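For example, a dry run that also prints the shell commands and the reason for each job (you can combine the flags however you like):
snakemake -j1 -npr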
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 1 count
#> 3 map
#> 3 trim
#> 8
#> scripts/trim.sh data/smpG.fastq > res/smpG_trim.fastq
#> scripts/map.sh res/smpG_trim.fastq > res/smpG.bam
#> scripts/trim.sh data/smpC.fastq > res/smpC_trim.fastq
#> scripts/trim.sh data/smpA.fastq > res/smpA_trim.fastq
#> scripts/map.sh res/smpA_trim.fastq > res/smpA.bam
#> scripts/map.sh res/smpC_trim.fastq > res/smpC.bam
#> scripts/count.sh res/smpG.bam res/smpC.bam res/smpA.bam > res/count_table.txt
Currently, our map
and count
rules don’t include any genomic reference files: mapping should happen against a reference genome (FASTA file), and counting needs an annotation for the reference genome (GTF/GFF file).
For now, we will fix this oversight just for the mapping rule. Let’s create a dummy genome FASTA file:
mkdir metadata
echo "AAAAAAACCCCCCCGGGGGGG" > metadata/ref.fa
We’ll add a line to the map.sh
script, such that the second argument passed to it is now the FASTA file with the reference genome:
echo 'echo "Using ref genome: $2" && cat $2' >> scripts/map.sh
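After appending that line, scripts/map.sh should contain:
#!/bin/bash
set -e -u -o pipefail
echo "BAM from FASTQ file: $1"
cat "$1"
echo "Using ref genome: $2" && cat $2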
To make the workflow pass this second argument, you’ll need to do the following in the map rule:
Pass the reference FASTA file to the map.sh
script in the shell
directive.
While not strictly necessary to avoid errors, you should include the reference FASTA file in the input
directive.
Because you will now have multiple items for input
, it will be good to name them – the syntax for that is: input: myname=myfile
, and you can then recall this using {input.myname}
in the shell
directive.
Moreover, it is good practice to define hard-coded input files like this one clearly at the top of your script/Snakefile (or use a configuration file).
Assign the filename somewhere near the top:
REF_FA = "metadata/ref.fa"
Adjust rule map
:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} > {output}"
Snakemake will indeed rerun the rule map
and rule count
.
It will rerun rule map
because one of the input files, the reference FASTA, is newer than the output files (we just created it!).
Then, it will rerun rule count
because that depends on the output files of rule map
, which have just changed.
If we hadn’t just created the reference FASTA file, Snakemake would not have rerun: it will not rerun based on changes in the Snakefile code.
(However, it will keep track of code changes and you can force a rerun based on files affected by such changes using:
snakemake -j1 -R $(snakemake --list-code-changes)
)
To test this without actually rerunning anything, we can use the -n
flag to do a dry-run. Moreover, if we also use the -r
flag, Snakemake will tell us the reason for the execution of every job.
snakemake -j1 -nr
# [...]
[Mon Apr 5 21:14:35 2021]
rule map:
input: res/smpC_trim.fastq, metadata/ref.fa
output: res/smpC.bam
jobid: 2
reason: Updated input files: metadata/ref.fa # "REASON" GIVEN HERE
wildcards: smp=smpC
# [...]
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 1 count
#> 3 map
#> 5
#> scripts/map.sh res/smpA_trim.fastq metadata/ref.fa > res/smpA.bam
#> scripts/map.sh res/smpG_trim.fastq metadata/ref.fa > res/smpG.bam
#> scripts/map.sh res/smpC_trim.fastq metadata/ref.fa > res/smpC.bam
#> scripts/count.sh res/smpC.bam res/smpA.bam res/smpG.bam > res/count_table.txt
As you may recall, FastQC performs quality control of FASTQ files (see these slides from week 6). FastQC should be run on the raw and/or trimmed FASTQ files. If you want, you can implement both, but the solution below will do so just for the raw FASTQ files.
Create a script scripts/fastqc.sh containing:
#!/bin/bash
set -e -u -o pipefail
infile=$1
outdir=$2
outfile="$outdir"/$(basename "$infile" .fastq).fastqc.html
echo "Running FastQC for file: $infile , to output dir: $outdir"
echo "FastQC results for file $infile" > "$outfile"
cat "$infile" > "$outfile"
Make the script scripts/fastqc.sh executable:
chmod u+x scripts/fastqc.sh
Figure out what the script does and create your rule fastqc
accordingly.
As the output dir, you can just use the res
(results) dir we have been using, or optionally a subdir of that.
The script will write to an output file <output-dir>/<sample>.fastqc.html
. That is, you don’t specify the output file to the script, only the output directory. This is like the behavior of the actual FastQC program.
The script will also print a line to standard out (i.e., not to the output file).
In the Snakefile rule, you do want an output
directive, but you don’t reference {output}
in the shell
directive. Instead, you just provide the output dir as the second argument to the script.
rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    shell:
        "scripts/fastqc.sh {input} res"
What will happen now if you run Snakemake?
(Without providing it a rule or file to produce.)
Think about what the answer should be before you try.
It will not run anything:
snakemake -j1 -pr
#> Building DAG of jobs...
#> Nothing to be done.
This is because the FastQC output files are not being used by any other rule, nor are they listed in rule all
.
What will happen if you ask Snakemake to specifically run the FastQC rule?
Think about what the answer should be before you try.
You will get a WorkflowError
:
snakemake -j1 -pr fastqc
#> WorkflowError:
#> Target rules may not contain wildcards. Please specify concrete files or a rule without wildcards.
The target rule
referenced in the error is fastqc
, which we targeted in the command-line call.
The problem is that rule fastqc
contains wildcards and Snakemake can’t resolve what their values should be. This may be surprising but recall that Snakemake works backwards in the sense that it looks at output files first (rather than first inferring input files from existing files), and here, it does not know which output files to produce.
You could specify a specific output file for one of the samples, e.g. res/smpA.fastqc.html
:
snakemake -j1 -pq res/smpA.fastqc.html
#> Job counts:
#> count jobs
#> 1 fastqc
#> 1
#> scripts/fastqc.sh data/smpA.fastq res
#> Running FastQC for file: data/smpA.fastq , to output dir: res
Note above that the line the script printed to standard output (Running FastQC for file: $infile , to output dir: $outdir
) is indeed simply printed to standard out among the other output of Snakemake.
You should include the output files of the FastQC script in the input
of rule all
, using the expand()
function.
Add a line to rule all
so that it becomes:
rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES),
Rerun Snakemake:
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 2 fastqc
#> 3
#> scripts/fastqc.sh data/smpC.fastq res
#> Running FastQC for file: data/smpC.fastq , to output dir: res
#> scripts/fastqc.sh data/smpG.fastq res
#> Running FastQC for file: data/smpG.fastq , to output dir: res
Trigger an error in one of the shell scripts: break rule map by removing {input.ref} from the shell directive.
After you’ve run the workflow and inspected the error, undo the change to rule map you just made.
Remove the output files:
rm res/*
Rule map
should look as follows:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input.fastq} > {output}"
Run the workflow:
snakemake -j1 -pr
#> Error in rule map:
#> jobid: 2
#> output: res/smpG.bam
#> shell:
#> scripts/map.sh res/smpG_trim.fastq > res/smpG.bam
#> (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
#>
#> Removing output files of failed job map since they might be corrupted:
#> res/smpG.bam
#> scripts/map.sh: line 6: $2: unbound variable
#> [Fri Apr 2 18:33:26 2021]
#> Error in rule map:
#> jobid: 4
#> output: res/smpC.bam
#> shell:
#> scripts/map.sh res/smpC_trim.fastq > res/smpC.bam
#> (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
#>
#> Removing output files of failed job map since they might be corrupted:
#> res/smpC.bam
#> Shutting down, this might take some time.
#> Exiting because a job execution failed. Look above for error message
Undo the change:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} > {output}"
The actual error produced by the bash script is what you would likely be interested in if troubleshooting:
scripts/map.sh: line 6: $2: unbound variable.
However, this was buried among all the other Snakemake output. Earlier, we saw that a line of standard output was similarly printed to screen among the Snakemake output. And the same would be true for logging information and errors produced by programs like the actual FastQC.
This is not optimal when errors do occur – in the next section, we will therefore use log files.
Do you still have the output files from rule map?
Nope! These were removed by Snakemake because of the errors. You may have noticed this in the error messages above, where it said, for example:
#> Removing output files of failed job map since they might be corrupted:
#> res/smpG.bam
The log directive
It would be better to save logging and error output to “log” files for each rule – or each job within a rule, more precisely.
For example, we may be inclined to do the following for a script/program that:
Saves output data to file, and prints logging/errors to screen (standard out and/or standard error – recall that &>
redirects both):
"myscript.py -i {input} -o {output} &> log/myscript.log" shell:
Prints output data to standard out, and logging/errors to standard error (recall that 2>
redirects standard error):
"myscript.py -i {input} > {output} 2> log/myscript.log" shell:
But should we also tell Snakemake explicitly about such a file, like we do for input
and output
files? Yes, but it’s best to use a separate directive for this: log
.
This is convenient because Snakemake treats log
files differently than regular output files: if a job fails, Snakemake will not delete the log files. As we saw above, Snakemake does delete the regular output files of jobs that produced errors, so if we were to designate logging output as regular output, we would not be able to examine our errors.
Example usage of a log
key – note that we can use any wildcards that are also used in {output}
:
rule myscript:
    # [...]
    output: "res/{sample}.out"
    log: "log/myscript_{sample}.log"
    shell: "myscript.py -i {input} > {output} 2> {log}"
Add a log
directive to each rule and rerun Snakemake.
You’ll also have to redirect standard out and/or standard error (this will depend on the rule!) to the log file in each rule.
Use a different dir for logging output than the res
dir, e.g. log
, perhaps best with subdirs for each rule.
After you have run Snakemake, check that you now have a log file for each job – though most will be empty in the absence of errors; only the logs for the fastqc
rule should contain something.
SAMPLES = glob_wildcards("data/{smp}.fastq").smp
REF_FA = "metadata/ref.fa"


rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES),


rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"


rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"


rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"
rm res/*
snakemake -j1 -pq
#> Job counts:
#> count jobs
#> 1 all
#> 1 count
#> 3 fastqc
#> 3 map
#> 3 trim
#> 11
#> scripts/fastqc.sh data/smpG.fastq res &>log/fastqc/smpG.log
#> scripts/fastqc.sh data/smpA.fastq res &>log/fastqc/smpA.log
#> scripts/trim.sh data/smpG.fastq >res/smpG_trim.fastq 2>log/trim/smpG.log
#> scripts/trim.sh data/smpC.fastq >res/smpC_trim.fastq 2>log/trim/smpC.log
#> scripts/trim.sh data/smpA.fastq >res/smpA_trim.fastq 2>log/trim/smpA.log
#> scripts/fastqc.sh data/smpC.fastq res &>log/fastqc/smpC.log
#> scripts/map.sh res/smpA_trim.fastq metadata/ref.fa >res/smpA.bam 2>log/map/smpA.log
#> scripts/map.sh res/smpG_trim.fastq metadata/ref.fa >res/smpG.bam 2>log/map/smpG.log
#> scripts/map.sh res/smpC_trim.fastq metadata/ref.fa >res/smpC.bam 2>log/map/smpC.log
#> scripts/count.sh res/smpG.bam res/smpC.bam res/smpA.bam >res/count_table.txt 2>log/count/count.log
tree log
#> log
#> ├── count
#> │   └── count.log
#> ├── fastqc
#> │   ├── smpA.log
#> │   ├── smpC.log
#> │   └── smpG.log
#> ├── map
#> │   ├── smpA.log
#> │   ├── smpC.log
#> │   └── smpG.log
#> └── trim
#>     ├── smpA.log
#>     ├── smpC.log
#>     └── smpG.log
Only the fastqc log files should have nonzero file sizes, which you can quickly check by running ls -lhR log.
Since we are running dummy scripts, it has been fine to run Snakemake on a login node at OSC. For any real workflow, this would not work.
You have two main options:
For small workflows that can be run on one node or less than that, you can run Snakemake as a single SLURM job, either on an interactive node (especially if the workflow is particularly small) or by submitting a script with sbatch
.
Snakemake can orchestrate SLURM job submissions for you, in which case every Snakemake job (every instance of every rule) will be submitted as a separate job.
Jobs that depend on others will only be allowed to run when those other jobs have successfully finished, so you can submit a very complex and potentially long-running workflow with a single Snakemake call!
Because the main Snakemake process will keep running as long as one of its constituent jobs is, you will still need to run Snakemake itself as a SLURM job too. Even though this process may not take substantial computing resources, it may have to run for long time – and recall that at OSC, any process that runs for more than 20 minutes on a login node is killed.
Write a Bash script with SLURM instructions to run Snakemake.
The job should request 1 hour, 1 task, 1 core, should be billed to the classroom project PAS1855
, and it should run our current Snakefile.
Save the script as workflow/snakemake_submit.sh
.
#!/bin/bash
#SBATCH --account=PAS1855
#SBATCH --time=60
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
set -e -u -o pipefail
snakemake -j1 -p
So far, we have been using -j1
(i.e. --jobs 1
), such that Snakemake will only run one process at a time. Several of our rules run the same script separately for each sample, so those jobs could be run at the same time (in parallel).
Modify the bash script to parallelize the workflow.
How many simultaneous jobs (processes) would it make sense to run at the same time?
Which SLURM directive corresponds to Snakemake’s -j
flag: --ntasks
or --cpus-per-task
?
How could you make sure there is no mismatch between the resources requested from SLURM and the number of jobs provided to Snakemake?
For the last question: you should use a SLURM environment variable. The environment variable in question is $SLURM_NTASKS.
It would make sense to run 6 jobs at the same time:
Our workflow currently runs 3 samples, with several rules therefore creating 3 separate jobs that could be run in parallel.
Because rule fastqc and rule trim both operate on the raw FASTQ files independently, their jobs can also be run in parallel with each other.
Note that you could easily forego this assessment at OSC and just set a maximum that you think is fair, like 100 jobs – but it is a useful exercise to think about this all the same.
The --ntasks
directive effectively corresponds to Snakemake’s jobs (-j
):
each SLURM “task” is a separate process.
If we instead increased --cpus-per-task, we would just allocate more CPUs/cores/threads to each individual process, but we want to run multiple processes in parallel.
#!/bin/bash
#SBATCH --account=PAS1855
#SBATCH --time=1:00:00
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --cpus-per-task=1
set -e -u -o pipefail
snakemake -j"$SLURM_NTASKS" -p
Submit the script and check the SLURM log file (slurm-<job-id>.out) to see if it worked:
rm -r res log/*
sbatch workflow/snakemake_submit.sh
At its most basic, all you need to do is add --cluster sbatch
to the Snakemake command, and Snakemake will automatically submit a SLURM job for each of its jobs. But recall that at OSC, we also always need to specify an account.
So, we could use the following to have Snakemake submit at most 6 jobs at a time:
snakemake -j6 --cluster "sbatch --account=PAS1855"
For simple and quick workflows, providing SLURM options on the command line like we did above is reasonable, but if we have several non-default options, this will start to become a painfully long command.
For instance, you may note that the SLURM log files have currently all ended up in your project’s root dir, where you probably don’t want them.
We can also specify these options in a configuration file using a “profile”. This will additionally make it easier to customize SLURM options on a per-rule basis.
To use a profile, we should first create a directory. The name of this directory is arbitrary, but since we are using it in the context of providing settings for SLURM jobs, something like slurm_profile
would make sense:
mkdir slurm_profile
Inside this directory, Snakemake will expect a file called config.yaml
:
touch slurm_profile/config.yaml
In config.yaml
, we can provide a string for the cluster
key, much like we did previously with the --cluster
argument on the command-line:
cluster: "sbatch --account={resources.account}
--mem={resources.mem_mb}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [mem_mb=4000, account=PAS1855]
Above, we use {resources.<resource-name>}
instead of actual values in the cluster
key. Then, the values for each <resource-name>
are specified for the default-resources
key. This setup is convenient because it allows us to also refer to the same resources in the Snakefile to set rule-specific values, as we’ll see a little later.
Create the profile directory and config file as shown above, and additionally specify requests for time (in minutes) and number of CPUs.
Note that these resource requests are now on a per-job basis, not for the whole workflow combined.
Contents of slurm_profile/config.yaml
:
cluster: "sbatch --account={resources.account}
--time={resources.time_min}
--mem={resources.mem_mb}
--cpus-per-task={resources.cpus}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]
To run Snakemake with the profile, we need to use the option --profile
, and specify our profile dir: --profile <profile_dir>
.
rm -r res log/*
snakemake -j6 --profile slurm_profile
Above, we also specified an output dir and names for our SLURM log files. If this directory does not already exist, make sure you create it manually, because Snakemake will not create it, and mysterious failures will occur!
So, Snakemake will create directories specified in the Snakefile as output, but won’t create directories that are only mentioned in this config file.
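For example, with the profile above you could create the dir for the SLURM log files up front:
# Create the dir used for --output in the profile, if it doesn’t exist yet:
mkdir -p log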
config.yaml
can contain not just cluster settings, but anything that can be set with command-line options.
We can take that opportunity to also:
Specify the number of jobs (--jobs
option).
Make Snakemake wait longer for output files to appear using the latency-wait
option: units are in seconds and we want 30 seconds.
(I’ve had errors with shorter latency times.)
Add key-value pairs for the number of jobs and the latency to the config file.
Note that in YAML, key-value pairs are specified as key:value
, not --key value
.
jobs: 100
latency-wait: 30
The full slurm_profile/config.yaml
should now contain the following:
cluster: "sbatch --account={resources.account}
--time={resources.time_min}
--mem={resources.mem_mb}
--cpus-per-task={resources.cpus}
--output=log/slurm-{rule}_{wildcards}.out"
default-resources: [cpus=1, mem_mb=1000, time_min=5, account=PAS1855]
jobs: 100
latency-wait: 30
Run Snakemake with the updated profile.
Recall that now, you don’t have to specify -j
/ --jobs
at the command line.
Pay attention to the job submission pattern of Snakemake. You can also open up another terminal and monitor your jobs:
squeue -u $USER -l
rm -r res log/*
snakemake --profile slurm_profile
In practice, you may not want to submit a cluster job for every rule.
For instance, rule all
is a Snakemake job, but it doesn’t actually run anything the way we have set it up. Additionally, you may have some very lightweight cleaning/logging rules.
To tell Snakemake that certain rules should not be submitted to the cluster, include a comma-separated list of rules near the top of your Snakefile with the localrules
key:
localrules: all, clean
Add a localrules directive for rule all to your Snakefile, and run Snakemake again.
Your full Snakefile should now contain the following:
SAMPLES = glob_wildcards("data/{smp}.fastq").smp
REF_FA = "metadata/ref.fa"

localrules: all


rule all:
    input:
        "res/count_table.txt",
        expand("res/{smp}.fastqc.html", smp=SAMPLES),


rule trim:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}_trim.fastq",
    log:
        "log/trim/{smp}.log",
    shell:
        "scripts/trim.sh {input} >{output} 2>{log}"


rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"


rule count:
    input:
        expand("res/{smp}.bam", smp=SAMPLES),
    output:
        "res/count_table.txt",
    log:
        "log/count/count.log",
    shell:
        "scripts/count.sh {input} >{output} 2>{log}"


rule fastqc:
    input:
        "data/{smp}.fastq",
    output:
        "res/{smp}.fastqc.html",
    log:
        "log/fastqc/{smp}.log",
    shell:
        "scripts/fastqc.sh {input} res &>{log}"
rm -r res log/*
snakemake --profile slurm_profile
Because the number of threads is a parameter that commonly differs between rules, Snakemake has a threads
directive that we can add to any rule. This will tell Snakemake how many threads it should use for any single job that the rule in question generates.
Which SLURM directive do you think this corresponds to? It is either --cpus-per-task or --ntasks.
The threads
directive in a Snakefile corresponds to the --cpus-per-task
SLURM directive. (Recall that SLURM uses the terms CPU, cores and threads roughly equivalently.)
When specifying multiple --cpus-per-task
(SLURM) / threads
(Snakemake), you are allowing a single process (task) to perform multithreading.
When using the threads directive, we will generally also tell the program itself about the number of threads – see the example below, where we are calling the STAR aligner directly in a Snakemake rule:
rule STAR:
    threads: 8
    shell:
        "STAR --runThreadN {threads} ..."
For now, we’ll keep it simple and merely use the threads directive:
Add a threads directive to rule map.
Rule map should now contain the following:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    threads: 4
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"
We can use a resources
key for any rule to specify (mostly) arbitrary resource key-value pairs:
rule heavy_stuff:
    # [...]
    resources: mem_mb=50000
    # [...]
These are arbitrary in the sense that mem_mb
will not directly set the actual maximum memory usage in the way that the threads
directive did set the number of threads. Instead, they refer to the keys we use in our config.yaml
.
For example, to actually request 50,000 MB of memory for rule heavy_stuff
, mem_mb
should correspond to the term used in the SLURM resource request in config.yaml
:
cluster: "sbatch --mem={resources.mem_mb} ..."
default-resources: [mem_mb=4000, ...]
Setting resources: mem_mb=50000
for rule heavy_stuff
will then override the default value of 4000
, and pass that on to the SLURM job request!
Set the time limit for rule map to 2 hours.
Assuming time is referred to as time_min
in slurm_profile/config.yaml
(this has to match!), rule map
should now contain the following:
rule map:
    input:
        fastq="res/{smp}_trim.fastq",
        ref=REF_FA,
    output:
        "res/{smp}.bam",
    threads: 8
    resources: time_min=120
    log:
        "log/map/{smp}.log",
    shell:
        "scripts/map.sh {input.fastq} {input.ref} >{output} 2>{log}"
To check whether these resource requests were actually applied to your SLURM jobs, there are a few different options:
You can look up SLURM’s “input environment variables” (see the sbatch documentation) and echo them in your script. For the time limit specifically, this variable is $SBATCH_TIMELIMIT, so you could include the following in your scripts/map.sh:
echo "Time limit: $SBATCH_TIMELIMIT"
You can print an overview of resources and statistics for your job using scontrol show job <job-ID>. You can do that manually in the shell for a specific SLURM job, or include the following in your script to have this information included in the SLURM log file:
scontrol show job $SLURM_JOB_ID
The allotted time specifically is also shown when you issue squeue
with the -l
(long) option:
squeue -u $USER -l
Note that it doesn’t seem to be possible to install this extension in VS Code Server at OSC, but you can install it locally.↩︎
This may look a little strange (but think 1-element tuple!), yet is somewhat useful: when every line has a trailing comma, you can always add and remove lines without paying attention to whether a new item follows.↩︎