class:inverse middle center # *Week 13 – <br> Reproducible workflows with Snakemake* ---- # III: More advanced Snakemake tips and tricks <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/04/08 --- class:inverse middle center # Overview ---- .left[ - #### [Using Conda environments](#conda) - #### [Useful command-line options](#cli) - #### [What will Snakemake run? – revisited](#run) - #### [Configuration files](#config) - #### [Referring to wildcards in actions](#wildcards) - #### [The params directive](#params) - #### [Miscellaneous](#misc) ] --- name: conda class: middle inverse center # Using Conda environments ---- <br> <br> <br> <br> --- ## Using Conda environments Snakemake plays well with Conda and with Singularity containers, and using such software environments can help a lot with making your workflow functional and reproducible across different hardware environments. While containers are even better at this, Conda is more lightweight and will suffice for most purposes. -- Snakemake will in fact **create the necessary Conda environments for us** – all we need to do is: - Include a `conda` key in the focal rule, where we specify a YAML file with instructions for the environment: ```python rule fastqc: conda: "envs/fastqc.yml" # ...other keys... ``` - Include the `--use-conda` option in our Snakemake call: ```sh $ snakemake -j1 --use-conda ``` --- ## Using Conda environments (cont.) Recall that we can create such YAML files from existing Conda environments using `conda env export`, but it is also very easy to write a file from scratch. For instance, for a specific version of FastQC, and specifying `bioconda` as the channel, the full YAML file would look like this:<sup>[1]</sup> ```sh name: fastqc-env channels: - bioconda dependencies: - fastqc=0.11.8 ``` .footnote[ <sup>[1]</sup> For use with Snakemake, the `name` key can also be omitted. ] --- ## Using Conda environments (cont.) Let's try running Snakemake with a Conda environment defined in a YAML file. ```sh sinteractive -A PAS1855 module load python/3.6-conda5.2 source activate ipy-env # Create and move to a new dir to do this: mkdir -p /fs/ess/PAS1855/users/$USER/week14/conda_test cd /fs/ess/PAS1855/users/$USER/week14/conda_test # Create the workflow's dirs and files: mkdir -p data workflow/envs touch workflow/Snakefile workflow/envs/fastqc.yml ``` --- ## Using Conda environments (cont.) We'll run FastQC for real and to do, we'll use an actual FASTQ file that we have seen before: ```sh cp /fs/ess/PAS1855/data/fastq_16S/201*R1*fastq data/test.fastq tree #> . #> ├── data #> │ └── test.fastq #> └── workflow #> ├── envs #> │ └── fastqc.yml #> └── Snakefile ls -lh data/test.fastq #> -rw-r--r-- 1 jelmer PAS0471 31M Apr 15 14:41 data/test.fastq head data/test.fastq #> @M01936:14:000000000-CPB4V:1:1101:21452:1673 1:N:0:CTCTTTAC+CAGGAGTA #> CGAGCAATCCACTCGAGTGTCAGCCGCCGCGGTAATACGGAGGGTGCAAGCGTTATCCGGATTCACTTGGTTTAAAGTGTGCGTAGTCGGACATGTAATTCTGTGTTGAAATCTCTGGGCTTAACCCAGCAACTGCCGTTGATACTATTTTTCTTTAATATCCTGGAGTTAGGCTGAATATTTCATTTAGCGGTGAAATGCTTATATATGACATAGAACACCGATAGCGAAGGCAGCTTACTACAGGATGATTGACGCTGATGCACGAAAGCGTTGGGAGCAAACAGGATTAGAAACC #> + #> @89<CECFEE@FFE<,,C,CE9-@C+@@+8++@+,C,C,,,+++8,6,,,9,8@,C<@,+,8@C,99,,,::C,,,,,,<,6,8,+,<+++8+,,<,,,<A,,<,,:,,,,<A@?E,,,4BE,,:?=+,,,,:B,5?+BC++<,8C,:CD,BBFD,,,C,@@@C,,,,,:,,,5,,,,@,@,@3,@,8,,3**6**,,7,5DF,,,6,@,,7,@,,,,4,6=**4**,*****3**3<C7<>9<++**5++4=++1*/:**++2*2**)*)0)9))*))*1)*)0)*))0=*****0. ``` --- ## Using Conda environments (cont.) Copy this into `workflow/envs/fastqc.yml`: ```ssh channels: - bioconda dependencies: - fastqc=0.11.8 ``` Copy this into `workflow/Snakefile`: ```sh rule fastqc: conda: "envs/fastqc.yml" input: "data/test.fastq", output: "res/test_fastqc.html", shell: "fastqc {input} -o res" ``` Run Snakemake: ```sh # Check first with -n (dry run): snakemake -j1 -n --use-conda res/test_fastqc.html # Looks good? Run: snakemake -j1 --use-conda res/test_fastqc.html ``` --- name: cli class: inverse middle center # Useful command-line options ---- <br> <br> <br> <br> --- ## Useful command-line options: dry run (-n) With the `-n` (long form: `--dryrun`) option, Snakemake will not run any of the actions specified, but it will check which rules/jobs it *would* be running. -- <br> This can be extremely useful to: - Check whether the workflow will run the way you think it will run (remember: relationships between rules are not explicitly defined...). - Check if anything needs to be done: you may not remember which output files are already there, or if an input file has been updated, and this is the easiest way to check! - Check if you have no errors in your Snakefile (e.g. syntax errors, wildcards that Snakemake can't figure out). <br> ```sh snakemake -j1 -n ``` --- ## Useful command-line options: <br> reason (-r) and print shell commands (-p) We have already seen these two options: - `-r` (long form: `--reason`) to let Snakemake tell you *why* it thinks what it should run. - `-p` (long form: `--printshellcmds`) to print the shell commands (action key `shell`) that are being run. <br> *These are very useful to also use in combination with the `-n`/`--dryrun` option, for troubleshooting, and for getting a better intuition for how Snakemake works.* <br> ```sh $ snakemake -j1 -npr ``` --- ## Useful command-line options: Miscellaneous **Some other options:** Short | Long | Explanation -------|------------------|------------- - | `--lint` | Run the Snakemake "linter" on the Snakefile `-f` | `--force <target>` | Force creation of a specified target file, or the first rule. `-F` | `--forceall <rule>` | Force running a specified rule and all dependencies. `-R` | `--forcerun` | Force creation of a list of target files. - | `--report` | Create an HTML report with runtime statistics, workflow topology, and optionally results ([documentation](https://snakemake.readthedocs.io/en/stable/snakefiles/reporting.html)). - | `--archive` <span style="color:#e8e8e8">`_________________`</span> | Create a self-contained workflow archive that could be re-executed on any machine with Snakemake and Conda. --- ## Useful command-line options: Miscellaneous **Overview of options we have seen so far:** Short | Long | Explanation -------|------------------|------------- `-q` | `--quiet` | Less output, can be useful with `-n` just to get overview of jobs that will be run. `-p` | `--printshellcmds` | Print `shell` commands that will be executed. `-r` | `--reason` | Give reason of execution for every job. `-n` | `--dryrun` | Don't run anything, just report what *would* be run. `-j` | `--job` | Max. number of jobs to run simultaneously. `-s` | `--snakefile` <span style="color:#e8e8e8">`_________________`</span> | Name of / path to the Snakefile. --- name: run class: middle inverse center # What will Snakemake run? – Revisited ---- <br> <br> <br> <br> --- ## What will Snakemake run? - **Snakemake will create an output file, and all its dependencies, if:** - The output file is a command-line target and does not exist. - The output file is needed by another executed job and does not exist. - The output file is older than the input file. - The input file will be updated by another job. - Execution is enforced with `-f` / `-F` / `-R`. --- ## What will Snakemake run? (cont.) - **Snakemake will not create output files *even if*:** - Code in the rule (or elsewhere in the Snakefile) has changed. - Additional input files have been added *and* input files are e.g. listed at the top of the file or in a config file. <br><br> (With `glob_wildcards()`, new files *will* be automatically run.) --- ## Forcing a rerun after changes in the Snakefile To force a rerun, we could use "brute force" with `Snakemake -j1 -F`, which would rerun the entire workflow. **In real workflows, you probably only want to rerun whatever has been affected by the change in code.** You can do this via the **`--list-code-changes`** option. Running Snakemake with that option will print a list of the output files that have been affected by changes in the Snakefile since Snakemake was last run: ```sh $ snakemake --list-code-changes #> Building DAG of jobs... #> res/smpG.bam #> res/smpC.bam #> res/smpA.bam ``` -- Then, we can use the `-R` option with the list of output files from `--list-code-changes` to force a rerun for those files. We can do this all at once, using a shell command substitution, as follows: ```sh $ snakemake -j1 -pq -R $(snakemake --list-code-changes) ``` --- name: config class: middle inverse center # Configuration files ---- <br> <br> <br> <br> --- ## Configuration files *Configuration files* can be a good place to store settings for your workflow that you want to be able to vary easily and transparently, or that you just want to state very clearly. In your Snakefile, you can include a `configfile` directive (for the entire workflow, not for individual rules) that points to a YAML or JSON file with these configurations. ```python configfile: "config.yml" ``` <br> -- Then, all key-value pairs from the config file will be **loaded into a dictionary called `config`** that you can call in your Snakefile – for example: ```sh # CONFIG FILE: min_qual: 30 ``` ```python # SNAKEFILE: config["min_qual"] # Get the value for `min_qual` ``` --- ## Example usage of a configuration file A config file called `config.yaml` contains the following: ```sh output_dir: path/to/output/ min_qual: 30 ``` In the Snakefile, we access the values in the config files as follows: ```python # [...] configfile: "config.yaml" OUT_DIR=config["output_dir"] # [...] rule filter_bam: input: ... output: os.path.join(OUT_DIR, "{sample}_{read}.fastqc.html") shell: 'filter_bam.sh -q config["min_qual"] {input} > {output}' ``` Note that we assigned `output_dir` to a variable in the Snakefile, and used `min_qual` directly – either setup is possible. --- name: wildcards class: middle inverse center # Referring to wildcards in actions ---- <br> <br> <br> <br> --- ## Referring to wildcards in actions Sometimes you need to **refer to a sample ID in your action** rather than an input file name – say our script needs an input dir and sample ID like so: ```sh scripts/trim.sh -d <input_dir> -s <sample_id> ``` Our rule currently looks like this: ```python rule trim: input: "data/{smp}.fastq", output: "res/{smp}_trim.fastq", shell: "scripts/trim.sh {input} > {output}" ``` **How can we change the `shell` directive to use the sample ID wildcard?** --- ## Referring to wildcards in actions (cont.) **How can we change the `shell` directive to use the sample ID wildcard?** ```python # THIS WON'T WORK: # shell: "scripts/trim.sh -d data -s {smp} > {output}" ``` ```python # YES -- use "{wildcards.<wildcard-name>}": shell: "scripts/trim.sh -d data -s {wildcards.smp} > {output}" ``` Why do we need `{wildcards.smp}`? Why can't we just use `{smp}` akin to how we refer to `{input}`? -- Even though both are variable-like references with `{}`, **only `{smp}` is a wildcard**: its value has not been explicitly defined, and is left for Snakemake to figure out. Snakemake figures out what the values of wilcards should be *only* in `{input}` and `{output}` directives. If we want to refer to them elsewhere, we need to use `{wildcards.<wildcard-name>}`. --- name: params class: middle inverse center # The params directive ---- <br> <br> <br> <br> --- ## The params directive In our modified `rule trim` in the previous slide, we ended up with the following `shell` entry: ```python shell: "scripts/trim.sh -d data -s {wildcards.smp} > {output}" ``` Notice that the directory is directly specified in `-d res`. It may be better to define a setting like this output directory in a more obvious way, and we could use the `params` directive for that. For instance: ```python params: indir="data" shell: "scripts/trim.sh -d {params.indir} -s {wildcards.smp} > {output}" ``` --- ## The params directive (cont.) Additionally, we can directly access wildcards in the `params` directive, so we could have **sample-specific settings**. For example, see [this example from the official Snakemake tutorial](https://snakemake.readthedocs.io/en/stable/tutorial/advanced.html#step-4-rule-parameters), which passes read group information to `bwa mem`: ```python rule bwa_map: input: "data/genome.fa", lambda wildcards: config["samples"][wildcards.sample] output: "mapped_reads/{sample}.bam" params: rg=r"@RG\tID:{sample}\tSM:{sample}" threads: 8 shell: "bwa mem -R '{params.rg}' -t {threads} {input} | samtools view -Sb - > {output}" ``` --- name: misc class: middle inverse center # Miscellaneous ---- <br> <br> <br> <br> --- ## Miscellaneous ### Call separate Snakefiles for parts of the workflow If your workflow is pretty big, it can be useful to split it up across multiple Snakefiles. You can make rules in another Snakefile available as follows: ```python include rules/qc_rules.smk ``` ### Mark files as protected (no write permissions) ```python output: protected("sorted_reads/{sample}.bam") ``` ### Mark files as temporary (to be deleted) ```python output: temp("mapped/{sample}.bam") ``` --- ## Miscellaneous (cont.) ### Use a "token" file if a rule has no unique output For example, the rule only *modifies* an existing file. Then, `touch` the token file in the rule's action: ```python rule token_example: input: 'file.txt' output: 'token_file' shell: "my_command {input} && touch {output}" ``` --- ## Miscellaneous (cont.) ### Visualize rules, not jobs .pull-left[ ```sh $ snakemake --rulegraph | \ dot -Tsvg > rules.svg ``` <p align="center"> <img src=img/rules.svg width=22%> </p> ] .pull-right[ ```sh $ snakemake --dag | \ dot -T svg > jobs.svg ``` <p align="center"> <img src=img/v7.svg width=90%> </p> ]