class: inverse middle center

# *Week 13 – <br> Reproducible workflows with Snakemake*

----

# I: Workflows and workflow tools

<br> <br> <br> <br> <br>

### Jelmer Poelstra

### 2021/04/06

---

## Overview of this week

- I: Intro to workflows and workflow tools — today

- II: Getting started with Snakemake — today/Thu

- III: More advanced Snakemake tips and tricks — Thu

<br>

Among other things, the **exercises** will walk you through some more
Snakemake aspects, such as running Snakemake on a cluster.

---

## What do we mean by a "pipeline" or "workflow"?

<br>

<p align="center">
<img src=img/schem-workflow.svg width=75%>
</p>

---

## An example partial workflow script

```sh
samples=(smpA smpC smpG)

# Trim:
for sample in ${samples[@]}; do
    scripts/trim.sh data/"$sample".fq > res/"$sample"_trim.fq
done

# Map:
for sample in ${samples[@]}; do
    scripts/map.sh res/"$sample"_trim.fq > res/"$sample".bam
done

# Count:
scripts/count.sh ${samples[@]} > res/count_table.txt
```

---

## An example partial workflow script

With a single `for` loop:

```sh
samples=(smpA smpC smpG)

# Trim and map:
for sample in ${samples[@]}; do
    scripts/trim.sh data/"$sample".fq > res/"$sample"_trim.fq
    scripts/map.sh res/"$sample"_trim.fq > res/"$sample".bam
done

# Count:
scripts/count.sh ${samples[@]} > res/count_table.txt
```

---

## Why create a pipeline / workflow?

What are the advantages of creating such a pipeline or workflow,
rather than running scripts one-by-one as needed?

<br>

- **Rerunning** everything is much easier – you may want to do so after
  adding samples, changing a script, changing settings, and so on.

- **Re-applying** the same set of analyses in a different project is much easier.

<br>

--

- The workflow is a form of **documentation** of the steps taken.

- It makes sure you are **including all necessary steps** and are not
  overlooking a script or a manual step.

- **Reproducibility** is generally improved: e.g., it will be easier
  for others to repeat your analysis.

---

## Challenges with workflows: <br> You need to rerun parts of it

**Rerunning *part* of the workflow may be necessary**, e.g. after:

- Some scripts failed.

- Some scripts failed for some samples only.

- You added a sample.

- You changed a script or settings somewhere halfway through the workflow.
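---

## Challenges with workflows: <br> You need to rerun parts of it

Doing this by hand means tracking, for every step and every sample, whether
its output is still up to date. As a rough illustration – a hypothetical
sketch, not one of this deck's scripts – here is what make-style timestamp
bookkeeping would look like for the trimming step alone:

```sh
samples=(smpA smpC smpG)

for sample in ${samples[@]}; do
    raw=data/"$sample".fq
    trimmed=res/"$sample"_trim.fq
    # Only re-trim when the output is missing or older than the input:
    if [ ! -e "$trimmed" ] || [ "$raw" -nt "$trimmed" ]; then
        scripts/trim.sh "$raw" > "$trimmed"
    fi
done
```

Multiply this by every step in the workflow and it quickly becomes
unmanageable – the next slides show some common shortcuts.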
---

## Challenges with workflows: <br> You need to rerun parts of it

**Ad-hoc strategies to solve some of these:**

- Comment out part of the workflow – to skip a step:

```sh
for sample in ${samples[@]}; do
    #scripts/trim.sh data/"$sample".fq > res/"$sample"_trim.fq
    scripts/map.sh res/"$sample"_trim.fq > res/"$sample".bam
done

scripts/count.sh ${samples[@]} > res/count_table.txt
```

--

- Make temporary changes – to run a single new sample:

```sh
samples=(smpA smpC smpG smpT)

#for sample in ${samples[@]}; do
sample=smpT       # TEMPORARY LINE
    scripts/trim.sh data/"$sample".fq > res/"$sample"_trim.fq
    scripts/map.sh res/"$sample"_trim.fq > res/"$sample".bam
#done

scripts/count.sh ${samples[@]} > res/count_table.txt
```

---

## Challenges with workflows: <br> You need to rerun parts of it

**Labor-intensive strategy to solve some of these:**

Command-line options and `if`-statements to enable flexibly running
parts of the workflow (and perhaps changing settings):

```sh
trim=$1    # true or false
map=$2     # true or false
count=$3   # true or false

for sample in ${samples[@]}; do
    if [ "$trim" = true ]; then
        scripts/trim.sh data/"$sample".fq > res/"$sample"_trim.fq
    fi
    if [ "$map" = true ]; then
        scripts/map.sh res/"$sample"_trim.fq > res/"$sample".bam
    fi
done

if [ "$count" = true ]; then
    scripts/count.sh ${samples[@]} > res/count_table.txt
fi
```

---

## Challenges with workflows: <br> managing cluster jobs

Say we want to speed things up and submit a SLURM job for each script:

```sh
for sample in ${samples[@]}; do
    sbatch scripts/trim.sh data/"$sample".fq > res/"$sample"_trim.fq
    sbatch scripts/map.sh res/"$sample"_trim.fq > res/"$sample".bam
done

sbatch scripts/count.sh ${samples[@]} > res/count_table.txt
```

.content-box-q[
What is the problem here?
]

--

.content-box-answer[
Steps that need output files from prior steps won't wait for those prior steps!
]

For any workflow in which some steps operate on each sample separately:
**How do you let the steps properly wait for each other?**

---

## Challenges with workflows: <br> managing cluster jobs

**Ad-hoc strategies:**

- Go back to the original script and **run the whole workflow as a single job,**
  in which per-sample steps are simply run 1-by-1 in a loop.
  This will be too time-consuming in many cases...

- Rather than running the workflow script as an actual script,
  just **use it as scaffolding** and submit jobs one-by-one / group-by-group.

--

**Labor-intensive strategy:**

- Use SLURM job dependencies – but these are hard to manage for more complex workflows:

```sh
for sample in ${samples[@]}; do
    line=$(sbatch scripts/trim.sh)
    trim_id=$(echo $line | sed 's/Submitted batch job //')
    sbatch --dependency=afterok:$trim_id scripts/map.sh
done

sbatch --dependency=singleton scripts/count.sh
```

---

## Challenges with workflows: <br> monitor and manage script failure

**How do you monitor job failure, and automatically stop the workflow
upon job failure?**

<br>

**Ad-hoc strategies (same as earlier):**

- Go back to the original script and **run the whole workflow as a single job**:
  proper Bash settings will at least make the workflow stop upon failure.

- Rather than running the workflow script as an actual script,
  just **use it as scaffolding** and submit jobs one-by-one / group-by-group.

<br>

--

**Labor-intensive strategy:**

- Add lots of checks yourself: test for proper input to, and the exit status of,
  each script, etc. (see the sketch on the next slide).
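---

## Challenges with workflows: <br> monitor and manage script failure

For reference, a minimal sketch – illustrative only – of what "proper Bash
settings" plus an explicit per-step exit-status check could look like:

```sh
#!/bin/bash
# Stop on any command failure, on unset variables,
# and on a failure anywhere in a pipeline:
set -euo pipefail

# Explicitly test one step's exit status, and report before stopping:
if ! scripts/trim.sh data/smpA.fq > res/smpA_trim.fq; then
    echo "Trimming failed for smpA" >&2
    exit 1
fi
```

Even then, this only stops the *submitting* script: it does not monitor
jobs that were already handed off to the scheduler.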
---

## The need for specialized workflow tools

<p align="center">
<img src=img/nature-feature.png width=100%>
</p>

----

--

> Typically, researchers codify workflows using general scripting languages
> such as Python or Bash. But these often lack the necessary flexibility.

> Workflows can involve hundreds to thousands of data files; a pipeline must
> be able to monitor their progress and exit gracefully if any step fails.
> And pipelines must be smart enough to work out which tasks need to be
> re-executed and which do not.

---

## The need for specialized workflow tools?

<p align="center">
<img src=img/nature-feature.png width=100%>
</p>

<br>

> Bioinformatician Davis McCarthy at St Vincent’s Institute of Medical Research
> in Fitzroy, Australia, says Python and R were more than enough for the
> relatively simple workflows he used as a PhD student.

---

## Workflow management systems

Workflow tools, often called "workflow management systems",
provide ways to **formally describe and execute workflows.**

<br>

Some of the most commonly used command-line-based options for working with
biological sequence data are:

- Snakemake

- Nextflow

- Common Workflow Language (CWL)

- Workflow Description Language (WDL)

<br>

--

While there is a bit of a learning curve for all of these, they solve the
challenges discussed above and offer several other benefits.

For many workflows, they will eventually be huge time-savers, both in terms of
the time spent composing the workflow and the running time of the workflow.

---

## Advantages of formal workflow management – <br> all the buzzwords...

- **Reproducibility and transparency**
  - Document dependencies among data and among code.
  - Easy visualization of the workflow graph.
  - Integration with software management.

- **Automation**
  - Detect & rerun upon changes in input files and failed steps.
  - Rerun for other datasets.
  - Automate SLURM job submissions.

- **Portability**
  - The workflow can easily be rerun on a laptop, an HPC system, or in the
    cloud, with or without containers.

---

## Advantages of formal workflow management (cont.)

- **Scalability**
  - Efficient execution – all dependencies are known,
    so cluster submission can be optimized.

<br>

- **Flexibility** due to a separation of:
  - The generic nuts-and-bolts of the workflow.
  - Run-specific configuration – samples, directories, options.
  - Things specific to the runtime environment (laptop vs. cluster vs. cloud).

<br>

--

- **Convenience**
  - Reduces boilerplate code: no need to loop over samples yourself, etc.
  - Some of the points mentioned above, such as automated job submission
    and integrated software management.

---

class: center middle inverse

# Questions?

-----

<br> <br> <br> <br>

---

<figure>
<p align="center">
<img src=img/giaa140fig1.jpeg width="90%">
<figcaption>From <a href="https://academic.oup.com/gigascience/article/10/1/giaa140/6092773/">Reiter et al. 2021: Streamlining data-intensive biology with workflow systems</a></figcaption>
</p>
</figure>

---

## Advantages of formal workflow management

> The essence of what we're trying to do:
>
> - Automate a workflow
> - Document the workflow
> - Document the dependencies among data files, code
> - Re-run only the necessary code, based on what has changed
>
> — [Karl Broman](http://kbroman.org/Tools4RR/assets/lectures/01_intro.pdf)
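---

## A first taste of Snakemake

As a preview of part II – a minimal, hypothetical sketch, not the full
syntax – here is the deck's trimming step written as a Snakemake rule.
Snakemake uses the declared inputs and outputs to work out what needs
to (re)run, and in what order:

```python
# Snakefile (sketch): one rule per workflow step
rule trim:
    input:
        "data/{sample}.fq"
    output:
        "res/{sample}_trim.fq"
    shell:
        "scripts/trim.sh {input} > {output}"
```

Asking for a result file, e.g. `snakemake --cores 1 res/smpA_trim.fq`,
makes Snakemake run only the steps needed to produce that file.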