class:inverse middle center # *Week 14 – <br> Putting it all together & next steps* ---- <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/04/13 --- class:inverse middle center # I: Putting it all together ---- <br> <br> <br> --- ## What did we cover? .pull-left[ - **Shell** - Basics - Data tools - Scripting - OSC jobs with SLURM ] .pull-right[ - **Python** - Basics - Coding approaches - Regular expressions - NumPy and Pandas - Biopython ] -- - **Project organization and management** - Formatted plain text files with **Markdown** - Version control with **Git** - Reproducible workflows using **Snakemake** - Along the way, you have gotten to know the **VS Code** editor quite well --- ## What is the big picture? Core goal of the course: .content-box-green[ **Learning strategies and computational skills to do your research more reproducibly and efficiently** ] --- ## Some take-home messages Apply these **strategies and skills**: - Use one directory per project with a clear and consistent sub-dir structure and ditto file naming. - Treat raw data as read-only. - Document everything you do in plain text README files. - Always use version control. - Share your code (using GitHub). --- ## Some take-home messages - Do as much as possible using **code**: - Use code even for tasks you could in some cases do as/more quickly using GUIs – downloads, renaming and re-organizing, systematic literature searches, etc. - "Use code as documentation." - Use shell tools for "simple" problems (consisting of one or a few steps), use Python as tasks gets more complex. - Break your code up in small, reusable bits: keep individual scripts short, use functions in Python, etc. - Write code to be read by humans, and data to be read by computers. --- ## Foundational skills This course has focused on *foundational skills* and *breadth over depth*. Some of the rationale for this – **foundational skills**: > *Learning core bioinformatics data skills will give you the* > *foundation to learn, apply, and assess any bioinformatics program or analysis* > *method. In 10 years, bioinformaticians may only be using a few of the bioinformatics* > *software programs around today. But we most certainly will be using data skills and* > *experimentation to assess data and methods of the future.* > — Buffalo chapter 1 -- - High turn-over of specific methods... - With foundational skills<sup>[1]</sup>, more specialized methods/tools can be picked up with "just-in-time" workshops or even using good online tutorials. - These skills are broadly applicable, and useful in any data-heavy occupation. .footnote[ <sup>[1]</sup> Though not just the one ones taught here, as we'll get back to. ] --- ## Breadth over depth This course has focused on *foundational skills* and *breadth over depth*. Some of the rationale for this – **breadth over depth**: - The first steps are the hardest – and "false starts" can be very common when trying to self-teach tools that will initially take more time than manual approaches. E.g. it is tempting to keep pushing off using version control, or to keep renaming and doing other repetitive things by hand. - You may not be aware of what's out there. <br> -- But there is a flip side to this: - We've seen few practical case studies. - You may not feel you've become really proficient with any of these tools. --- class:inverse middle center # II: Next steps ---- <br> <br> <br> --- ## Various tips going forward - Keep practicing! - Avoid the "tar pit of immediacy". > *It is important to practice regularly, and protect time to play with new tools. > While it is great to have a specific task in mind when starting to program, > it is also important to leave time to acquire and improve new skills.* > *You will be happier if you think “I have invested two hours in learning Python!” > rather than “It took me two hours to solve this easy problem in Python— > I could have done it in Excel in 20 minutes!” While this might be true when > you start, your investment will pay back in the long run.* > *With a bit of perseverance, you will start reusing code to automate processes, > thereby saving a lot of time. > Even better, you will tackle problems that you could not have addressed at all > without your new skills.* — CSB Chapter 11 --- ## Various tips going forward (cont.) - Develop frequently used scripts into "tools", which you can can keep in a separate repository. - Don't try to re-invent the wheel but use existing libraries / packages / programs for specialized tasks (unless it's just for practice). - **Try to make all figures, tables, and statistics in papers to be the result of scripts.** --- ## Some specific next steps by topic .content-box-info[ As you have hopefully seen, there already was a page on the coure's GitHub site with further resources for Python. I have now changed [that page](https://mcic-osu.github.io/pracs-sp21/resources.html) to hold further resources for all other topics. ] --- ## Some specific next steps by topic – Shell - We actually covered quite a bit here! While you could go a lot deeper, for things well beyond the complexity that we discussed, other tools may be better choices: - Python (or R) – for more complex data tasks - Snakemake – for more complex workflow tasks Instead, I would recommend to focus on getting fluent and picking new things up along the way. --- ## Some specific next steps by topic – Shell - But because we had pretty extensive coverage on the shell, it may make sense to provide a list of commonly-used shell tools and topics we didn't cover: - A command-line text editor like *Nano* or *Vim*. - Creating symbolic links ([slides](https://mcic-osu.github.io/pracs-sp21/posts/week-02/slides/02-3-shfiles.html#44)) - `find` and `xargs` ([slides](https://mcic-osu.github.io/pracs-sp21/posts/week-05/slides/05-3_find-xargs.html#1) – and see Buffalo chapter 7) - `join` to merge tabular data ([slides](https://mcic-osu.github.io/pracs-sp21/posts/week-04/slides/04-1_data-tools.html#55) – and see Buffalo chapter 7) - `bioawk` (for sequence data – see Buffalo chapter 7) - Manual software installation and `$PATH` - Process management and monitoring (Buffalo chapter 4) - tar archives - Parameter expansion - If you like shell scripts, look into parsing arguments with `getopts` --- ## Some specific next steps by topic (cont.) ### Python - If you want to keep using Python, I would probably recommend working your way through a **book fully focused on Python.** See the resources page – the choice of book will depends on your preferred direction: general Python vs Bioinformatics vs Data Science. - Jupyter Notebooks <br> -- ### Git - Merge conflicts ([slides](https://mcic-osu.github.io/pracs-sp21/posts/week-03/slides/03-2_git.html#60)). - Undoing things ([slides](https://mcic-osu.github.io/pracs-sp21/posts/week-03/slides/03-3_git.html#1)). --- ## Some specific next steps by topic (cont.) ### Miscellaneous - Use **VS Code locally** and SSH(-tunnel) in. - Learn about using Docker/Singularity containers. - If you're using Windows, consider switching (or dual-booting with) Ubuntu Linux. --- ## Other foundational skills If you want to keep doing "data-centric" work, be that in genomics / bioinformatics or data science, I would also recommend digging into a couple of topics that we did not touch on at all: - Statistics - Data visualization - R – certainly for bioinformatics and probably also the best choice for the two topics above. <br> .content-box-info[ Reading the rest of the two books will get you started with the latter two! ] --- ## OSU Courses - [Microbiome Science Training Track](https://u.osu.edu/coms/for-trainees/microbiome-informatics-training-track/) by the Center of Microbiome Science - [Biomedical Informatics](https://medicine.osu.edu/departments/biomedical-informatics/education/bmi-courses) courses - [STAT 5730: Introduction to R for Data Science](https://stat.osu.edu/courses/stat-5730) - [PLNTPTH 7003.01: Agricultural Genomics](https://plantpath.osu.edu/courses/plntpth-700301) - HCS7004: Genome Analytics (Jonathan Fresnedo-Ramirez) - HCS7806: Data Visualization (Jessica Cooperstone) <br> .content-box-info[ For an overview of "data science and analytics" courses at OSU, see [this TDAI page](https://tdai.osu.edu/programs/). ] --- ## Workshops **Shorter workshops (mostly 2-3 days):** - [Bioinformatics.ca](https://bioinformaticsdotca.github.io/) – Yearly workshops on topics like microbiome analysis, RNAseq, etc. - [Physalia Courses](https://www.physalia-courses.org/courses-workshops/) - The Carpentries – e.g. the [Genomics Curriculum of Data Carpentry](https://datacarpentry.org/lessons/#genomics-workshop) (Since earlier this year there is also an [OSU chapter](https://tdai.osu.edu/carpentries-at-osu/).) <br> **Longer workshops (> 1 week):** - [Summer Institute in Statistical Genetics](https://si.biostat.washington.edu/suminst/) - [Evomics](http://evomics.org/): "Molecular Evolution", "Genomics", "Phylogenomics", etc. - [CSH Programming for Biologists](http://programmingforbiology.org/) --- ## Misc. at OSC .content-box-info[ Also see: - [Microbiome Informatics webinar series](https://u.osu.edu/coms/webinar-series/) with hands-on materials by the Center of Microbiome Science. - [Code Club](https://biodash.github.io/codeclub/) – weekly meeting, hopefully for credits this fall. ] --- ## Recaps / bonus - [SSH set-up and tunneling](https://mcic-osu.github.io/pracs-sp21/w06_UA_ssh.html) - [Using Conda with Snakemake](https://mcic-osu.github.io/pracs-sp21/posts/week-13/slides/13-3.html#conda) - JupyterLab at OSC --- class: inverse middle center # Questions? ---- <br> <br> <br> <br>