Practical Computing Skills for Omics Data (PLNTPTH 5006)
MCIC Wooster, Ohio State University
2025-08-26
Lead of the CFAES Bioinformatics and Microscopy cores
What I work on
Providing research assistance with omics data analysis
Teaching, such as this course, workshops, and Code Club (https://osu-codeclub.github.io)
Otherwise
Research background in animal evolutionary genomics & speciation
In my free time, I enjoy bird watching – locally & across the world
Bioinformatics Research Associate in the CFAES Bioinformatics cores
Research background
Infectious pathogens, both bacteria and viruses
PhD from CFAH on the Wooster campus
Otherwise
Name
Lab and Department
Research interests and/or current research topics
Why you are taking or attending this course
Something about you that is not work-related, such as a hobby or fun fact
This course tries to enable you to:
Do your research more reproducibly and efficiently
Work with large-scale “omics” datasets
It will focus primarily on general, foundational “computing skills” rather than on specific applications. For example, you will learn to:
Reproducibility and replicability are two related but distinct ideas:
Our focus is on #2.
In general terms, your research is reproducible when you:
“The most basic principle for reproducible research is: Do everything via code.”
—Karl Broman, University of Wisconsin–Madison
How do you think this quote relates to the requirements mentioned above? Do you agree?
We will cover the following practices that benefit reproducibility:
We’ll also revisit the general topic of reproducibility in week 5.
Another motivator: working reproducibly will benefit future you!
See Markowetz's (2015) “Five selfish reasons to work reproducibly” in this week's readings.
Using code also means that you can work more efficiently and improve automation —
this can be particularly useful when you have to, for example:
Omics data is increasingly important in biology, and most notably includes the study of:
The next lecture will introduce omics data in a bit more detail.
What this course does and does not focus on
While we’ll be using some example omics datasets, this course will not comprehensively cover how to perform specific omics analyses or their underlying methods — our focus is primarily on foundational computing skills.
To learn more about specific omics analyses, I highly recommend the follow-up course Genome Analytics (HCS 7004) by Dr. Jonathan Fresnedo-Ramirez.
This course will focus on teaching you computing skills that enable you to:
Do your research more reproducibly and efficiently
Work with large-scale “omics” datasets
The Unix shell (or the “Terminal”) is a command-line interface to computers.
Being able to work in the Unix shell is a key skill for omics data analysis, because a lot of the relevant software is operated using the shell.
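As a first taste, here are a few core shell commands applied to a small made-up FASTA file (the file name and sequences are hypothetical examples, not course data):

```shell
# Create a small example FASTA file (made-up data)
printf ">seq1\nATGCGT\n>seq2\nGGCATT\n" > seqs.fasta

ls -l seqs.fasta           # list the file and its size
grep -c "^>" seqs.fasta    # count sequences: FASTA header lines start with ">"
head -n 2 seqs.fasta       # peek at the first two lines of the file
```

Commands like these chain together naturally, which is a big part of why the shell is so well suited to omics file wrangling.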
Good research project documentation & file organization is a necessary starting point for reproducible research.
You’ll learn best practices for project file organization
You’ll learn how to manage your data and software
You’ll learn project documentation strategies, using a plain-text file format called Markdown.
Markdown
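To give you an idea, here is a hypothetical project notes file written in Markdown — plain text with lightweight formatting marks (created from the shell purely for illustration; the contents are made up):

```shell
# Write a minimal Markdown notes file (made-up example content)
cat > notes.md << 'EOF'
# Project notes

## 2025-08-26
- Downloaded raw FASTQ files to `data/raw/`
- **To do:** run FastQC
EOF

grep -c '^#' notes.md      # Markdown headers start with one or more '#'
```

Because Markdown is plain text, files like this work seamlessly with the shell, Git, and GitHub.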
Using version control, you can more effectively keep track of project progress, collaborate, share code, revisit earlier versions, and undo.
Git is the version control software we will use, and GitHub is the website that hosts Git projects (repositories).
You’ll also use these to hand in your graded assignments.
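A minimal sketch of the basic Git cycle — initialize a repository, record a snapshot (“commit”), and inspect the history. The project and file names are hypothetical; we’ll cover each step properly in class:

```shell
# Create a project folder and turn it into a Git repository
mkdir demo-project && cd demo-project
git init --quiet

# Create a file, stage it, and commit it
echo "# Demo project" > README.md
git add README.md
git -c user.name="Student" -c user.email="student@example.edu" \
    commit --quiet -m "Add README"

git log --oneline          # one line per commit in the history
```

(The `-c user.name=... -c user.email=...` flags just set an identity for this one commit; normally you configure these once with `git config`.)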
Thanks to supercomputer resources, you can work with very large datasets at speed — running hundreds of analyses in parallel, and using far more memory and storage space than a personal computer has.
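For example, analyses on a supercomputer are typically submitted as “batch jobs” through a scheduler such as Slurm. The script below is a hypothetical sketch — the account name, resource amounts, and the `my_analysis` program are made-up placeholders, and options vary by cluster:

```shell
#!/bin/bash
#SBATCH --account=PAS0001        # hypothetical project account
#SBATCH --time=01:00:00          # request 1 hour of walltime
#SBATCH --cpus-per-task=8        # request 8 CPU cores
#SBATCH --mem=32G                # request 32 GB of memory

# Run a (hypothetical) analysis program with 8 threads
my_analysis --threads 8 --input data/sample.fastq.gz
```

You submit such a script with a single command, and the scheduler runs it on a compute node when resources become available.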
An omics data analysis often consists of many consecutive steps, each using different programs. A pipeline written with a workflow manager enables you to run and rerun all of this with a single command.
While the Unix shell environment is the best choice for “initial” processing of a lot of omics data, R is the most prominent language in “downstream” statistical analysis and data visualization.
In this course, you will learn the basics of R, how to visualize data in R, and how you can use specialized “packages” for omics data analysis.
R vs. Python
Python is also commonly used, but overall, I believe R is the better choice for this course. Python is a great follow-up language for those seeking to specialize in bioinformatics.
Do you use generative AI on a daily basis?
Should I even learn to code nowadays?
For what it’s worth, we don’t believe that learning to code has become pointless in the age of generative AI. Might coding skills become less broadly useful, for example for “generalist” biological researchers? Possibly — but even that is far from clear.
The Unix shell and shell scripts
Project file organization & documentation
Version control with Git and GitHub
Automated workflow management with Nextflow
R for downstream data analysis and visualization
Using generative AI to help with coding
Please stay muted by default, but feel free to unmute yourself to ask questions at any time!
You can also ask questions in the Zoom chat.
Having your camera turned on is appreciated! 😊
For this to work well, you’ll need to simultaneously see the Zoom window with my screen share and one or more windows on your own computer.
Therefore, I recommend using large and/or multiple monitors if you have access to those.
Also, try to be prepared to share your screen for troubleshooting purposes at any time.
We’ll now do a quick tour of the GitHub website!
You can book through CarmenCanvas or by emailing us.
You can earn a total of 100 points across 6 assignments and 4 final project checkpoints.
A total of 6 graded assignments, worth 10 points each, are due on Sundays in the following weeks:
| Nr. | Topic | Week |
|---|---|---|
| 1 | Shell basics | 2 |
| 2 | Markdown & Git | 4 |
| 3 | Shell scripting | 6 |
| 4 | OSC batch jobs | 8 |
| 5 | Nextflow | 10 |
| 6 | R | 13 |
The first assignment is submitted simply by storing your files at OSC, while the others are submitted via GitHub, giving you extra practice with it.
Plan and implement a small data processing and/or analysis project, with the following checkpoints:
| Checkpoint | What | Due | Points |
|---|---|---|---|
| 1 | Proposal | Week 12 | 5 |
| 2 | Draft | Week 14 | 5 |
| 3 | Oral presentations | Week 16 | 10 |
| 4 | Final submission | Week 16 | 20 |
Data sets for the final project
Ideally, you have or eventually develop your own idea for a dataset and analysis — this may for example allow you to do something that’s directly useful for your own research. If not, we can provide you with a dataset.
More information about the final project will follow later in the course.
You are allowed and in some cases encouraged to use AI for certain assignments and exercises: this will be clearly stated on a case-by-case basis.
Weekly readings
Exercises (in weeks that do not have graded assignments)
Occasional small ungraded assignments such as surveys and account setup.
In certain weeks, you will be asked to read 1 or 2 papers. Additionally:
You are always expected to reread and practice with the lecture material we go through in class.
Any lecture-page content we don’t cover in class automatically becomes required self-study material.
Callout boxes
In particular, we’ll often skip or gloss over content in “callout boxes” such as this one: read those in your own time.
Like books?
In most weeks, chapters from the following two books are listed as “further resources” (optional reading), which are available online through the OSU library:
We will have an optional but highly recommended weekly recitation meeting on Mondays, to go over the exercises for the preceding week.
Practice is key to learning these skills!
This course is intended to be highly practical. If you don’t spend time practicing by yourself, you may not get all that much out of it.
If you would like to join these sessions, please indicate your availability using this poll.
Lectures:
Homework: