Course Intro

Practical Computing Skills for Omics Data (PLNTPTH 5006)

Jelmer Poelstra

MCIC Wooster, Ohio State University

2025-08-26

Personal introductions

Introductions: Jelmer (instructor)

Lead of the CFAES Bioinformatics and Microscopy cores

  • Part of what is currently known as the Molecular & Cellular Imaging Center (MCIC)
  • We are now officially under CFAES Research & Analytical Service Cores (RASC), a set of core facilities providing services in molecular biology, high-throughput sequencing, bioinformatics, microscopy, and soil analyses.

What I work on

Otherwise

  • Research background in animal evolutionary genomics & speciation

  • In my free time, I enjoy bird watching – locally & across the world

Introductions: Menuka Bhandari (co-instructor)

Bioinformatics Research Associate in the CFAES Bioinformatics Core

  • Joined just 2 weeks ago!

Research background

  • Infectious pathogens, both bacteria and viruses

  • PhD from CFAH on the Wooster campus

Otherwise

  • I like to try new Nepali food recipes and play sports during my free time

Introductions: You

  • Name

  • Lab and Department

  • Research interests and/or current research topics

  • Why you are taking or attending this course

  • Something about you that is not work-related, such as a hobby or fun fact

Course content: goals and background

The core goals of this course

This course aims to enable you to:

  • Do your research more reproducibly and efficiently

  • Work with large-scale “omics” datasets


It will focus primarily on general, foundational “computing skills” rather than on specific applications. For example, you will learn to:

  • Code in the Unix shell and R
  • Organize, document, manage, and share your research data, code, and results
  • Work with a remote supercomputer
  • Write automated analysis pipelines

Course background I: Reproducibility

Reproducibility and replicability are two related but distinct ideas:

  1. Research is replicable when independent experiments produce the same results
  2. Research is reproducible when the same data produce the same results (a lower bar!)

Our focus is on #2.

Course background I: Reproducibility (cont.)

In general terms, your research is reproducible when you:

  • Share your data
  • Share your methods — in sufficient detail for anyone to redo what you did
  • Ensure that all reported data, methods, and results are congruent

“The most basic principle for reproducible research is: Do everything via code.”
—Karl Broman, University of Wisconsin–Madison

How do you think this quote relates to the requirements mentioned above? Do you agree?

Course background I: Reproducibility (cont.)

We will cover the following practices that benefit reproducibility:

  • Using code, and following best practices while doing so (throughout)
  • Detailed project documentation (week 2)
  • Good project file organization (week 2)
  • Data and code management & sharing (weeks 4 & 5)
  • Using open-source software with “containers” (week 6)
  • Reporting a clear protocol (week 7) or creating a pipeline to (re)run your analyses (weeks 9-10)

We’ll also revisit the general topic of reproducibility in week 5.


Another motivator: working reproducibly will benefit future you!

See “Five selfish reasons to work reproducibly” (Markowetz 2015) in this week’s readings.

Course background II: Efficiency and automation

Using code also means that you can work more efficiently and automate your analyses.
This can be particularly useful when you have to, for example:

  • Do repetitive tasks
  • Recreate a figure or redo an analysis after adding a sample
  • Redo all analyses after uncovering a mistake in the first data processing step 😳

Course background III: Omics data

Omics data is increasingly important in biology, and most notably includes the study of:


  • DNA at the (near) whole-genome level: Genomics
  • Expressed RNA at the (near) whole-transcriptome level: Transcriptomics
  • Expressed protein at the (near) whole-proteome level: Proteomics

The next lecture will introduce omics data in a bit more detail.


What this course does and does not focus on

  • While we’ll be using some example omics datasets, this course will not comprehensively cover how to perform specific omics analyses or their underlying methods — our focus is primarily on foundational computing skills.

  • To learn more about specific omics analyses, I highly recommend the follow-up course Genome Analytics (HCS 7004) by Dr. Jonathan Fresnedo-Ramirez.

Recap: course goals and background

This course will focus on teaching you computing skills that enable you to:

  • Do your research more reproducibly and efficiently

  • Work with large-scale “omics” datasets

Course content: topics

The Unix shell & shell scripts

The Unix shell (or the “Terminal”) is a command-line interface to computers.

Being able to work in the Unix shell is a key skill for omics data analysis, because a lot of the relevant software is operated using the shell.


  • You’ll spend a lot of time with the Unix shell, starting next week.
  • You’ll also write shell scripts, and will use an editor called VS Code for this and other purposes.
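To give a first impression, here is a minimal sketch of the kind of shell script you will learn to write (the FASTA file and its contents are made up for illustration):

```shell
#!/bin/bash
set -euo pipefail                        # safer Bash settings: stop on errors

# Create a tiny example FASTA file (for demonstration only)
printf '>seq1\nACGT\n>seq2\nGGCCTA\n' > example.fasta

# Count the sequences: each sequence header line starts with ">"
n_seqs=$(grep -c ">" example.fasta)
echo "Number of sequences: $n_seqs"      # prints "Number of sequences: 2"
```

Don’t worry if none of this makes sense yet — by the end of the course, scripts like this will be routine.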

The Bash logo.

Bash (shell language)

The VS Code logo.

VS Code

Project file organization & documentation

Good research project documentation & file organization is a necessary starting point for reproducible research.


  • You’ll learn best practices for project file organization

  • You’ll learn how to manage your data and software

  • You’ll learn project documentation strategies, and for documentation will use a plain-text file format called Markdown.
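As a small preview, Markdown lets you write formatted documentation in plain-text files. A hypothetical snippet:

```markdown
# Project README

An analysis of *tomato* RNA-seq data.

## Steps

1. Download the data
2. Run the pipeline

See the [nf-core website](https://nf-co.re) for pipeline details.
```

Here, `#` marks a heading, `*asterisks*` italicize, and `[text](URL)` creates a link.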


The Markdown logo.

Markdown

Version control with Git and GitHub

Using version control, you can more effectively keep track of project progress, collaborate, share code, revisit earlier versions, and undo.


  • Git is the version control software we will use, and GitHub is the website that hosts Git projects (repositories).

  • You’ll also use these to hand in your graded assignments.
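As a preview, the basic Git cycle boils down to a handful of shell commands. A minimal sketch (the repository and file names are made up for illustration):

```shell
# Set up a toy repository to demonstrate the basic Git cycle
mkdir demo-repo && cd demo-repo
git init -q                                # turn this directory into a Git repository
git config user.name "Example Student"     # tell Git who you are (normally done once)
git config user.email "student@example.com"

echo "# My project" > README.md            # create a file to track
git add README.md                          # stage the file for the next commit
git commit -q -m "Add project README"      # record a snapshot with a message
git log --oneline                          # view the commit history
```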



The Git logo.

The GitHub logo.

High-performance computing with OSC

Thanks to supercomputer resources, you can work with very large datasets at speed — running up to 100s of analyses in parallel, and using much larger amounts of memory and storage space than a personal computer has.


  • You will use the Ohio Supercomputer Center (OSC) throughout the course, and will get a brief intro to it this week
  • In week 5, you’ll learn how to manage data and software at OSC
  • In week 7, you’ll learn to submit shell scripts at OSC as “batch jobs”
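A batch job is simply a shell script with scheduler directives at the top. A minimal sketch of what a Slurm batch script looks like (resource values are arbitrary examples; you would submit it with `sbatch`):

```shell
#!/bin/bash
#SBATCH --job-name=example        # job name shown in the queue
#SBATCH --time=00:10:00           # requested walltime (hh:mm:ss)
#SBATCH --cpus-per-task=1         # number of CPU cores to reserve
#SBATCH --output=slurm-%j.out     # file for the job's output (%j = job ID)

# The commands below run on a compute node once the job starts
node=$(hostname)
echo "Job running on node: $node"
```

The `#SBATCH` lines are ignored by Bash (they are comments) but read by the Slurm scheduler.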


The OSC logo.



The Apptainer logo.

The Slurm logo.

Automated workflow management

An omics data analysis often consists of many consecutive steps, each using different programs. A pipeline written with a workflow manager enables you to run and rerun all of this with a single command.


  • You’ll use the workflow language Nextflow to build your pipelines
  • You will also learn how to use comprehensive, best-practice omics data Nextflow pipelines produced by the nf-core initiative


The Nextflow logo.

The nf-core logo.

R for downstream data analysis and visualization

While the Unix shell environment is the best choice for “initial” processing of a lot of omics data, R is the most prominent language in “downstream” statistical analysis and data visualization.

In this course, you will learn the basics of R, how to visualize data in R, and how you can use specialized “packages” for omics data analysis.


The R logo.


The RStudio logo.

The Quarto logo.

R vs. Python

Python is also commonly used, but I believe that, all things considered, R is a far better choice for this course. Python is a great follow-up language for those seeking to specialize in bioinformatics.

Using generative AI to help with coding

  • Recently, generative AI (genAI; also referred to as Large Language Models, or LLMs) tools such as ChatGPT have become ubiquitous.

Do you use generative AI on a daily basis?


  • One major use case of generative AI is that it can help you code, and we’ll cover this in week 8.

Should I even learn to code nowadays?

For what it’s worth, we believe that learning to code is still useful in the age of generative AI. Might coding become less broadly useful, for example for “generalist” biological researchers? That may well happen, but it isn’t clear that it will.

Recap: course topics

  • The Unix shell and shell scripts

  • Project file organization & documentation

  • Version control with Git and GitHub

  • Automated workflow management with Nextflow

  • R for downstream data analysis and visualization

  • Using generative AI to help with coding

Course practicalities: general

Zoom

  • Stay muted by default, but feel free to unmute yourself to ask questions at any time!

  • You can also ask questions in the Zoom chat.

  • Having your camera turned on is appreciated! 😊

Participatory live coding

  • Most class time will be spent doing “participatory live coding”, where I demonstrate the typing and running of code live, and you’re expected to follow along on your own computer.
  • For this to work well, you’ll need to simultaneously see the Zoom window with my screen share and one or more windows on your own computer.

  • Therefore, I recommend using large and/or multiple monitors if you have access to those.


Also, be prepared to share your screen for troubleshooting purposes at any time.

Course websites

  • The main website for this course is this “GitHub website”, which contains:
    • Overviews of each week & readings
    • Slide decks (only two!) and lecture pages (for all other lectures)
    • Assignments and exercises

We’ll now do a quick tour of the GitHub website!


  • There is also a CarmenCanvas site for this course, which is primarily used to:
    • Send weekly announcements about the upcoming week (turn on Notifications!)
    • Put deadlines on the calendar for you

Office hours

  • Jelmer: Wednesdays at 1-3 pm
  • Menuka: Fridays at 1-3 pm

You can book through CarmenCanvas or by emailing us.

Course practicalities: homework and grading

What your grade is made up of

You can earn a total of 100 points across 6 assignments and 4 final project checkpoints.

Graded assignments (60 pts)

A total of 6 graded assignments, worth 10 points each, are due on Sundays in the following weeks:

Nr.  Topic            Week
1.   Shell basics     2
2.   Markdown & Git   4
3.   Shell scripting  6
4.   OSC batch jobs   8
5.   Nextflow         10
6.   R                13

The first one is submitted simply by storing your files at OSC, while all others are submitted via GitHub so you can get more practice with that.

Final project (40 pts)

Plan and implement a small data processing and/or analysis project, with the following checkpoints:

Checkpoint  What                 Due      Points
1.          Proposal             week 12  5
2.          Draft                week 14  5
3.          Oral presentations   week 16  10
4.          Final submission     week 16  20


Data sets for the final project

Ideally, you have or eventually develop your own idea for a dataset and analysis — this may for example allow you to do something that’s directly useful for your own research. If not, we can provide you with a dataset.

More information about the final project will follow later in the course.

Using generative AI for graded assignments

You are allowed and in some cases encouraged to use AI for certain assignments and exercises: this will be clearly stated on a case-by-case basis.

Ungraded homework

  • Weekly readings

  • Exercises (in weeks that do not have graded assignments)

  • Occasional small ungraded assignments, such as surveys and account setup

Readings

In certain weeks, you will be asked to read 1 or 2 papers. Additionally:

  • You are always expected to reread and practice with the lecture material we go through in class.

  • Lecture page content that we don’t cover in class automatically becomes required self-study material.

Callout boxes

In particular, we’ll often skip or gloss over content in “callout boxes” such as this one: read those in your own time.


Like books?

In most weeks, chapters from the following two books are listed as “further resources” (optional reading), which are available online through the OSU library:

Weekly recitation on Monday

We will have an optional but highly recommended weekly recitation meeting on Mondays, to go over the exercises for the preceding week.


Practice is key to learning these skills!

This course is intended to be highly practical. If you don’t spend time practicing by yourself, you may not get all that much out of it.


If you would like to join these sessions, please indicate your availability using this poll.

Rest of this week

Lectures:


Homework:

Questions?





References

Allesina, Stefano. 2019. Computing Skills for Biologists: A Toolbox. Princeton, NJ: Princeton University Press. https://doi.org/10.1515/9780691183961.
Buffalo, Vince. 2015. Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools. First edition. Beijing: O’Reilly.
Markowetz, Florian. 2015. “Five Selfish Reasons to Work Reproducibly.” Genome Biology 16 (1): 1–4. https://doi.org/10.1186/s13059-015-0850-7.
Poinsignon, Thibault, Pierre Poulain, Mélina Gallopin, and Gaëlle Lelandais. 2023. “Working with Omics Data: An Interdisciplinary Challenge at the Crossroads of Biology and Computer Science.” In, 313–30. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3195-9_10.