Course Intro

Practical Computing Skills for Omics Data (PLNTPTH 5006)

Jelmer Poelstra

MCIC Wooster, Ohio State University

2025-08-26

Personal introductions

Introductions: Jelmer (instructor)

Lead of the CFAES Bioinformatics and Microscopy cores

  • Part of what is currently known as the Molecular & Cellular Imaging Center (MCIC)
  • We are now officially under CFAES Research & Analytical Service Cores (RASC), a set of core facilities providing services in molecular biology, high-throughput sequencing, bioinformatics, microscopy, and soil analyses.

What I work on

Otherwise

  • Research background in animal evolutionary genomics & speciation

  • In my free time, I enjoy bird watching – locally & across the world

Introductions: Menuka Bhandari (co-instructor)

Bioinformatics Research Associate in the CFAES Bioinformatics Core

  • Joined just 2 weeks ago!

Research background

  • Infectious pathogens, both bacteria and viruses

  • PhD from CFAH on the Wooster campus

Otherwise

  • I like to try new Nepali food recipes and play sports during my free time

Introductions: You

  • Name

  • Lab and Department

  • Research interests and/or current research topics

  • Why you are taking or attending this course

  • Something about you that is not work-related, such as a hobby or fun fact

Course content: goals and background

The core goals of this course

This course aims to enable you to:

  • Do your research more reproducibly and efficiently

  • Work with large-scale “omics” datasets


It will focus primarily on general, foundational “computing skills” rather than on specific applications. For example, you will learn to:

  • Code in the Unix shell and R
  • Organize, document, manage, and share your research data, code, and results
  • Work with a remote supercomputer
  • Write automated analysis pipelines

Course background I: Reproducibility

Reproducibility and replicability are two related but distinct ideas:

  1. Research is replicable when independent experiments produce the same results
  2. Research is reproducible when the same data produce the same results (a lower bar!)

Our focus is on #2.

Course background I: Reproducibility (cont.)

In general terms, your research is reproducible when you:

  • Share your data
  • Share your methods — in sufficient detail for anyone to redo what you did
  • Ensure that all reported data, methods, and results are congruent

“The most basic principle for reproducible research is: Do everything via code.”
—Karl Broman, University of Wisconsin–Madison

How do you think this quote relates to the requirements mentioned above? Do you agree?

Course background I: Reproducibility (cont.)

We will cover the following practices that benefit reproducibility:

  • Using code, and following best practices while doing so (throughout)
  • Detailed project documentation (week 2)
  • Good project file organization (week 2)
  • Data and code management & sharing (weeks 4 & 5)
  • Using open-source software with “containers” (week 6)
  • Reporting a clear protocol (week 7) or creating a pipeline to (re)run your analyses (weeks 9-10)

We’ll also revisit the general topic of reproducibility in week 5.


Another motivator: working reproducibly will benefit future you!

See “Five selfish reasons to work reproducibly” (Markowetz 2015) in this week’s readings.

Course background II: Efficiency and automation

Using code also means that you can work more efficiently and automate your analyses.
This can be particularly useful when you have to, for example:

  • Do repetitive tasks
  • Recreate a figure or redo an analysis after adding a sample
  • Redo all analyses after uncovering a mistake in the first data processing step 😳

Course background III: Omics data

Omics data is increasingly important in biology, and most notably includes the study of:


  • DNA at the (near) whole-genome level: Genomics
  • Expressed RNA at the (near) whole-transcriptome level: Transcriptomics
  • Expressed protein at the (near) whole-proteome level: Proteomics

The next lecture will introduce omics data in a bit more detail.


What this course does and does not focus on

  • While we’ll be using some example omics datasets, this course will not comprehensively cover how to perform specific omics analyses or their underlying methods — our focus is primarily on foundational computing skills.

  • To learn more about specific omics analyses, I highly recommend the follow-up course Genome Analytics (HCS 7004) by Dr. Jonathan Fresnedo-Ramirez.

Recap: course goals and background

This course will focus on teaching you computing skills that enable you to:

  • Do your research more reproducibly and efficiently

  • Work with large-scale “omics” datasets

Course content: topics

The Unix shell & shell scripts

The Unix shell (or the “Terminal”) is a command-line interface to computers.

Being able to work in the Unix shell is a key skill for omics data analysis, because a lot of the relevant software is operated using the shell.


  • You’ll spend a lot of time with the Unix shell, starting next week.
  • You’ll also write shell scripts, and will use an editor called VS Code for this and other purposes.
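To give a first impression, here is a minimal sketch of the kind of shell script you will learn to write (the FASTA file and its contents are made up for illustration):

```shell
#!/bin/bash
set -euo pipefail                        # safer Bash settings: stop on errors

# Create a tiny example FASTA file (for demonstration only)
printf '>seq1\nACGT\n>seq2\nGGCCTA\n' > example.fasta

# Count the sequences: each sequence header line starts with ">"
n_seqs=$(grep -c ">" example.fasta)
echo "Number of sequences: $n_seqs"      # prints "Number of sequences: 2"
```

Don’t worry if none of this makes sense yet — by the end of the course, scripts like this will be routine.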

The Bash logo.

Bash (shell language)

The VS Code logo.

VS Code

Project file organization & documentation

Good research project documentation & file organization is a necessary starting point for reproducible research.


  • You’ll learn best practices for project file organization

  • You’ll learn how to manage your data and software

  • You’ll learn project documentation strategies, and for documentation will use a plain-text file format called Markdown.
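As a small preview, Markdown lets you write formatted documentation in plain-text files. A hypothetical snippet:

```markdown
# Project README

An analysis of *tomato* RNA-seq data.

## Steps

1. Download the data
2. Run the pipeline

See the [nf-core website](https://nf-co.re) for pipeline details.
```

Here, `#` marks a heading, `*asterisks*` italicize, and `[text](URL)` creates a link.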


The Markdown logo.

Markdown

Version control with Git and GitHub

Using version control, you can more effectively keep track of project progress, collaborate, share code, revisit earlier versions, and undo.


  • Git is the version control software we will use, and GitHub is the website that hosts Git projects (repositories).

  • You’ll also use these to hand in your graded assignments.
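As a preview, the basic Git cycle boils down to a handful of shell commands. A minimal sketch (the repository and file names are made up for illustration):

```shell
# Set up a toy repository to demonstrate the basic Git cycle
mkdir demo-repo && cd demo-repo
git init -q                                # turn this directory into a Git repository
git config user.name "Example Student"     # tell Git who you are (normally done once)
git config user.email "student@example.com"

echo "# My project" > README.md            # create a file to track
git add README.md                          # stage the file for the next commit
git commit -q -m "Add project README"      # record a snapshot with a message
git log --oneline                          # view the commit history
```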



The Git logo.

The GitHub logo.

High-performance computing with OSC

Thanks to supercomputer resources, you can work with very large datasets at speed — running up to 100s of analyses in parallel, and using much larger amounts of memory and storage space than a personal computer has.


  • You will use the Ohio Supercomputer Center (OSC) throughout the course, and will get a brief intro to it this week
  • In week 5, you’ll learn how to manage data and software at OSC
  • In week 7, you’ll learn to submit shell scripts at OSC as “batch jobs”
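A batch job is simply a shell script with scheduler directives at the top. A minimal sketch of what a Slurm batch script looks like (resource values are arbitrary examples; you would submit it with `sbatch`):

```shell
#!/bin/bash
#SBATCH --job-name=example        # job name shown in the queue
#SBATCH --time=00:10:00           # requested walltime (hh:mm:ss)
#SBATCH --cpus-per-task=1         # number of CPU cores to reserve
#SBATCH --output=slurm-%j.out     # file for the job's output (%j = job ID)

# The commands below run on a compute node once the job starts
node=$(hostname)
echo "Job running on node: $node"
```

The `#SBATCH` lines are ignored by Bash (they are comments) but read by the Slurm scheduler.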


The OSC logo.



The Apptainer logo.

The Slurm logo.

Automated workflow management

An omics data analysis often consists of many consecutive steps, each using different programs. A pipeline written with a workflow manager enables you to run and rerun all of this with a single command.


  • You’ll use the workflow language Nextflow to build your pipelines
  • You will also learn how to use comprehensive, best-practice omics data Nextflow pipelines produced by the nf-core initiative


The Nextflow logo.

The nf-core logo.

R for downstream data analysis and visualization

While the Unix shell environment is the best choice for “initial” processing of a lot of omics data, R is the most prominent language in “downstream” statistical analysis and data visualization.

In this course, you will learn the basics of R, how to visualize data in R, and how you can use specialized “packages” for omics data analysis.


The R logo.


The RStudio logo.

The Quarto logo.

R vs. Python

Python is also commonly used, but I believe that, all things considered, R is a far better choice for this course. Python is a great follow-up language for those seeking to specialize in bioinformatics.

Using generative AI to help with coding

  • Recently, generative AI (genAI; also referred to as Large Language Models, or LLMs) tools such as ChatGPT have become ubiquitous.

Do you use generative AI on a daily basis?


  • One major use case of generative AI is that it can help you code, and we’ll cover this in week 8.

Should I even learn to code nowadays?

For what it’s worth, we believe that learning to code is still useful in the age of generative AI. Might coding become less broadly useful, for example for “generalist” biological researchers? That may well happen, but it isn’t clear that it will.

Recap: course topics

  • The Unix shell and shell scripts

  • Project file organization & documentation

  • Version control with Git and GitHub

  • Automated workflow management with Nextflow

  • R for downstream data analysis and visualization

  • Using generative AI to help with coding

Course practicalities: general

Zoom

  • Stay muted by default, but feel free to unmute yourself to ask questions at any time!

  • You can also ask questions in the Zoom chat.

  • Having your camera turned on is appreciated! 😊

Participatory live coding

  • Most class time will be spent doing “participatory live coding”, where I demonstrate the typing and running of code live, and you’re expected to follow along on your own computer.
  • For this to work well, you’ll need to simultaneously see the Zoom window with my screen share and one or more windows on your own computer.

  • Therefore, I recommend using large and/or multiple monitors if you have access to those.


Also, be prepared to share your screen for troubleshooting purposes at any time.

Course websites

  • The main website for this course is this “GitHub website”, which contains:
    • Overviews of each week & readings
    • Slide decks (only two!) and lecture pages (for all other lectures)
    • Assignments and exercises

We’ll now do a quick tour of the GitHub website!


  • There is also a CarmenCanvas site for this course, which is primarily used to:
    • Send weekly announcements about the upcoming week (turn on Notifications!)
    • Put deadlines on the calendar for you

Office hours

  • Jelmer: Wednesdays at 1-3 pm
  • Menuka: Fridays at 1-3 pm

You can book through CarmenCanvas or by emailing us.

Course practicalities: homework and grading

What your grade is made up of

You can earn a total of 100 points across 6 assignments and 4 final project checkpoints.

Graded assignments (60 pts)

A total of 6 graded assignments, worth 10 points each, are due on Sundays in the following weeks:

Nr.  Topic            Week
1.   Shell basics     2
2.   Markdown & Git   4
3.   Shell scripting  6
4.   OSC batch jobs   8
5.   Nextflow         10
6.   R                13

The first one is submitted simply by storing your files at OSC, while all others are submitted via GitHub so you can get more practice with that.

Final project (40 pts)

Plan and implement a small data processing and/or analysis project, with the following checkpoints:

Checkpoint  What                 Due      Points
1.          Proposal             week 12  5
2.          Draft                week 14  5
3.          Oral presentations   week 16  10
4.          Final submission     week 16  20


Data sets for the final project

Ideally, you have or eventually develop your own idea for a dataset and analysis — this may for example allow you to do something that’s directly useful for your own research. If not, we can provide you with a dataset.

More information about the final project will follow later in the course.

Using generative AI for graded assignments

You are allowed and in some cases encouraged to use AI for certain assignments and exercises: this will be clearly stated on a case-by-case basis.

Ungraded homework

  • Weekly readings

  • Exercises (in weeks that do not have graded assignments)

  • Occasional small ungraded assignments, such as surveys and account setup

Readings

In certain weeks, you will be asked to read 1 or 2 papers. Additionally:

  • You are always expected to reread and practice with the lecture material we go through in class.

  • Lecture page content that we don’t cover in class automatically becomes required self-study material.

Callout boxes

In particular, we’ll often skip or gloss over content in “callout boxes” such as this one: read those in your own time.


Like books?

In most weeks, chapters from the following two books are listed as “further resources” (optional reading), which are available online through the OSU library:

Weekly recitation on Monday

We will have an optional but highly recommended weekly recitation meeting on Mondays, to go over the exercises for the preceding week.


Practice is key to learning these skills!

This course is intended to be highly practical. If you don’t spend time practicing by yourself, you may not get all that much out of it.


If you would like to join these sessions, please indicate your availability using this poll.

Rest of this week

Lectures:


Homework:

Questions?





References

Allesina, Stefano. 2019. Computing Skills for Biologists: A Toolbox. Princeton, NJ: Princeton University Press. https://doi.org/10.1515/9780691183961.
Buffalo, Vince. 2015. Bioinformatics Data Skills: Reproducible and Robust Research with Open Source Tools. First edition. Beijing: O’Reilly.
Markowetz, Florian. 2015. “Five Selfish Reasons to Work Reproducibly.” Genome Biology 16 (1): 1–4. https://doi.org/10.1186/s13059-015-0850-7.
Poinsignon, Thibault, Pierre Poulain, Mélina Gallopin, and Gaëlle Lelandais. 2023. “Working with Omics Data: An Interdisciplinary Challenge at the Crossroads of Biology and Computer Science.” In, 313–30. Humana, New York, NY. https://doi.org/10.1007/978-1-0716-3195-9_10.