With a focus on high-throughput sequencing data
CFAES Bioinformatics Core, OSU
2025-08-26
A brief general overview of omics data
An intro to high-throughput sequencing, a key technology that produces omics data
An intro to Illumina sequencing, a key HTS technology
An idea of what reference genomes are and what they are for
An overview of the course’s main example dataset
Copyright ThermoFisher
-omics
The “omics” suffix indicates the involvement of large-scale datasets — in the sense that, for example, “genomics” data typically spans much or all of the genome.
While the boundaries can be fuzzy, sequencing a single gene in a single organism is not genomics, and running qPCR for a handful of genes is not transcriptomics.
Omics type | Molecule type | Data mainly produced by |
---|---|---|
Genomics | DNA | High-throughput sequencing (HTS) |
Epigenomics | DNA modifications | High-throughput sequencing (HTS) |
Transcriptomics | RNA | High-throughput sequencing (HTS) |
Proteomics | Proteins | Mass spectrometry |
Metabolomics | Metabolites | Mass spectrometry |
HTS and the resulting data are the focus of the rest of this lecture, and are used in examples throughout the course.
Note that nearly all HTS currently involves DNA sequencing, including epigenomics and transcriptomics data (e.g., RNA is reverse-transcribed before sequencing).
Poinsignon et al. (2023): Working with omics data: An interdisciplinary challenge at the crossroads of biology and computer science
Figure 2 from the paper.
Sanger sequencing (since 1977)
Sequencing of a single, short DNA fragment at a time. The fragment is typically PCR-amplified and therefore specifically targeted.
High-throughput sequencing (HTS, since 2005)
Sequencing of hundreds of thousands to billions of DNA fragments at a time. Fragments can be targeted in various ways or randomly generated from input DNA.
Reads and sequencing errors
Sequenced DNA fragments are referred to as “reads”. With current technologies, reads are never 100% accurate, and this has large consequences for downstream analyses.
Property | Short-read HTS | Long-read HTS |
---|---|---|
Usage | Most common | Less common (but increasing) |
Main companies | Illumina | Oxford Nanopore Technologies (ONT) & Pacific Biosciences (PacBio) |
Timeline | Since 2005 — technology fairly stable | Since 2011 — still rapid development |
Read lengths | 50-300 bp | 10-100+ kbp |
Error rates | Mostly <0.1% | 1-10% (ONT) / <0.1-10% (PacBio) |
Throughput | Higher | Lower |
Cost per base | Lower | Higher |
AKA | Next-Generation Sequencing (NGS) | Third-generation sequencing |
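To get a feel for what the read lengths and error rates in the table mean in practice, the sketch below estimates the expected number of erroneous bases per read. It assumes a uniform, independent per-base error rate, which is a simplification, and the specific read lengths and rates are illustrative values picked from the ranges in the table, not measurements.

```python
def expected_errors(read_length_bp: int, error_rate: float) -> float:
    """Expected number of wrong bases in one read.

    Assumes every base has the same, independent chance of being wrong
    (a simplification: real error rates vary along a read).
    """
    return read_length_bp * error_rate


# Illustrative values taken from the ranges in the table above
short_read = expected_errors(read_length_bp=150, error_rate=0.001)    # 150 bp at 0.1%
long_read = expected_errors(read_length_bp=20_000, error_rate=0.05)   # 20 kbp at 5%

print(f"Short read: about {short_read:.2f} expected errors")
print(f"Long read:  about {long_read:.0f} expected errors")
```

Under these assumptions, most short reads contain no errors at all, while a single noisy long read can contain hundreds, which is one reason error handling differs between the two kinds of data.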
HTS and other omics data analysis can often be roughly divided into two consecutive stages:
Stage I
Stage II
In a sequencing context, a “library” is a collection of nucleic acid fragments ready for sequencing.
We’ll go into some specifics of Illumina library prep because this is the most common type of HTS, and we’ll use Illumina read files as examples throughout the course.
As shown in the previous slide, after library prep, each DNA fragment is flanked by several types of short sequences that together make up the “adapters”:
Multiplexing!
Adapters can include so-called “indices” or “barcodes” that identify individual samples. That way, up to 96 samples can be combined (multiplexed) into a single library, i.e. into a single tube.
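The idea behind multiplexing can be sketched in a few lines of Python: after sequencing, each read is assigned back to its sample by looking up its barcode. This is only a toy illustration. The barcode sequences, sample names, and the `demultiplex` function are hypothetical, and in practice this step is usually handled by dedicated software before you receive your read files.

```python
from collections import defaultdict

# Hypothetical barcode-to-sample table; real index sequences depend on the library prep kit
BARCODES = {"ACGTAC": "sample_A", "TGCATG": "sample_B"}


def demultiplex(reads, barcodes):
    """Group (barcode, sequence) pairs by sample.

    Reads whose barcode is not in the table are collected under 'undetermined'.
    """
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcodes.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)


# Toy reads: (barcode, fragment sequence)
reads = [
    ("ACGTAC", "GGATTC"),
    ("TGCATG", "CCTAGA"),
    ("ACGTAC", "TTGACA"),
    ("NNNNNN", "AAAAAA"),  # unreadable barcode
]

result = demultiplex(reads, BARCODES)
for sample, sequences in result.items():
    print(sample, sequences)
```

This sketch requires an exact barcode match; tolerating a mismatch in the barcode is a common refinement that is left out here.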
DNA fragments can be sequenced from both ends, as shown below. This is called “paired-end” (PE) sequencing:
The DNA fragment’s size (“insert size”) can vary, both by design and because of limited precision in size selection. In some cases, it is:
Many HTS applications either require a “reference genome” or involve its production. What exactly does reference genome refer to? It usually includes:
An assembly
A representation of most or all of the genome DNA sequence: the genome assembly
An annotation
Provides e.g. locations of genes and other genomic “features” in the corresponding genome assembly, and functional information for these features
Taxonomic identity
Reference genomes are typically applicable at the species level. For example, if you work with maize, you want a Zea mays reference genome. But:
Chromosomes, scaffolds, and contigs
Nearly all genome assemblies are incomplete and fragmented to some extent. Therefore, in addition to complete chromosome sequences, assemblies may contain:
Library prep makes nucleic acid fragments ready to be sequenced, e.g. by attaching adapter and sample barcode sequences
Illumina sequencing is often done with paired-end reads, and with multiple/many samples at a time (multiplexing)
Throughout the course, we’ll use an example/practice data set from Garrigós et al. (2025):
This paper uses paired-end Illumina RNA-Seq data to study gene expression in Culex pipiens mosquitos infected with two different malaria-causing Plasmodium protozoans.
This is what the Adobe Firefly AI came up with when I tried to get it to help me with producing the diagram on the previous slide 👌 💯
A comparison of gene counts between treatments, to see which genes differ in expression level. For example, see the plot below for gene counts for a single gene:
At the beginning of week 3, we’ll discuss some related and follow-up material to today’s lecture:
Sequence file types (like the FASTQ format that contains HTS reads)
More details on the Garrigós dataset (and the paper will be reading material for that week)
A more detailed overview of the data processing and analysis steps we will perform on the Garrigós dataset
Modified after Pereira, Oliveira, and Sousa (2020)
…with the advent of HTS.
From Lee (2023)
From Poinsignon et al. (2023)
From Poinsignon et al. (2023)
To infer the sequence of the source DNA despite sequencing errors, every base is sequenced multiple times, i.e. with a so-called “depth of coverage” greater than 1:
This process is complicated by genetic variation among and within individuals.
Typical depths of coverage: ~50-100x for genome assembly; 10-30x for resequencing.
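As a rough sketch of the arithmetic behind depth of coverage: the mean depth is the total number of sequenced bases divided by the genome size. The Python below uses that relationship to estimate how many reads a target depth requires, and shows a naive majority vote across the reads covering one position. The 500 Mbp genome size, the function names, and the majority-vote rule are assumptions for illustration; real tools use quality scores and statistical models, and must also deal with the genetic variation mentioned above.

```python
import math
from collections import Counter


def mean_depth(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Mean depth of coverage = total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_depth: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Number of reads needed to reach a target mean depth (rounded up)."""
    return math.ceil(target_depth * genome_size_bp / read_length_bp)


def consensus_base(bases: str) -> str:
    """Naive consensus: the most frequent base among the reads covering one position."""
    return Counter(bases).most_common(1)[0][0]


# Hypothetical example: 30x resequencing of a 500 Mbp genome with 150 bp reads
n = reads_needed(target_depth=30, read_length_bp=150, genome_size_bp=500_000_000)
print(f"Reads needed: {n:,}")
print(f"Resulting mean depth: {mean_depth(n, 150, 500_000_000):.1f}x")

# Five reads cover one position; one of them carries a sequencing error (G)
print("Consensus base:", consensus_base("AAGAA"))
```

Note that this gives the mean depth only: actual depth varies along the genome, so some regions end up well below the average.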
Konkel and Slot (2023)