Week 3 – Lecture A
CFAES Bioinformatics Core, OSU
2025-09-05
DNA and protein (amino acid) sequences and sequence annotations are mostly stored in plain-text formats. Some of the most common types, which you’ll work with in this course, are:
FASTA files contain one or more DNA or amino acid sequences, with no limits on the number of sequences or the sequence lengths.
The following example FASTA file contains two entries:
>unique_sequence_ID Optional description
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
>unique_sequence_ID2
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG
Each entry contains a header and the sequence itself, where:
>
” and provide identifying information for the sequenceFASTA files most commonly have a .fa
or .fasta
file extension. Also used are .fna
and .faa
, which make explicit that it contains nucleotide or amino acid sequences, respectively.
FASTQ is the standard format for HTS reads. Each read forms one FASTQ entry and is represented by four lines, which contain, respectively:
@
and e.g. uniquely identifies the read+
The quality scores in the reads represent an estimate of the error probability of the base call.
Specifically, they correspond to a numeric “Phred” quality score (\(Q\)), which is a function of the estimated probability that a base call is erroneous (\(P\)):
\[ Q = -10 * log_{10}(P) \]
For example:
\[ 10 = -10 * log_{10}(0.1) \]
\[ 20 = -10 * log_{10}(0.01) \]
The quality scores in the reads represent an estimate of the error probability of the base call.
Specifically, they correspond to a numeric “Phred” quality score (\(Q\)), which is a function of the estimated probability that a base call is erroneous (\(P\)):
\[ Q = -10 * log_{10}(P) \]
Some specific probabilities and their rough qualitative interpretation for Illumina data:
Phred quality score | Error probability | Rough interpretation |
---|---|---|
10 | 1 in 10 | terrible |
20 | 1 in 100 | bad |
30 | 1 in 1,000 | good |
40 | 1 in 10,000 | excellent |
To make things a little more complicated: the numeric Phred quality score is represented in FASTQ files by a “ASCII character” corresponding to the number rather than the number itself.
This allows for a single-character representation of each possible score — therefore, each quality score character can conveniently correspond to (& line up with) a base character in the read.
For example:
Phred quality score | Error probability | ASCII character |
---|---|---|
10 | 1 in 10 | + |
20 | 1 in 100 | 5 |
30 | 1 in 1,000 | ? |
40 | 1 in 10,000 | I |
To make things a little more complicated: the numeric Phred quality score is represented in FASTQ files not by the number itself, but by a corresponding “ASCII character”.
This allows for a single-character representation of each possible score — therefore, each quality score character can conveniently correspond to (& line up with) a base character in the read.
While FASTQ files can be any size, separate files are typically used for each:
Read direction for paired-end sequences
Read direction | File name identifier |
---|---|
Forward | _R1 |
Reverse | _R2 |
Sample
Reads are normally “demultiplexed” into separate files for each sample.
The very similar GTF (.gtf
) and GFF (.gff
) formats are tabular annotation files with:
After a metadata header, the tabular part looks like this (column names added for clarity):
#seqname source feature start end score strand frame attributes
NC_068937.1 Gnomon gene 2046 110808 . + . gene_id "LOC120427725"; transcript_id ""; db_xref "GeneID:120427725"; description "homeotic protein deformed"; gbkey "Gene"; gene "LOC120427725"; gene_biotype "protein_coding";
NC_068937.1 Gnomon transcript 2046 110808 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "mRNA"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA";
NC_068937.1 Gnomon exon 2046 2531 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "1";
NC_068937.1 Gnomon exon 52113 52136 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "2";
Details for a few columns:
Throughout the course, we will use an example/practice data set from Garrigós et al. (2025):
This paper uses RNA-Seq data to study gene expression in Culex pipiens mosquitos infected with malaria-causing Plasmodium protozoans — specifically, it compares mosquitos according to:
The following number of samples (biological replicates) were used for each treatment combination:
Control | P. cathemerium | P. relictum | |
---|---|---|---|
01 DAI | 3 | 3 | 4 |
10 DAI | 4 | 4 | 4 |
21 DAI | 4 | 4 | 3 |
To keep things manageable while practicing, I have subset the data to omit the 21-day samples and only keep 500,000 reads per FASTQ file.
All in all, the set of files, available through GitHub, consists of: