Bioconductor: R for omics data

Week 13 – lecture B

Authors
Affiliation

Menuka Bhandari

Jelmer Poelstra

Published

November 15, 2025



1 Introduction

R packages are available in many places. In our previous lectures, we learnt to install R packages through CRAN using install.packages() function. However, not all packages are available in CRAN. In this lecture, we will learn about Bioconductor, a specialized repository for bioinformatics packages in R that focuses on analysis of high-throughput genomic data. Bioconductor was initially developed for the analysis of microarray data. Now, its packages are used to analyze a wide range of omics data types such as bulk and single-cell RNA-seq, Chip seq, copy number analysis, micro array methylation flow cytometry, and many other domains. One of the key requirement of Bioconductor packages is to have high level documentation,and workflows. The packages available in Bioconducter are provided here.

1.1 Overview & learning goals

In this lecture, you will learn :

  • Different packages available in Bioconductor
  • How to install Bioconductor packages
  • Data structure of Bioconductor
  • Use DECIPHER package to manipulate DNA sequences and align the genomic sequences

Specifically, you will learn:

  • Install package using BiocManager::install() function
  • Align the genomic sequences

2 Set up and Data preparation

2.1 Open a new Quarto document in Rstudio

Open a new Quarto file File > New File > Quarto Document and save it inside the week13 directory that we created on Tuesday as week13b_bioconductor.

3 Packages available in Biocondctor

There are 4 different types of packages available in Bioconductor:

  1. Software packages (2361) : further divided into two categories infrastructure and methodological packages. Infrastructure packages define how the biological data are stored and accessed. For example, Biostrings: tools for manipulating DNA and RNA sequences. They do not perform the analysis themselves but provide standardized data structures that other packages can use. Methodological packages provide tools and methods to analyze or process the data. For instance, DECIPHER: sequence analysis and manipulation, Deseq2: differential expression analysis are Methodological packages.

  2. Annotation data packages (928) : these are self-contained databases that store genomic annotations (e.g., gene identifiers, biological pathways). They act like reference libraries for biological data. Instead of manually downloading gene IDs or pathway information from external sources, these packages provide ready-to-use annotation data directly in R.For example, OrgDb packages are species-specific annotation databases in Bioconductor. The package org.Hs.eg.db (for humans) maps different types of gene identifiers (e.g., Entrez IDs, gene symbols) and links them to biological pathways and Gene Ontology (GO) terms.

  3. Experiment data packages (435) : contains example datasets mainly used by developers.

  4. Workflows (29) : provide step-by-step guides for common bioinformatics analyses. They do not add new code or functions but demonstrate how to use multiple Bioconductor packages together. For instance, rnaseqGene - RNA-seq workflow package shows how to go from raw FASTQ files, alignment, differential expression analysis, visualization, using several packages together.

Tip

Bioconductor releases a new version every 6 months. Each Bioconductor version is linked to a specific version of R. This is different from CRAN, where packages are added continuously without reference to specific R versions. The Bioconductor website provides guidance on installing packages, creating reproducible documents, and promoting better software engineering practices. The latest version of Bioconductor is 3.22 (released on October 30, 2025), which is is compatible with R 4.5.

  • Bioconductor’s installation function will look up your version of R and give you the appropriate versions of Bioconductor packages.

  • If you want the latest version of Bioconductor, you need to use the latest version of R - Michael Love

4 Install Bioconductor packages

Similar to CRAN, Bioconductor packages needs to be installed once. However, packages are installed slightly differently in Bioconductor. To install Bioconductor packages, you first need to install the BiocManager package from CRAN (if you haven’t already done so). You can then use BiocManager::install() to install specific Bioconductor packages.

4.1 Install BiocManager

We need to install BiocManager through which we can install Bioconductor packages. The command to install BiocManager is provided below:

# Install BiocManager only if it was not installed before
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
# Install BiocManager
install.packages("BiocManager")
# Check BiocManager version
BiocManager::version()

4.2 Install Bioconductor package using BiocManager

Once the BiocManager is installed, use the following command to install DECIPHER package of Bioconductor which we will use later for the sequnce manipulation and sequence alignment. There are 2,361 software packages available in the Bioconductor.

# Install packages
BiocManager::install("DECIPHER")

4.3 Load the packages

You should be an expert now in loading the packages in R since we have loaded so many packages since week11. Use the library() function to load the Bioconductor package in a similar way as CRAN packages.

# Load the DECIPHER package
library(DECIPHER)
Loading required package: Biostrings
Loading required package: BiocGenerics
Loading required package: generics

Attaching package: 'generics'
The following objects are masked from 'package:base':

    as.difftime, as.factor, as.ordered, intersect, is.element, setdiff,
    setequal, union

Attaching package: 'BiocGenerics'
The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs
The following objects are masked from 'package:base':

    anyDuplicated, aperm, append, as.data.frame, basename, cbind,
    colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
    get, grep, grepl, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, saveRDS, table, tapply, unique,
    unsplit, which.max, which.min
Loading required package: S4Vectors
Loading required package: stats4

Attaching package: 'S4Vectors'
The following object is masked from 'package:utils':

    findMatches
The following objects are masked from 'package:base':

    expand.grid, I, unname
Loading required package: IRanges

Attaching package: 'IRanges'
The following object is masked from 'package:grDevices':

    windows
Loading required package: XVector
Loading required package: GenomeInfoDb

Attaching package: 'Biostrings'
The following object is masked from 'package:base':

    strsplit

Every package in Bioconductor has a reference manual. The reference manual for DECIPHER can be found here. It includes a complete list of all functions in the package, syntax for each function, descriptions of arguments, available options and the examples that can be run using the package’s built-in datasets.

# View the inbuilt datasets of DECIPHER
data(package = "DECIPHER")
data("BLOSUM")
str(BLOSUM)
 num [1:24, 1:24, 1:15] 2.4 -0.6 0 0 -1.8 0.6 0 0 -1.2 0 ...
 - attr(*, "dimnames")=List of 3
  ..$ : chr [1:24] "A" "R" "N" "D" ...
  ..$ : chr [1:24] "A" "R" "N" "D" ...
  ..$ : chr [1:15] "30" "35" "40" "45" ...
Package movement from CRAN and Bioconductor

Packages can become available on CRAN after being available in Bioconductor. vcfR package, which allows easy manipulation and visualization of variant call format (VCF) data, was first available only in Bioconductor. Later, the package evolved to be self-contained, not requiring Bioconductor classes, making CRAN a better fit for broader use. BiocManager::install() will also install packages from CRAN. It acts as a wrapper for install.packages()

4.4 Find the package of interest- biocViews

Bioconductor uses biocViews to organize and classify packages, so, that we can find packages easily on the Bioconductor website. biocViews includes a controlled vocabulary to categorize Bioconductor packages. Package authors choose the tags when they submit their package. Tags are later reviewed and refined by Bioconductor maintainers. You can know more about the packages by searching here For example, if you search for “RNASeq” on Bioconductor, you will find all packages tagged with that term.

Rank column usually indicates the popularity or relevance of the package within Bioconductor, based on download statistics, usage in workflows, community interest. edgeR and DESeq2 are two most commonly used package for the differential expression analysis of RNA-seq dataset. They are ranked 23 and 25, respectively, meaning both packages are among the most downloaded in the Bioconductor project, with edgeR having a slightly more popularity.

5 Bioconductor objects can be complex

We discussed in week11 that anything in R is an object. Objects in Bioconductor are more complex than vector or matrix. Bioconductor does not contain single data structure. They contain multiple attributes (e.g., sequence data, metadata, annotations). They are specialized S4 objects designed to store biological data efficiently and with metadata. Each object comes with methods—specific functions to access information inside the object, run analyses, and visualize data. Accessor functions (e.g., length(), width(), metadata()) are needed to retrieve specific information from an object

Some examples of Bioconductor objects

  • Biostrings objects (DNA, RNA, AA sequences) : Used for manipulating sequence strings efficiently

  • SummarizedExperiment (SE): Used in RNA-seq analysis (We will discuss about it in more detail in week14)

An SE object consists of three main components:

  1. assays : A matrix (or list of matrices) of experimental data. Rows represents genomic features (e.g., genes), and columns represents samples. You generate this data from the sequencing machine.

  2. colData : A table of metadata about the samples (columns). Example: sample IDs, conditions, batch info. You generate this from your experiment.

  3. rowData (or rowRanges) :A table of metadata about the features (rows). Example: gene IDs, genomic coordinates, annotations. You generate this data from Ensembel or any other databases.

  • Phylogenetic Tree Objects: From DECIPHER or ape package.

6 Bioconductor packages examples

6.1 Biostrings

It is a core package of DECIPHER. When you install DECIPHER, Bioconductor will automatically install Biostrings and any other required dependencies if they are not already present. Biostrings is used to read, write, search, and manipulate DNA, RNA, or amino acid sequences in which sequences are stored in XStringSet objects. Amino acid sequences are stored in the class AAStringSet, nucleotide sequences are stored in the class DNAStringSet or RNAStringSet. reference manual of Biotsring can be found here

# Using standard R to create a character vector of sequences
sequences <- c("AAATCGA", "ATACAACAT", "TTGCCA")
# Display the sequences
sequences
# Get the number of sequences
length(sequences)
# Get the number of characters in each sequence
nchar(sequences)
# Select the 1st and 3rd sequences
sequences[c(1, 3)]
# Try to get the reverse complement
reverseComplement(sequences)

Base R can store sequences as simple character strings and perform basic operations like length and indexing. However, it does not understand biological meaning. Therefore, operations like reverse complement will fail.

# Load Biostrings package, if you have loaded DECIPHER before
library(Biostrings)
# Convert sequences into a DNAStringSet object
dna <- DNAStringSet(sequences)
# Display the DNAStringSet
dna
# Get the number of sequences
length(dna)
# Get the number of characters in each sequence
nchar(dna)
# Select the 1st and 3rd sequences
dna[c(1, 3)]
# Compute reverse complements (works because Biostrings understands DNA)
reverseComplement(dna)

The Biostrings package provides specialized classes like DNAStringSet that understand biological sequences. This allows us to perform biologically meaningful operations easily.

Tip

There are three different object system in R: S3 , S4 and Reference classes. Object becomes S3 by assigning class. S3 object system is simple but powerful. Since, Bioconductor uses S4 object-oriented programming in R, we will focus more in S4 class. S4 classes have a formal structure, and three important components are:

  • Name: The name of the class (e.g., DNAStringSet, XStringSet). It tells you what type of object you are working with. Use class() to check the class name of an object.

  • Slots: These are named compartments inside the object that store data. Each slot holds specific information (e.g., sequences, metadata). Use slotNames() to list all slots in an object.

  • Contains: This shows inheritance which is a parent classes the current class extends. For example, DNAStringSet contains XStringSet, meaning it inherits its behavior and adds DNA-specific features.

Understanding the class structure the class structure helps you inspect and manipulate data correctly.

# Check the class of the object
class(dna)
# List all slots in the object
slotNames(dna)
# Show the class definition (slots, inheritance)
getClass("DNAStringSet")
# Display the internal structure of the object
str(dna)

Let’s read our own fastq file using the Biostring package

# Path to your FASTQ.gz file
file_path <- "garrigos-data/fastq/ERR10802863_R1.fastq.gz"
# Read the FASTQ file as a DNAStringSet object
dna_sequences <- readDNAStringSet(filepath = file_path, format = "fastq")
# Extract first 5 sequences
dna_sequences[1:5]
# Extract first 10 sequences and assign it to object
top_ten_seq <- dna_sequences[1:10]  
# Write to a new FASTA file
writeXStringSet(x = top_ten_seq, filepath = "top10_sequences.fasta", format = "fasta")

6.2 DECIPHER

DECIPHER is an R/Bioconductor package that provides tools for working with biological sequences. It is designed to help researchers curate, analyze, and manipulate sequence data efficiently. The package supports a wide range of tasks:

  • Sequence databases: Import, maintain, view, and export very large collections of sequences. Instead of loading all sequences into memory at once (which would require a huge amount of RAM), DECIPHER uses databases to store and manage them. This approach makes it possible to analyze millions of sequences without overwhelming system resources.

  • Sequence alignment: accurately align thousands of DNA, RNA, or amino acid sequences. Quickly find and align the syntenic regions of multiple genomes.

  • Oligo design: test oligos in silico, or create new primer and probe sequences optimized for a variety of objectives.

  • Manipulate sequences: trim low quality regions, correct frameshifts, reorient nucleotides, determine consensus, or digest with restriction enzymes.

  • Analyze sequences: find chimeras, classify into a taxonomy of organisms or functions, detect repeats, predict secondary structure, create phylogenetic trees, and reconstruct ancestral states.

  • Gene finding: predict coding and non-coding genes in a genome, extract them from the genome, and export them to a file.

# For quick reference to the DECIPHER package help pages
?DECIPHER
# To view detailed tutorials and step-by-step documentation (vignettes)
browseVignettes("DECIPHER")
# Opens up web page showing all available vignettes for the DECIPHER package installed on your system.

Here, we will use a DECIPHER package for the alignment of the sequences. As the number of genomes for the alignment increases, the computational time increases as well. It is a challenge to improve the performance when the number of sequences are added. However, DECIPHER can align 100 of sequences together accurately in a short time.

Multiple Sequence Alignment can show several things: - Is the regions conserved (syntenic regions) - Is there insertion or deletion?

# Align the sequences
aligned <- AlignSeqs(top_ten_seq)
# View the alignment in your console
aligned
# Opens an interactive viewer in your browser
BrowseSeqs(aligned)
# Write aligned object to a FASTA
writeXStringSet(aligned, filepath = "aligned_sequences.fasta", format = "fasta")

7 Recap and next steps

  • Bioconductor and its packages
  • Install packages in Bioconductor
  • Manipulate and align sequence using DECIPHER
Back to top