Analyze microbiome experimental data as a phyloseq object - explore ecological metrics and identify differentially abundant taxa.
This is a slightly modified version of a document created by Matthew Willman.
To convert an Rmd
(R Markdown) file to an R script, type the following in an R console: knitr::purl(input="<filename>.Rmd")
. (You can download this Rmd
file by clicking the Code
button in the top-right of this page – but we will open it at OSC.)
This assumes you still have an active RStudio Server job at OSC; if not see the instructions on the previous page to start one.
In the Files pane, in the markdown
directory, open the file 08-postASV-analysis.Rmd
. That’s this file!
Add our custom library again if you restarted your R session:
And load the necessary packages:
packages <- c('tidyverse', 'vegan', 'phyloseq', 'decontam', 'ape',
'DESeq2', 'microbiome', 'metagenomeSeq', 'remotes',
'ampvis2', 'breakaway')
pacman::p_load(char = packages)
# If you wanted to install and load these packages yourself,
# first make sure you have the pacman package installed:
## install.packages('pacman')
# Then, the code above would work except for the last two packages,
# which are available from Github only. To install these, you would run:
## remotes::install_github("MadsAlbertsen/ampvis2")
## remotes::install_github("adw96/breakaway")
This session starts with a phyloseq object similar to the one generated using DADA2 in the previous session.
ps_raw <- readRDS(file.path(indir, 'ps_16S_V4_withtree.rds'))
# This would load the object from the previous session, which doesn't contain a tree:
# ps_raw <- readRDS(file.path(indir, 'ps_V4.rds'))
Our phyloseq object is made up of the following components (slots):
otu_table
sample_data
tax_table
phy_tree
Let’s have a look at the object:
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 18038 taxa and 75 samples ]
## sample_data() Sample Data: [ 75 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 18038 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 18038 tips and 18036 internal nodes ]
## [1] "TACGAAGGGGGCTAGCGTTGCTCGGAATTACTGGGCGTAAAGGGAGCGTAGGCGGACATTTAAGTCAGGGGTGAAATCCCGGGGCTCAACCTCGGAATTGCCTTTGATACTGGGTGTCTTGAGTATGAGAGAGGTGTGTGGAACTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGACACACTGGCTCATTACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG"
## [2] "TACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTATGTAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGCATGGCTAGAGTACGGTAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGACCTGTACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG"
## [3] "TACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGGCAGGCTAGAGTATGGCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCCCCTGGGCCAATACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG"
As we can see, the taxon names are currently the associated sequences. We can create a new phyloseq component to store these sequences, then rename the ASVs so something shorter (“ASV_1”, “ASV_2”, etc.).
dna <- Biostrings::DNAStringSet(taxa_names(ps_raw))
names(dna) <- taxa_names(ps_raw)
# Merge sequence object into the phyloseq object:
ps_raw <- merge_phyloseq(ps_raw, dna)
ps_raw
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 18038 taxa and 75 samples ]
## sample_data() Sample Data: [ 75 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 18038 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 18038 tips and 18036 internal nodes ]
## refseq() DNAStringSet: [ 18038 reference sequences ]
# Rename ASVs:
taxa_names(ps_raw) <- paste("ASV", 1:ntaxa(ps_raw), sep = "_")
taxa_names(ps_raw)[1:3]
## [1] "ASV_1" "ASV_2" "ASV_3"
It is possible to introduce contaminating microbes during sample preparation. Before analyzing the data, we will identify and remove probable contaminants using the decontam package.
In this case, we will define a conataminant as an ASV whose abundance correlates with DNA concentration (post-PCR). Our assumption here is that each soil taxon’s abundance should be independant of DNA concentration. However, if we were to spike each sample with a contaminant, that contaminant would show a greater abundance in samples with lower DNA concentrations. To do this, we need DNA concentration measured after PCR and before pooling of the samples.
These data are in column 7 of the sample data component:
## SampleID Experiment Treatment Timepoint Block Plot PCR_prod_V4.V5
## 101-S1 101-S1 HCC T1 S1 R1 P101 0.464003
## 101-S3 101-S3 HCC T1 S3 R1 P101 1.000000
## 101-S4 101-S4 HCC T1 S4 R1 P101 0.629419
## 102-S3 102-S3 HCC T4 S3 R1 P102 0.925081
## 102-S4 102-S4 HCC T4 S4 R1 P102 0.746320
## 103-S1 103-S1 HCC T2 S1 R1 P103 0.556443
MCIC measured our DNA concentration by comparing band intensities to a single reference band. MCIC’s DNA concentration measurements range from 0 to 1 (the reference). Zero in this case is coded as NA
. The decontam package can’t handle NA’s or 0’s, so we need to change them to a small value (e.g. 0.01).
# First, we identify the NAs in the DNA measurement column:
NAs <- is.na(sample_data(ps_raw)$PCR_prod_V4.V5)
# Then, we replace NAs with 0.01:
sample_data(ps_raw)$PCR_prod_V4.V5[NAs] <- 0.01
We will use decontam::isContaminant(method = "frequency")
to test each taxon against the null hypothesis: abundance is not associated with DNA concentration. Because the DNA concentration measurements are not very accurate, and we want to make sure to remove possible contaminants, we use a relaxed p-value (0.2) to reject the null hypothesis:
contam_df_freq <- isContaminant(ps_raw,
method = "frequency",
conc = "PCR_prod_V4.V5",
threshold = 0.2)
head(contam_df_freq)
## freq prev p.freq p.prev p contaminant
## ASV_1 0.017740963 73 0.14200437 NA 0.14200437 TRUE
## ASV_2 0.013007310 73 0.08132858 NA 0.08132858 TRUE
## ASV_3 0.008321727 71 0.13124199 NA 0.13124199 TRUE
## ASV_4 0.006168516 72 0.94705975 NA 0.94705975 FALSE
## ASV_5 0.005170904 72 0.99239427 NA 0.99239427 FALSE
## ASV_6 0.005774690 37 0.62439728 NA 0.62439728 FALSE
As we can see, the top 3 ASVs are identified as probable contaminants. How many contaminants were identified (TRUE
in the contaminant
column) in total?
##
## FALSE TRUE
## 17800 238
What are the most abundant contaminant taxa?
## [1] 1 2 3 11 27 149
# Check which taxa are contaminants:
ps_contam <- prune_taxa(contam_df_freq$contaminant, ps_raw)
head(tax_table(ps_contam))
## Taxonomy Table: [6 taxa by 6 taxonomic ranks]:
## Kingdom Phylum Class
## ASV_1 "Bacteria" "Proteobacteria" "Alphaproteobacteria"
## ASV_2 "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## ASV_3 "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## ASV_11 "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## ASV_27 "Bacteria" "Actinobacteria" "Actinobacteria"
## ASV_149 "Bacteria" "Proteobacteria" "Alphaproteobacteria"
## Order Family
## ASV_1 "Caulobacterales" "Caulobacteraceae"
## ASV_2 "Betaproteobacteriales" "Burkholderiaceae"
## ASV_3 "Betaproteobacteriales" "Burkholderiaceae"
## ASV_11 "Betaproteobacteriales" "Burkholderiaceae"
## ASV_27 "Micrococcales" "Micrococcaceae"
## ASV_149 "Rhizobiales" "Xanthobacteraceae"
## Genus
## ASV_1 "Brevundimonas"
## ASV_2 "Delftia"
## ASV_3 "Burkholderia-Caballeronia-Paraburkholderia"
## ASV_11 "Delftia"
## ASV_27 "Renibacterium"
## ASV_149 NA
Let’s take a look at our tested associations for several taxa:
plot_frequency(ps_raw,
taxa_names(ps_raw)[c(1, 2, 3, 4, 5, 11)],
conc = "PCR_prod_V4.V5") +
xlab("DNA Concentration (relative to single marker)")
As we can see, abundance of ASVs 1, 2, 3, and 11 are particularly high in samples with low DNA concentration.
Which samples have highest abundances of the contaminants?
## OTU Table: [1 taxa and 6 samples]
## taxa are columns
## ASV_1
## 204-S3 6710
## blankM 5944
## 103-S3 4337
## 301-S3 3032
## 104-S3 1943
## 304-S3 1153
## OTU Table: [1 taxa and 6 samples]
## taxa are columns
## ASV_2
## 204-S3 6261
## blankM 4403
## 103-S3 3755
## 301-S3 1707
## 104-S3 683
## 304-S3 664
How do these contaminants overlap with ASVs present in the negative control?
## OTU Table: [11 taxa and 6 samples]
## taxa are columns
## ASV_1 ASV_2 ASV_3 ASV_4 ASV_5 ASV_6 ASV_7 ASV_8 ASV_9 ASV_10 ASV_11
## 604-S1 25 24 25 492 136 0 139 350 137 139 5
## 604-S2 31 10 11 183 108 21 159 189 79 251 3
## 604-S3 127 72 99 212 260 3 131 132 123 141 18
## 604-S4 46 32 6 101 251 0 54 54 73 66 30
## blankM 5944 4403 169 0 0 0 0 0 0 0 1476
## H20 0 0 0 0 0 0 9 0 0 0 0
ASVs have high read counts in the extraction negative control. Thus, we can be fairly certain these are contaminants and not representative of our experimental samples.
Remove the contaminants that we identified:
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 17800 taxa and 75 samples ]
## sample_data() Sample Data: [ 75 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 17800 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 17800 tips and 17798 internal nodes ]
## refseq() DNAStringSet: [ 17800 reference sequences ]
What proportion of our count data were removed as contaminants?
## [1] 0.03797457
The V4 region of 16S rRNA is conserved within certain bacteria, archaea, chloroplasts, mitochondria, and eukaryotes. Since we are only interested in bacteria and archaea here, we’ll need to remove other taxa from our object.
# Create ps subsets for chloroplast, mitochondria, and eukaryotes:
chlr <- subset_taxa(ps_noncontam, Order == "Chloroplast")
mit <- subset_taxa(ps_noncontam, Family == "Mitochondria")
euk <- subset_taxa(ps_noncontam, Kingdom == "Eukaryota")
# Get all taxon IDs that we want to remove:
bad_taxa <- c(taxa_names(chlr), taxa_names(mit), taxa_names(euk))
# Get all taxon IDs that we want to keep:
all_taxa <- taxa_names(ps_noncontam)
good_taxa <- all_taxa[!(all_taxa %in% bad_taxa)]
# Subset phyloseq object:
ps_bac_arc <- prune_taxa(good_taxa, ps_noncontam)
What proportion of ASVs were kept?
## [1] 0.9919178
Finally, we’ll also remove non-experimental samples (e.g. negative controls). Note, though, that non-experimental samples are important for evaluating the quality of your reagents, consistency across sequencing runs, and contamination.
## phyloseq-class experiment-level object
## otu_table() OTU Table: [ 17600 taxa and 73 samples ]
## sample_data() Sample Data: [ 73 samples by 7 sample variables ]
## tax_table() Taxonomy Table: [ 17600 taxa by 6 taxonomic ranks ]
## phy_tree() Phylogenetic Tree: [ 17600 tips and 17598 internal nodes ]
## refseq() DNAStringSet: [ 17600 reference sequences ]
After having filtered out unwanted taxa, we now need to filter out uninformative samples – those with low total taxon counts.
First, how many counts do we have for each sample?
## 403-S3 203-S3 103-S3 204-S3 104-S3 301-S1 301-S3 204-S1 503-S4 603-S3 302-S3
## 0 3258 4157 7687 7841 13441 14860 14897 15004 19423 20815
## 403-S1 502-S1 201-S3 402-S1 202-S1 603-S1 604-S4 303-S1 504-S4 404-S4 604-S3
## 21405 23248 23283 23283 24329 25145 25568 25708 25913 25948 26196
## 601-S4 203-S1 503-S2 102-S3 403-S4 402-S4 601-S1 201-S4 504-S1 401-S3 401-S4
## 26259 26598 26821 27558 28119 28340 28408 29079 29106 29306 29375
## 503-S1 204-S4 502-S4 303-S3 301-S4 204-S2 304-S4 403-S2 401-S1 202-S4 604-S1
## 29465 29882 30181 30201 30229 30575 31209 31351 31594 31978 32157
## 501-S4 603-S4 202-S3 604-S2 304-S3 104-S1 501S4 103-S2 301-S2 103-S4 303-S4
## 32480 32577 33410 33618 33689 33852 33951 34076 34557 34714 35339
## 404-S3 203-S4 602-S4 101-S3 402-S3 302-S4 501-S3 104-S4 103-S1 102-S4 502-S3
## 36038 36261 36653 36786 36799 38114 38300 38725 38928 40180 40416
## 304-S1 101-S1 504-S3 602-S3 101-S4 601-S3 503-S3
## 41181 50997 52898 54757 56340 57665 67738
## [1] 0
## [1] "403-S3"
To remove uninformative samples, we will only keep those with over 1,000 counts. We will also remove a technical replicate, though note that technical replicates are important for quality control of runs and for understanding the reproducibility of the methodology.
# Remove samples with low counts:
ps <- subset_samples(ps, sample_sums(ps) > 1000)
# Remove the technical replicate:
ps <- subset_samples(ps, SampleID != "501S4")
Let’s take a look at our current phyloseq
object:
## SampleID Experiment Treatment Timepoint Block Plot PCR_prod_V4.V5
## 101-S1 101-S1 HCC T1 S1 R1 P101 0.464003
## 101-S3 101-S3 HCC T1 S3 R1 P101 1.000000
## 101-S4 101-S4 HCC T1 S4 R1 P101 0.629419
## 102-S3 102-S3 HCC T4 S3 R1 P102 0.925081
## 102-S4 102-S4 HCC T4 S4 R1 P102 0.746320
## 103-S1 103-S1 HCC T2 S1 R1 P103 0.556443
## 103-S2 103-S2 HCC T2 S2 R1 P103 0.741777
## 103-S3 103-S3 HCC T2 S3 R1 P103 0.155553
## 103-S4 103-S4 HCC T2 S4 R1 P103 0.768266
## 104-S1 104-S1 HCC T3 S1 R1 P104 0.742849
## OTU Table: [6 taxa and 10 samples]
## taxa are rows
## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1 103-S2 103-S3 103-S4 104-S1
## ASV_4 556 306 241 128 122 343 493 5 81 375
## ASV_5 227 306 255 254 185 189 259 30 155 190
## ASV_7 98 165 93 278 101 130 286 41 84 116
## ASV_8 423 187 188 133 129 197 329 0 65 257
## ASV_9 153 196 260 153 119 124 162 9 66 124
## ASV_10 192 192 319 72 186 73 69 1 153 119
## Taxonomy Table: [6 taxa by 6 taxonomic ranks]:
## Kingdom Phylum Class
## ASV_4 "Bacteria" "Acidobacteria" "Blastocatellia_(Subgroup_4)"
## ASV_5 "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## ASV_7 "Bacteria" "Proteobacteria" "Alphaproteobacteria"
## ASV_8 "Bacteria" "Acidobacteria" "Blastocatellia_(Subgroup_4)"
## ASV_9 "Bacteria" "Nitrospirae" "Nitrospira"
## ASV_10 "Bacteria" "Bacteroidetes" "Bacteroidia"
## Order Family Genus
## ASV_4 "Pyrinomonadales" "Pyrinomonadaceae" "RB41"
## ASV_5 "Betaproteobacteriales" "SC-I-84" NA
## ASV_7 "Rhizobiales" "Xanthobacteraceae" "Pseudolabrys"
## ASV_8 "Pyrinomonadales" "Pyrinomonadaceae" "RB41"
## ASV_9 "Nitrospirales" "Nitrospiraceae" "Nitrospira"
## ASV_10 "Chitinophagales" "Chitinophagaceae" NA
Normalization is currently a much-discussed issue of microbiome studies. Differences in read depth between samples often need to be corrected before analysis. Several normalization methods have been proposed, and no single method is perfect. It may be that the most appropriate method depends on the analysis.
In this tutorial, we will use:
Proportion-normalized data to estimate ecological metrics.
Stabilizing transformation normalized data to identify differentially abundant taxa.
However, for your reference, we will also present code for two other normalization methods: rarefaction and cumulative sum scaling.
For more details, read McMurdie and Holmes 2014 and McKnight et al. 2019.
Proportion normalization involves dividing each OTU count by the total sum for each sample. The resulting count data will add up to 1 (100%) for each sample.
The microbiome::transform
function can be used to easily normalize count data as proportions in a phyloseq object:
# Proportion normalization:
ps_prop <- transform(ps, "compositional")
# Have a look at the resulting OTU table:
otu_table(ps_prop)[1:5, 1:5]
## OTU Table: [5 taxa and 5 samples]
## taxa are columns
## ASV_4 ASV_5 ASV_7 ASV_8 ASV_9
## 101-S1 0.010902602 0.004451242 0.001921682 0.008294606 0.003000176
## 101-S3 0.008318382 0.008318382 0.004485402 0.005083456 0.005328114
## 101-S4 0.004277600 0.004526092 0.001650692 0.003336883 0.004614838
## 102-S3 0.004644749 0.009216924 0.010087815 0.004826185 0.005551927
## 102-S4 0.003036336 0.004604281 0.002513688 0.003210553 0.002961672
## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1
## 1 1 1 1 1 1
This transformation method, available in DESeq2, normalizes data with respect to library size as well as dispersion.
Type ?vst
or read Anders and Huber 2010 for more information.
# First we convert the phyloseq object to a DESeq object:
dds <- phyloseq_to_deseq2(ps, ~1)
# Then we estimate the size factors:
dds <- estimateSizeFactors(dds)
# Now can do the transformation:
dds_vst <- vst(dds, blind = TRUE)
# Have a look at the resulting counts:
vst_counts <- assay(dds_vst)
t(vst_counts)[1:5, 1:5]
## ASV_4 ASV_5 ASV_7 ASV_8 ASV_9
## 101-S1 8.002095 6.729126 5.559152 7.611915 6.175558
## 101-S3 7.459397 7.459397 6.585002 6.761345 6.827700
## 101-S4 7.388440 7.468770 6.047047 7.035948 7.496408
## 102-S3 6.756474 7.729205 7.858050 6.810556 7.008636
## 102-S4 7.108518 7.700594 6.841187 7.187648 7.073230
Rarefaction can be used to subset data such that the library depth is the same for each sample. Because sampling of the data is random, rarefaction can account for an effect of total read count on taxa richness. However, rarefaction is no longer considered to be a good way to normalize amplicon sequencing data (see McMurdie & Holmes 2014).
ps_rarefied <- rarefy_even_depth(ps,
rngseed = 1,
sample.size = min(sample_sums(ps)),
replace = FALSE)
# Have a look at the result:
otu_table(ps_rarefied)[1:5, 1:5]
## OTU Table: [5 taxa and 5 samples]
## taxa are columns
## ASV_4 ASV_5 ASV_7 ASV_8 ASV_9
## 101-S1 24 13 12 32 7
## 101-S3 26 22 20 9 19
## 101-S4 15 16 6 12 17
## 102-S3 15 26 33 19 11
## 102-S4 6 15 5 9 18
## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1
## 3258 3258 3258 3258 3258 3258
Compare these sample sums to the non-rarified data sums:
## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1
## 50997 36786 56340 27558 40180 38928
The metagenomeSeq Cumulative Sum Scaling (CSS) normalization is another option developed for microbiome data. For more information, read Paulson et al. 2013.
# Convert the phyloseq object to a metagenomeseq object:
mgs_css <- phyloseq_to_metagenomeSeq(ps)
# Perform the Cumulative Sum Scaling:
mgs_css <- cumNorm(mgs_css)
# Extract the counts and add them to a separate phyloseq object:
css_counts <- MRcounts(mgs_css, norm = TRUE)
ps_css <- ps
otu_table(ps_css) <- otu_table(t(css_counts), taxa_are_rows = FALSE)
Now lets compare the original data to the CSS normalized data:
## OTU Table: [5 taxa and 5 samples]
## taxa are columns
## ASV_4 ASV_5 ASV_7 ASV_8 ASV_9
## 101-S1 556 227 98 423 153
## 101-S3 306 306 165 187 196
## 101-S4 241 255 93 188 260
## 102-S3 128 254 278 133 153
## 102-S4 122 185 101 129 119
## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1
## 50997 36786 56340 27558 40180 38928
## OTU Table: [5 taxa and 5 samples]
## taxa are columns
## ASV_4 ASV_5 ASV_7 ASV_8 ASV_9
## 101-S1 86.05479 35.13388 15.16793 65.46974 23.68054
## 101-S3 51.81172 51.81172 27.93769 31.66272 33.18659
## 101-S4 37.81579 40.01255 14.59281 29.49945 40.79711
## 102-S3 23.92971 47.48551 51.97233 24.86446 28.60348
## 102-S4 19.34052 29.32784 16.01141 20.45022 18.86493
## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1
## 7893.051 6228.581 8840.421 5151.991 6369.689 6386.874
In this and following the examples we will use the proportions data.
# Subset to the T1 treatment
T1 <- subset_samples(ps_prop, Treatment == "T1")
# Agglomerate taxa to phylum level:
T1_phylum <- tax_glom(T1, taxrank = "Phylum", NArm = FALSE)
# Remove uncommon phyla:
T1_phylum <- subset_taxa(T1_phylum, taxa_sums(T1_phylum) > 0.1)
# Plot:
plot_bar(T1_phylum, fill = "Phylum") +
facet_wrap(~Timepoint, scales = "free_x", nrow = 1)
Alpha diversity measures taxonomic diversity within a single population. Measures of alpha diversity include taxonomic richness (i.e. number of taxa), and indices which combine taxonomic richness with some measure of evenness.
Note: Many of these methods are greatly influenced by singleton data (i.e. taxa represented by a single count), meaning they may be unreliable if your data excludes singletons. DADA2 does not output singletons if it is run individually on each sample. Thus, if using dada2 to infer ASVs, it is advisable to use pool = TRUE
when running the dada algorithm. This will allow calling per-sample singletons, but not per-study singletons.
Our goal in defining richness is to determine the number of unique taxa within each population (in our case, each plot). This is difficult, as our samples contain only a portion of the total richness within our plots. As sequence depth increases for each sample, so does the number of taxa. This can be illustrated with a rarefaction plot.
To make a rarefaction plot, we draw random samples from our data and count the number of unique ASVs as samples are drawn. The resulting rarefaction curve is expected to rise quickly then plateu as the most abundant taxa are represented. We can make a quick rarefaction curve plot directly from our phyloseq
object of all samples using the vegan
package:
We can also split rarefaction curves by group using the ampvis2
package.
metadata <- data.frame(sample_data(ps), check.names = FALSE)
asv_table <- data.frame(t(otu_table(ps)),
tax_table(ps),
check.names = FALSE)
ampvis2
requires a Species
column, which must be added to our data. If your table already has a Species
column, skip this step:
Load an ampvis2
object and calculate rarefaction:
ps3 <- amp_load(asv_table, metadata)
rar_plot <- amp_rarecurve(ps3,
stepsize = 100,
facet_by = "Timepoint",
color_by = "Treatment") +
ylab("Number of observed ASVs")
rar_plot
As we can see, as read depth increases for each sample, so does taxonomic richness. In all samples, increased sampling would result in increased richness. Thus, our observed sample richness is lower than the true population richness for every plot.
Further, sequence depth and total richness appear to be correlated across samples:
dp_by_asv <- data.frame(ct_sums = sample_sums(ps),
asvs = rowSums(otu_table(ps) != 0))
ggplot(data = dp_by_asv) +
geom_point(aes(x = ct_sums, y = asvs)) +
labs(x = "Sequence depth", y= "ASV count") +
theme_minimal()
We can test whether this correlation is significant:
## Warning in cor.test.default(dp_by_asv$ct_sums, dp_by_asv$asvs, method =
## "spearman"): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: dp_by_asv$ct_sums and dp_by_asv$asvs
## S = 17650, p-value = 7.412e-12
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## 0.7040636
Here we observe a significant correlation between read count and observed taxonomic richness.
Thus, a sample from a population with high actual diversity, but low read depth, could have low observed richness. Fortunately, there are methods to estimate the number of unobserved taxa. For an overview of these methods, read Bunge et al. 2014.
To illustrate the difficulty of estimating population richness and importance of singletons to alpha diversity metrics, we’ll look at the frequency count distribution for our first sample: 101-S1.
We’ll count the number of singletons, doubletons, etc., then plot frequency as a function of count.
# Get the frequency count distribution for the 1st sample:
frequencytablelist <- build_frequency_count_tables(otu_table(ps))
freq_ct <- frequencytablelist[[1]]
colnames(freq_ct) <- c("Frequency", "Count")
# Plot:
ggplot(freq_ct, aes(Frequency, Count)) +
geom_point() +
scale_x_continuous(breaks = seq(0, 600, by = 50)) +
ggtitle("101-S1") +
theme_minimal()
We can see there are ~3,200 taxa represented as singletons, and ~1,600 taxa as doubletons. This is a high proportion of the total number taxa are observed in this sample:
## [1] 8646
## [1] 0.3693037
How many more taxa might we observe if we increased our sampling?
We can use several indices to explore how evenly species are distributed within each sample. Note that ‘Observed’ is the sample taxonomic richness. The other three indices combine richness and abundance data for each taxon.
plot_richness(ps,
x = "Timepoint",
shape = "Treatment",
color = "Treatment",
measures = c("Observed", "Shannon", "Simpson", "InvSimpson"))
Notice that several T2:S3 samples have low measures, suggesting they have low diversity (i.e. abundances are dominated by a few taxa) compared to the other samples.
Beta diversity measures differences in microbial compositions between populations. If available (use ps_16S_V4_withtree.rds), we can use a phylogenetic tree and otu table to estimate the UniFrac for each sample pair.
The result is a distance matrix with height and width equal to the number of samples. UniFrac may be weighted or unweighted. Selection of weighted or unweighted will depend on the question you want to answer. Unweighted UniFrac uses only presence or absence of a taxon from the otu table (i.e. 1000 will be treated the same as 1). It is appropriate if you want to test qualitative taxa differences between samples. Weighted UniFrac uses presence as well as quantity data (i.e. 1 < 1000).
We are interested to know what changes to taxon abundance may be driven by our treatments, so we will use weighted UniFrac.
First we’ll make a weighted UniFrac distance matrix. For UniFrac, you require a tree as part of your phyloseq object.
## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1
## 101-S1 0.00000000 0.09488256 0.1767225 0.10183595 0.1973141 0.08065589
## 101-S3 0.09488256 0.00000000 0.1403436 0.06335683 0.1611763 0.10207325
## 101-S4 0.17672254 0.14034362 0.0000000 0.16667520 0.0712517 0.16746555
## 102-S3 0.10183595 0.06335683 0.1666752 0.00000000 0.1809738 0.09720146
## 102-S4 0.19731406 0.16117633 0.0712517 0.18097382 0.0000000 0.18488558
## 103-S1 0.08065589 0.10207325 0.1674656 0.09720146 0.1848856 0.00000000
# Note: If your *phyloseq* object lacks a phylogenetic tree,
# you could use a dissimilarity index that does not require one, such as Bray-Curtis:
# bray <- vegan::vegdist(otu_table(ps_prop), method = "bray")
# as.matrix(bray)[1:6,1:6]
Then perform Principal Coordinate Analysis (PCoA) on your favorite distance matrix. In this case, we are using weighted UniFrac.
do <- ordinate(ps_prop, method = "PCoA", distance = wUF)
baseplot <- plot_ordination(ps_prop, do,
type = "samples",
color = "Treatment",
shape = "Timepoint") +
ggtitle("HCC ordination, all samples\nmethod=PCoA, distance=weighted UniFrac") +
theme_bw()
# Single plot with all samples:
baseplot +
geom_point(size = 2)
# Idem, but with sample names on the plot:
baseplot +
geom_point(size = 1) +
geom_label(aes(label = SampleID), size = 3.5, color = "black")
# Facet by time point:
baseplot +
geom_point(size = 2) +
ggtitle("HCC ordination, all samples\nmethod=PCoA, distance=weighted UniFrac") +
facet_wrap(~Timepoint, 2)
We can see some grouping by timepoint. Depending on your experiment and questions, you might want to consider other ordination or clustering approaches.
Finally, we’ll test whether there is a significant ecological level treatment effect. PERMANOVA sounds fancy, but it is just an ANOVA performed using permutations. Permutations are used to determine how data may appear if there is no treatment effect and group differences are due to random chance. Observed data are then compared to the randomized data to calculate a p-value.
Treatments are independent within Timepoints, so we will first subset the data by Timepoint:
ps_S1 <- subset_samples(ps_prop, Timepoint == "S1")
ps_S3 <- subset_samples(ps_prop, Timepoint == "S3")
ps_S4 <- subset_samples(ps_prop, Timepoint == "S4")
Subset the metadata and estimate UniFrac for S3:
Then we’ll use vegan::adonis2
to perform the PERMANOVA.
set.seed(100) # Set the seed to make the analysis reproducible
permanova <- adonis2(unifrac_dist ~ Treatment,
data = metadata,
permutations = 10000)
permanova
## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 10000
##
## adonis2(formula = unifrac_dist ~ Treatment, data = metadata, permutations = 10000)
## Df SumOfSqs R2 F Pr(>F)
## Treatment 3 0.067035 0.24478 2.0528 0.0035 **
## Residual 19 0.206821 0.75522
## Total 22 0.273856 1.00000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The low p-value suggests that within the S3 timepoint (post-termination), there is a significant treatment effect.
Here, we will use DESeq2 to identify differentially abundant taxa between time points and treatments. DESeq2 was designed to analyze RNAseq datasets, which are similar to OTU/ASV data sets in that both handle large, sparse contingency tables generated from Illumina sequencing data. For more details, see McMurdie and Holmes 2014.
First, we’ll convert our non-normalized count data to a DESeq object.
## class: DESeqDataSet
## dim: 17600 71
## metadata(1): version
## assays(1): counts
## rownames(17600): ASV_4 ASV_5 ... ASV_18037 ASV_18038
## rowData names(0):
## colnames(71): 101-S1 101-S3 ... 604-S3 604-S4
## colData names(7): SampleID Experiment ... Plot PCR_prod_V4.V5
## DataFrame with 71 rows and 7 columns
## SampleID Experiment Treatment Timepoint Block Plot PCR_prod_V4.V5
## <factor> <factor> <factor> <factor> <factor> <factor> <numeric>
## 101-S1 101-S1 HCC T1 S1 R1 P101 0.464003
## 101-S3 101-S3 HCC T1 S3 R1 P101 1.000000
## 101-S4 101-S4 HCC T1 S4 R1 P101 0.629419
## 102-S3 102-S3 HCC T4 S3 R1 P102 0.925081
## 102-S4 102-S4 HCC T4 S4 R1 P102 0.746320
## ... ... ... ... ... ... ... ...
## 603-S4 603-S4 HCC T3 S4 R6 P603 0.796571
## 604-S1 604-S1 HCC T2 S1 R6 P604 0.779915
## 604-S2 604-S2 HCC T2 S2 R6 P604 0.730354
## 604-S3 604-S3 HCC T2 S3 R6 P604 0.926025
## 604-S4 604-S4 HCC T2 S4 R6 P604 0.817147
There are two ways to analyze interaction effects using DESeq2. The first is to fit a multivariate model (e.g. ~A+B+A:B) and explore the model coefficients. The second is to fit a univariate model and explore pairwise contrasts. Here, we will group our two factors (Treatment and Timepoint) and use the latter approach.
# Prepare the data:
dds$Group <- factor(paste0(dds$Treatment, dds$Timepoint))
design(dds) <- formula(~Group)
dds <- dds[, -50] # Remove tech_rep sample
dds$Group <- droplevels(dds$Group) # Drop tech_rep level from design matrix
# Perform the differential abundance analysis:
dds <- DESeq(dds)
results(dds)
## log2 fold change (MLE): Group T4S4 vs T1S1
## Wald test p-value: Group T4S4 vs T1S1
## DataFrame with 17600 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## ASV_4 163.314 -0.5518849 0.325626 -1.6948453 0.0901047681701
## ASV_5 161.981 1.6123820 0.282434 5.7088821 0.0000000113721
## ASV_7 115.697 0.7160631 0.322469 2.2205629 0.0263805813492
## ASV_8 104.576 0.0111895 0.371903 0.0300873 0.9759974233061
## ASV_9 107.540 0.7551121 0.235185 3.2107216 0.0013240215163
## ... ... ... ... ... ...
## ASV_18034 0.0339025 0.862942 7.06641 0.1221189 0.902805
## ASV_18035 0.0339025 0.862942 7.06641 0.1221189 0.902805
## ASV_18036 0.0124023 0.382212 7.06641 0.0540885 0.956865
## ASV_18037 0.0243719 0.862942 7.06641 0.1221189 0.902805
## ASV_18038 0.0243719 0.862942 7.06641 0.1221189 0.902805
## padj
## <numeric>
## ASV_4 0.20919428694
## ASV_5 0.00000586761
## ASV_7 0.09404214332
## ASV_8 0.98632468985
## ASV_9 0.01382801753
## ... ...
## ASV_18034 NA
## ASV_18035 NA
## ASV_18036 NA
## ASV_18037 NA
## ASV_18038 NA
Let’s explore the contrast, “T1S4 : T4S4” (no cc, seedling stage : late termination, seedling stage):
##
## out of 17572 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up) : 0, 0%
## LFC < 0 (down) : 1, 0.0057%
## outliers [1] : 46, 0.26%
## low counts [2] : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
# Number of taxa with significantly higher abundance in T4S4:
sum(res$padj < 0.1 & res$log2FoldChange > 0, na.rm = TRUE)
## [1] 0
# Number of taxa with significantly lower abundance in T4S4:
sum(res$padj < 0.1 & res$log2FoldChange < 0, na.rm = TRUE)
## [1] 1
We can sort results by p-value and look at the top ASVs:
## log2 fold change (MLE): Group T1S4 vs T4S4
##
## DataFrame with 6 rows and 3 columns
## baseMean log2FoldChange padj
## <numeric> <numeric> <numeric>
## ASV_3479 3.27868 -36.83123 4.44874e-12
## ASV_1740 2.85123 -3.53555 1.00000e+00
## ASV_1623 3.90335 1.91786 1.00000e+00
## ASV_336 9.93278 1.83339 1.00000e+00
## ASV_1950 3.67406 -2.23211 1.00000e+00
## ASV_487 9.63303 -1.24601 1.00000e+00
And plot counts for the first in the ordered list ASV:
We can also plot counts by sample_ID:
The plot can also be saved and customized with ggplot:
d <- plotCounts(dds, gene = "ASV_48", intgroup = "Group", returnData = TRUE)
ggplot(d, aes(x = Group, y = count)) +
geom_point(position = position_jitter(w = 0.1, h = 0), size = 2) +
scale_y_log10() +
ggtitle(expression(paste("ASV 48, log"[10], "-scale"))) +
theme_bw()
We can also subset our differentially abundant taxa to keep those with an adjusted p-value < 0.05:
## log2 fold change (MLE): Group T1S4 vs T4S4
## Wald test p-value: Group T1S4 vs T4S4
## DataFrame with 1 row and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## ASV_3479 3.27868 -36.8312 4.4952 -8.19345 2.53837e-16 4.44874e-12
What taxa are associated with significantly differentially abundant ASVs?
## Taxonomy Table: [1 taxa by 6 taxonomic ranks]:
## Kingdom Phylum Class Order
## ASV_3479 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
## Family Genus
## ASV_3479 "Clostridiaceae_1" "Clostridium_sensu_stricto_9"
Report that we are done!
## [1] "Done with ASV inference."
Articles on data normalization: