Analyze microbiome experimental data as a phyloseq object - explore ecological metrics and identify differentially abundant taxa.
This is a slightly modified version of a document created by Matthew Willman.
To convert an Rmd (R Markdown) file to an R script, type the following in an R console: knitr::purl(input="<filename>.Rmd"). (You can download this Rmd file by clicking the Code button in the top-right of this page – but we will open it at OSC.)
This assumes you still have an active RStudio Server job at OSC; if not see the instructions on the previous page to start one.
In the Files pane, in the markdown directory, open the file 08-postASV-analysis.Rmd. That’s this file!
Add our custom library again if you restarted your R session:
And load the necessary packages:
packages <- c('tidyverse', 'vegan', 'phyloseq', 'decontam', 'ape',
              'DESeq2', 'microbiome', 'metagenomeSeq', 'remotes',
              'ampvis2', 'breakaway')
pacman::p_load(char = packages)
# If you wanted to install and load these packages yourself,
# first make sure you have the pacman package installed:
## install.packages('pacman')
# Then, the code above would work except for the last two packages,
# which are available from Github only. To install these, you would run:
## remotes::install_github("MadsAlbertsen/ampvis2")
## remotes::install_github("adw96/breakaway")This session starts with a phyloseq object similar to the one generated using DADA2 in the previous session.
ps_raw <- readRDS(file.path(indir, 'ps_16S_V4_withtree.rds'))
# This would load the object from the previous session, which doesn't contain a tree:
# ps_raw <- readRDS(file.path(indir, 'ps_V4.rds'))Our phyloseq object is made up of the following components (slots):
otu_tablesample_datatax_tablephy_treeLet’s have a look at the object:
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 18038 taxa and 75 samples ]
## sample_data() Sample Data:       [ 75 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 18038 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 18038 tips and 18036 internal nodes ]## [1] "TACGAAGGGGGCTAGCGTTGCTCGGAATTACTGGGCGTAAAGGGAGCGTAGGCGGACATTTAAGTCAGGGGTGAAATCCCGGGGCTCAACCTCGGAATTGCCTTTGATACTGGGTGTCTTGAGTATGAGAGAGGTGTGTGGAACTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAACACCAGTGGCGAAGGCGACACACTGGCTCATTACTGACGCTGAGGCTCGAAAGCGTGGGGAGCAAACAGG"
## [2] "TACGTAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTATGTAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGCATGGCTAGAGTACGGTAGAGGGGGATGGAATTCCGCGTGTAGCAGTGAAATGCGTAGATATGCGGAGGAACACCGATGGCGAAGGCAATCCCCTGGACCTGTACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG"
## [3] "TACGTAGGGTGCGAGCGTTAATCGGAATTACTGGGCGTAAAGCGTGCGCAGGCGGTTTGTTAAGACAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATTTGTGACTGGCAGGCTAGAGTATGGCAGAGGGGGGTAGAATTCCACGTGTAGCAGTGAAATGCGTAGAGATGTGGAGGAATACCGATGGCGAAGGCAGCCCCCTGGGCCAATACTGACGCTCATGCACGAAAGCGTGGGGAGCAAACAGG"As we can see, the taxon names are currently the associated sequences. We can create a new phyloseq component to store these sequences, then rename the ASVs so something shorter (“ASV_1”, “ASV_2”, etc.).
dna <- Biostrings::DNAStringSet(taxa_names(ps_raw))
names(dna) <- taxa_names(ps_raw)
# Merge sequence object into the phyloseq object:
ps_raw <- merge_phyloseq(ps_raw, dna)
ps_raw## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 18038 taxa and 75 samples ]
## sample_data() Sample Data:       [ 75 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 18038 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 18038 tips and 18036 internal nodes ]
## refseq()      DNAStringSet:      [ 18038 reference sequences ]# Rename ASVs:
taxa_names(ps_raw) <- paste("ASV", 1:ntaxa(ps_raw), sep = "_")
taxa_names(ps_raw)[1:3]## [1] "ASV_1" "ASV_2" "ASV_3"It is possible to introduce contaminating microbes during sample preparation. Before analyzing the data, we will identify and remove probable contaminants using the decontam package.
In this case, we will define a conataminant as an ASV whose abundance correlates with DNA concentration (post-PCR). Our assumption here is that each soil taxon’s abundance should be independant of DNA concentration. However, if we were to spike each sample with a contaminant, that contaminant would show a greater abundance in samples with lower DNA concentrations. To do this, we need DNA concentration measured after PCR and before pooling of the samples.
These data are in column 7 of the sample data component:
##        SampleID Experiment Treatment Timepoint Block Plot PCR_prod_V4.V5
## 101-S1   101-S1        HCC        T1        S1    R1 P101       0.464003
## 101-S3   101-S3        HCC        T1        S3    R1 P101       1.000000
## 101-S4   101-S4        HCC        T1        S4    R1 P101       0.629419
## 102-S3   102-S3        HCC        T4        S3    R1 P102       0.925081
## 102-S4   102-S4        HCC        T4        S4    R1 P102       0.746320
## 103-S1   103-S1        HCC        T2        S1    R1 P103       0.556443MCIC measured our DNA concentration by comparing band intensities to a single reference band. MCIC’s DNA concentration measurements range from 0 to 1 (the reference). Zero in this case is coded as NA. The decontam package can’t handle NA’s or 0’s, so we need to change them to a small value (e.g. 0.01).
# First, we identify the NAs in the DNA measurement column:
NAs <- is.na(sample_data(ps_raw)$PCR_prod_V4.V5)
# Then, we replace NAs with 0.01:
sample_data(ps_raw)$PCR_prod_V4.V5[NAs] <- 0.01We will use decontam::isContaminant(method = "frequency") to test each taxon against the null hypothesis: abundance is not associated with DNA concentration. Because the DNA concentration measurements are not very accurate, and we want to make sure to remove possible contaminants, we use a relaxed p-value (0.2) to reject the null hypothesis:
contam_df_freq <- isContaminant(ps_raw,
                                method = "frequency",
                                conc = "PCR_prod_V4.V5",
                                threshold = 0.2)
head(contam_df_freq)##              freq prev     p.freq p.prev          p contaminant
## ASV_1 0.017740963   73 0.14200437     NA 0.14200437        TRUE
## ASV_2 0.013007310   73 0.08132858     NA 0.08132858        TRUE
## ASV_3 0.008321727   71 0.13124199     NA 0.13124199        TRUE
## ASV_4 0.006168516   72 0.94705975     NA 0.94705975       FALSE
## ASV_5 0.005170904   72 0.99239427     NA 0.99239427       FALSE
## ASV_6 0.005774690   37 0.62439728     NA 0.62439728       FALSEAs we can see, the top 3 ASVs are identified as probable contaminants. How many contaminants were identified (TRUE in the contaminant column) in total?
## 
## FALSE  TRUE 
## 17800   238What are the most abundant contaminant taxa?
## [1]   1   2   3  11  27 149# Check which taxa are contaminants:
ps_contam <- prune_taxa(contam_df_freq$contaminant, ps_raw)
head(tax_table(ps_contam))## Taxonomy Table:     [6 taxa by 6 taxonomic ranks]:
##         Kingdom    Phylum           Class                
## ASV_1   "Bacteria" "Proteobacteria" "Alphaproteobacteria"
## ASV_2   "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## ASV_3   "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## ASV_11  "Bacteria" "Proteobacteria" "Gammaproteobacteria"
## ASV_27  "Bacteria" "Actinobacteria" "Actinobacteria"     
## ASV_149 "Bacteria" "Proteobacteria" "Alphaproteobacteria"
##         Order                   Family             
## ASV_1   "Caulobacterales"       "Caulobacteraceae" 
## ASV_2   "Betaproteobacteriales" "Burkholderiaceae" 
## ASV_3   "Betaproteobacteriales" "Burkholderiaceae" 
## ASV_11  "Betaproteobacteriales" "Burkholderiaceae" 
## ASV_27  "Micrococcales"         "Micrococcaceae"   
## ASV_149 "Rhizobiales"           "Xanthobacteraceae"
##         Genus                                       
## ASV_1   "Brevundimonas"                             
## ASV_2   "Delftia"                                   
## ASV_3   "Burkholderia-Caballeronia-Paraburkholderia"
## ASV_11  "Delftia"                                   
## ASV_27  "Renibacterium"                             
## ASV_149 NALet’s take a look at our tested associations for several taxa:
plot_frequency(ps_raw,
               taxa_names(ps_raw)[c(1, 2, 3, 4, 5, 11)],
               conc = "PCR_prod_V4.V5") +
  xlab("DNA Concentration (relative to single marker)")As we can see, abundance of ASVs 1, 2, 3, and 11 are particularly high in samples with low DNA concentration.
Which samples have highest abundances of the contaminants?
## OTU Table:          [1 taxa and 6 samples]
##                      taxa are columns
##        ASV_1
## 204-S3  6710
## blankM  5944
## 103-S3  4337
## 301-S3  3032
## 104-S3  1943
## 304-S3  1153## OTU Table:          [1 taxa and 6 samples]
##                      taxa are columns
##        ASV_2
## 204-S3  6261
## blankM  4403
## 103-S3  3755
## 301-S3  1707
## 104-S3   683
## 304-S3   664How do these contaminants overlap with ASVs present in the negative control?
## OTU Table:          [11 taxa and 6 samples]
##                      taxa are columns
##        ASV_1 ASV_2 ASV_3 ASV_4 ASV_5 ASV_6 ASV_7 ASV_8 ASV_9 ASV_10 ASV_11
## 604-S1    25    24    25   492   136     0   139   350   137    139      5
## 604-S2    31    10    11   183   108    21   159   189    79    251      3
## 604-S3   127    72    99   212   260     3   131   132   123    141     18
## 604-S4    46    32     6   101   251     0    54    54    73     66     30
## blankM  5944  4403   169     0     0     0     0     0     0      0   1476
## H20        0     0     0     0     0     0     9     0     0      0      0ASVs have high read counts in the extraction negative control. Thus, we can be fairly certain these are contaminants and not representative of our experimental samples.
Remove the contaminants that we identified:
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 17800 taxa and 75 samples ]
## sample_data() Sample Data:       [ 75 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 17800 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 17800 tips and 17798 internal nodes ]
## refseq()      DNAStringSet:      [ 17800 reference sequences ]What proportion of our count data were removed as contaminants?
## [1] 0.03797457The V4 region of 16S rRNA is conserved within certain bacteria, archaea, chloroplasts, mitochondria, and eukaryotes. Since we are only interested in bacteria and archaea here, we’ll need to remove other taxa from our object.
# Create ps subsets for chloroplast, mitochondria, and eukaryotes:
chlr <- subset_taxa(ps_noncontam, Order == "Chloroplast")
mit <- subset_taxa(ps_noncontam, Family == "Mitochondria")
euk <- subset_taxa(ps_noncontam, Kingdom == "Eukaryota")
# Get all taxon IDs that we want to remove:
bad_taxa <- c(taxa_names(chlr), taxa_names(mit), taxa_names(euk))
# Get all taxon IDs that we want to keep:
all_taxa <- taxa_names(ps_noncontam)
good_taxa <- all_taxa[!(all_taxa %in% bad_taxa)]
# Subset phyloseq object:
ps_bac_arc <- prune_taxa(good_taxa, ps_noncontam)What proportion of ASVs were kept?
## [1] 0.9919178Finally, we’ll also remove non-experimental samples (e.g. negative controls). Note, though, that non-experimental samples are important for evaluating the quality of your reagents, consistency across sequencing runs, and contamination.
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 17600 taxa and 73 samples ]
## sample_data() Sample Data:       [ 73 samples by 7 sample variables ]
## tax_table()   Taxonomy Table:    [ 17600 taxa by 6 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 17600 tips and 17598 internal nodes ]
## refseq()      DNAStringSet:      [ 17600 reference sequences ]After having filtered out unwanted taxa, we now need to filter out uninformative samples – those with low total taxon counts.
First, how many counts do we have for each sample?
## 403-S3 203-S3 103-S3 204-S3 104-S3 301-S1 301-S3 204-S1 503-S4 603-S3 302-S3 
##      0   3258   4157   7687   7841  13441  14860  14897  15004  19423  20815 
## 403-S1 502-S1 201-S3 402-S1 202-S1 603-S1 604-S4 303-S1 504-S4 404-S4 604-S3 
##  21405  23248  23283  23283  24329  25145  25568  25708  25913  25948  26196 
## 601-S4 203-S1 503-S2 102-S3 403-S4 402-S4 601-S1 201-S4 504-S1 401-S3 401-S4 
##  26259  26598  26821  27558  28119  28340  28408  29079  29106  29306  29375 
## 503-S1 204-S4 502-S4 303-S3 301-S4 204-S2 304-S4 403-S2 401-S1 202-S4 604-S1 
##  29465  29882  30181  30201  30229  30575  31209  31351  31594  31978  32157 
## 501-S4 603-S4 202-S3 604-S2 304-S3 104-S1  501S4 103-S2 301-S2 103-S4 303-S4 
##  32480  32577  33410  33618  33689  33852  33951  34076  34557  34714  35339 
## 404-S3 203-S4 602-S4 101-S3 402-S3 302-S4 501-S3 104-S4 103-S1 102-S4 502-S3 
##  36038  36261  36653  36786  36799  38114  38300  38725  38928  40180  40416 
## 304-S1 101-S1 504-S3 602-S3 101-S4 601-S3 503-S3 
##  41181  50997  52898  54757  56340  57665  67738## [1] 0## [1] "403-S3"To remove uninformative samples, we will only keep those with over 1,000 counts. We will also remove a technical replicate, though note that technical replicates are important for quality control of runs and for understanding the reproducibility of the methodology.
# Remove samples with low counts:
ps <- subset_samples(ps, sample_sums(ps) > 1000)
# Remove the technical replicate:
ps <- subset_samples(ps, SampleID != "501S4")Let’s take a look at our current phyloseq object:
##        SampleID Experiment Treatment Timepoint Block Plot PCR_prod_V4.V5
## 101-S1   101-S1        HCC        T1        S1    R1 P101       0.464003
## 101-S3   101-S3        HCC        T1        S3    R1 P101       1.000000
## 101-S4   101-S4        HCC        T1        S4    R1 P101       0.629419
## 102-S3   102-S3        HCC        T4        S3    R1 P102       0.925081
## 102-S4   102-S4        HCC        T4        S4    R1 P102       0.746320
## 103-S1   103-S1        HCC        T2        S1    R1 P103       0.556443
## 103-S2   103-S2        HCC        T2        S2    R1 P103       0.741777
## 103-S3   103-S3        HCC        T2        S3    R1 P103       0.155553
## 103-S4   103-S4        HCC        T2        S4    R1 P103       0.768266
## 104-S1   104-S1        HCC        T3        S1    R1 P104       0.742849## OTU Table:          [6 taxa and 10 samples]
##                      taxa are rows
##        101-S1 101-S3 101-S4 102-S3 102-S4 103-S1 103-S2 103-S3 103-S4 104-S1
## ASV_4     556    306    241    128    122    343    493      5     81    375
## ASV_5     227    306    255    254    185    189    259     30    155    190
## ASV_7      98    165     93    278    101    130    286     41     84    116
## ASV_8     423    187    188    133    129    197    329      0     65    257
## ASV_9     153    196    260    153    119    124    162      9     66    124
## ASV_10    192    192    319     72    186     73     69      1    153    119## Taxonomy Table:     [6 taxa by 6 taxonomic ranks]:
##        Kingdom    Phylum           Class                        
## ASV_4  "Bacteria" "Acidobacteria"  "Blastocatellia_(Subgroup_4)"
## ASV_5  "Bacteria" "Proteobacteria" "Gammaproteobacteria"        
## ASV_7  "Bacteria" "Proteobacteria" "Alphaproteobacteria"        
## ASV_8  "Bacteria" "Acidobacteria"  "Blastocatellia_(Subgroup_4)"
## ASV_9  "Bacteria" "Nitrospirae"    "Nitrospira"                 
## ASV_10 "Bacteria" "Bacteroidetes"  "Bacteroidia"                
##        Order                   Family              Genus         
## ASV_4  "Pyrinomonadales"       "Pyrinomonadaceae"  "RB41"        
## ASV_5  "Betaproteobacteriales" "SC-I-84"           NA            
## ASV_7  "Rhizobiales"           "Xanthobacteraceae" "Pseudolabrys"
## ASV_8  "Pyrinomonadales"       "Pyrinomonadaceae"  "RB41"        
## ASV_9  "Nitrospirales"         "Nitrospiraceae"    "Nitrospira"  
## ASV_10 "Chitinophagales"       "Chitinophagaceae"  NANormalization is currently a much-discussed issue of microbiome studies. Differences in read depth between samples often need to be corrected before analysis. Several normalization methods have been proposed, and no single method is perfect. It may be that the most appropriate method depends on the analysis.
In this tutorial, we will use:
Proportion-normalized data to estimate ecological metrics.
Stabilizing transformation normalized data to identify differentially abundant taxa.
However, for your reference, we will also present code for two other normalization methods: rarefaction and cumulative sum scaling.
For more details, read McMurdie and Holmes 2014 and McKnight et al. 2019.
Proportion normalization involves dividing each OTU count by the total sum for each sample. The resulting count data will add up to 1 (100%) for each sample.
The microbiome::transform function can be used to easily normalize count data as proportions in a phyloseq object:
# Proportion normalization:
ps_prop <- transform(ps, "compositional")
# Have a look at the resulting OTU table:
otu_table(ps_prop)[1:5, 1:5]## OTU Table:          [5 taxa and 5 samples]
##                      taxa are columns
##              ASV_4       ASV_5       ASV_7       ASV_8       ASV_9
## 101-S1 0.010902602 0.004451242 0.001921682 0.008294606 0.003000176
## 101-S3 0.008318382 0.008318382 0.004485402 0.005083456 0.005328114
## 101-S4 0.004277600 0.004526092 0.001650692 0.003336883 0.004614838
## 102-S3 0.004644749 0.009216924 0.010087815 0.004826185 0.005551927
## 102-S4 0.003036336 0.004604281 0.002513688 0.003210553 0.002961672## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1 
##      1      1      1      1      1      1This transformation method, available in DESeq2, normalizes data with respect to library size as well as dispersion.
Type ?vst or read Anders and Huber 2010 for more information.
# First we convert the phyloseq object to a DESeq object:
dds <- phyloseq_to_deseq2(ps, ~1)
# Then we estimate the size factors:
dds <- estimateSizeFactors(dds)
# Now can do the transformation:
dds_vst <- vst(dds, blind = TRUE)
# Have a look at the resulting counts:
vst_counts <- assay(dds_vst)
t(vst_counts)[1:5, 1:5]##           ASV_4    ASV_5    ASV_7    ASV_8    ASV_9
## 101-S1 8.002095 6.729126 5.559152 7.611915 6.175558
## 101-S3 7.459397 7.459397 6.585002 6.761345 6.827700
## 101-S4 7.388440 7.468770 6.047047 7.035948 7.496408
## 102-S3 6.756474 7.729205 7.858050 6.810556 7.008636
## 102-S4 7.108518 7.700594 6.841187 7.187648 7.073230Rarefaction can be used to subset data such that the library depth is the same for each sample. Because sampling of the data is random, rarefaction can account for an effect of total read count on taxa richness. However, rarefaction is no longer considered to be a good way to normalize amplicon sequencing data (see McMurdie & Holmes 2014).
ps_rarefied <- rarefy_even_depth(ps,
                                 rngseed = 1,
                                 sample.size = min(sample_sums(ps)),
                                 replace = FALSE)
# Have a look at the result:
otu_table(ps_rarefied)[1:5, 1:5]## OTU Table:          [5 taxa and 5 samples]
##                      taxa are columns
##        ASV_4 ASV_5 ASV_7 ASV_8 ASV_9
## 101-S1    24    13    12    32     7
## 101-S3    26    22    20     9    19
## 101-S4    15    16     6    12    17
## 102-S3    15    26    33    19    11
## 102-S4     6    15     5     9    18## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1 
##   3258   3258   3258   3258   3258   3258Compare these sample sums to the non-rarified data sums:
## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1 
##  50997  36786  56340  27558  40180  38928The metagenomeSeq Cumulative Sum Scaling (CSS) normalization is another option developed for microbiome data. For more information, read Paulson et al. 2013.
# Convert the phyloseq object to a metagenomeseq object:
mgs_css <- phyloseq_to_metagenomeSeq(ps)
# Perform the Cumulative Sum Scaling:
mgs_css <- cumNorm(mgs_css)
# Extract the counts and add them to a separate phyloseq object:
css_counts <- MRcounts(mgs_css, norm = TRUE)
ps_css <- ps
otu_table(ps_css) <- otu_table(t(css_counts), taxa_are_rows = FALSE)Now lets compare the original data to the CSS normalized data:
## OTU Table:          [5 taxa and 5 samples]
##                      taxa are columns
##        ASV_4 ASV_5 ASV_7 ASV_8 ASV_9
## 101-S1   556   227    98   423   153
## 101-S3   306   306   165   187   196
## 101-S4   241   255    93   188   260
## 102-S3   128   254   278   133   153
## 102-S4   122   185   101   129   119## 101-S1 101-S3 101-S4 102-S3 102-S4 103-S1 
##  50997  36786  56340  27558  40180  38928## OTU Table:          [5 taxa and 5 samples]
##                      taxa are columns
##           ASV_4    ASV_5    ASV_7    ASV_8    ASV_9
## 101-S1 86.05479 35.13388 15.16793 65.46974 23.68054
## 101-S3 51.81172 51.81172 27.93769 31.66272 33.18659
## 101-S4 37.81579 40.01255 14.59281 29.49945 40.79711
## 102-S3 23.92971 47.48551 51.97233 24.86446 28.60348
## 102-S4 19.34052 29.32784 16.01141 20.45022 18.86493##   101-S1   101-S3   101-S4   102-S3   102-S4   103-S1 
## 7893.051 6228.581 8840.421 5151.991 6369.689 6386.874In this and following the examples we will use the proportions data.
# Subset to the T1 treatment
T1 <- subset_samples(ps_prop, Treatment == "T1")
# Agglomerate taxa to phylum level:
T1_phylum <- tax_glom(T1, taxrank = "Phylum", NArm = FALSE)
# Remove uncommon phyla:
T1_phylum <- subset_taxa(T1_phylum, taxa_sums(T1_phylum) > 0.1)
# Plot:
plot_bar(T1_phylum, fill = "Phylum") +
   facet_wrap(~Timepoint, scales = "free_x", nrow = 1)Alpha diversity measures taxonomic diversity within a single population. Measures of alpha diversity include taxonomic richness (i.e. number of taxa), and indices which combine taxonomic richness with some measure of evenness.
Note: Many of these methods are greatly influenced by singleton data (i.e. taxa represented by a single count), meaning they may be unreliable if your data excludes singletons. DADA2 does not output singletons if it is run individually on each sample. Thus, if using dada2 to infer ASVs, it is advisable to use pool = TRUE when running the dada algorithm. This will allow calling per-sample singletons, but not per-study singletons.
Our goal in defining richness is to determine the number of unique taxa within each population (in our case, each plot). This is difficult, as our samples contain only a portion of the total richness within our plots. As sequence depth increases for each sample, so does the number of taxa. This can be illustrated with a rarefaction plot.
To make a rarefaction plot, we draw random samples from our data and count the number of unique ASVs as samples are drawn. The resulting rarefaction curve is expected to rise quickly then plateu as the most abundant taxa are represented. We can make a quick rarefaction curve plot directly from our phyloseq object of all samples using the vegan package:
We can also split rarefaction curves by group using the ampvis2 package.
metadata <- data.frame(sample_data(ps), check.names = FALSE)
asv_table <- data.frame(t(otu_table(ps)),
                       tax_table(ps),
                       check.names = FALSE)ampvis2 requires a Species column, which must be added to our data. If your table already has a Species column, skip this step:
Load an ampvis2 object and calculate rarefaction:
ps3 <- amp_load(asv_table, metadata)
rar_plot <- amp_rarecurve(ps3,
                          stepsize = 100,
                          facet_by = "Timepoint",
                          color_by = "Treatment") +
   ylab("Number of observed ASVs")
rar_plotAs we can see, as read depth increases for each sample, so does taxonomic richness. In all samples, increased sampling would result in increased richness. Thus, our observed sample richness is lower than the true population richness for every plot.
Further, sequence depth and total richness appear to be correlated across samples:
dp_by_asv <- data.frame(ct_sums = sample_sums(ps),
                        asvs = rowSums(otu_table(ps) != 0))
ggplot(data = dp_by_asv) +
  geom_point(aes(x =  ct_sums, y = asvs)) +
  labs(x = "Sequence depth", y= "ASV count") +
  theme_minimal()We can test whether this correlation is significant:
## Warning in cor.test.default(dp_by_asv$ct_sums, dp_by_asv$asvs, method =
## "spearman"): Cannot compute exact p-value with ties## 
##  Spearman's rank correlation rho
## 
## data:  dp_by_asv$ct_sums and dp_by_asv$asvs
## S = 17650, p-value = 7.412e-12
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##       rho 
## 0.7040636Here we observe a significant correlation between read count and observed taxonomic richness.
Thus, a sample from a population with high actual diversity, but low read depth, could have low observed richness. Fortunately, there are methods to estimate the number of unobserved taxa. For an overview of these methods, read Bunge et al. 2014.
To illustrate the difficulty of estimating population richness and importance of singletons to alpha diversity metrics, we’ll look at the frequency count distribution for our first sample: 101-S1.
We’ll count the number of singletons, doubletons, etc., then plot frequency as a function of count.
# Get the frequency count distribution for the 1st sample:
frequencytablelist <- build_frequency_count_tables(otu_table(ps))
freq_ct <- frequencytablelist[[1]]
colnames(freq_ct) <- c("Frequency", "Count")
# Plot:
ggplot(freq_ct, aes(Frequency, Count)) + 
  geom_point() + 
  scale_x_continuous(breaks = seq(0, 600, by = 50)) + 
  ggtitle("101-S1") +
  theme_minimal()We can see there are ~3,200 taxa represented as singletons, and ~1,600 taxa as doubletons. This is a high proportion of the total number taxa are observed in this sample:
## [1] 8646## [1] 0.3693037How many more taxa might we observe if we increased our sampling?
We can use several indices to explore how evenly species are distributed within each sample. Note that ‘Observed’ is the sample taxonomic richness. The other three indices combine richness and abundance data for each taxon.
plot_richness(ps,
              x = "Timepoint",
              shape = "Treatment",
              color = "Treatment", 
              measures = c("Observed", "Shannon", "Simpson", "InvSimpson"))Notice that several T2:S3 samples have low measures, suggesting they have low diversity (i.e. abundances are dominated by a few taxa) compared to the other samples.
Beta diversity measures differences in microbial compositions between populations. If available (use ps_16S_V4_withtree.rds), we can use a phylogenetic tree and otu table to estimate the UniFrac for each sample pair.
The result is a distance matrix with height and width equal to the number of samples. UniFrac may be weighted or unweighted. Selection of weighted or unweighted will depend on the question you want to answer. Unweighted UniFrac uses only presence or absence of a taxon from the otu table (i.e. 1000 will be treated the same as 1). It is appropriate if you want to test qualitative taxa differences between samples. Weighted UniFrac uses presence as well as quantity data (i.e. 1 < 1000).
We are interested to know what changes to taxon abundance may be driven by our treatments, so we will use weighted UniFrac.
First we’ll make a weighted UniFrac distance matrix. For UniFrac, you require a tree as part of your phyloseq object.
##            101-S1     101-S3    101-S4     102-S3    102-S4     103-S1
## 101-S1 0.00000000 0.09488256 0.1767225 0.10183595 0.1973141 0.08065589
## 101-S3 0.09488256 0.00000000 0.1403436 0.06335683 0.1611763 0.10207325
## 101-S4 0.17672254 0.14034362 0.0000000 0.16667520 0.0712517 0.16746555
## 102-S3 0.10183595 0.06335683 0.1666752 0.00000000 0.1809738 0.09720146
## 102-S4 0.19731406 0.16117633 0.0712517 0.18097382 0.0000000 0.18488558
## 103-S1 0.08065589 0.10207325 0.1674656 0.09720146 0.1848856 0.00000000# Note: If your *phyloseq* object lacks a phylogenetic tree,
# you could use a dissimilarity index that does not require one, such as Bray-Curtis:
# bray <- vegan::vegdist(otu_table(ps_prop), method = "bray")
# as.matrix(bray)[1:6,1:6]Then perform Principal Coordinate Analysis (PCoA) on your favorite distance matrix. In this case, we are using weighted UniFrac.
do <- ordinate(ps_prop, method = "PCoA", distance = wUF)
baseplot <- plot_ordination(ps_prop, do,
                            type = "samples",
                            color = "Treatment",
                            shape = "Timepoint") +
  ggtitle("HCC ordination, all samples\nmethod=PCoA, distance=weighted UniFrac") +
  theme_bw()
# Single plot with all samples:
baseplot +
  geom_point(size = 2)# Idem, but with sample names on the plot: 
baseplot +
  geom_point(size = 1) + 
  geom_label(aes(label = SampleID), size = 3.5, color = "black")# Facet by time point:
baseplot +
  geom_point(size = 2) + 
  ggtitle("HCC ordination, all samples\nmethod=PCoA, distance=weighted UniFrac") +
  facet_wrap(~Timepoint, 2)We can see some grouping by timepoint. Depending on your experiment and questions, you might want to consider other ordination or clustering approaches.
Finally, we’ll test whether there is a significant ecological level treatment effect. PERMANOVA sounds fancy, but it is just an ANOVA performed using permutations. Permutations are used to determine how data may appear if there is no treatment effect and group differences are due to random chance. Observed data are then compared to the randomized data to calculate a p-value.
Treatments are independent within Timepoints, so we will first subset the data by Timepoint:
ps_S1 <- subset_samples(ps_prop, Timepoint == "S1")
ps_S3 <- subset_samples(ps_prop, Timepoint == "S3")
ps_S4 <- subset_samples(ps_prop, Timepoint == "S4")Subset the metadata and estimate UniFrac for S3:
Then we’ll use vegan::adonis2 to perform the PERMANOVA.
set.seed(100)  # Set the seed to make the analysis reproducible
permanova <- adonis2(unifrac_dist ~ Treatment,
                     data = metadata,
                     permutations = 10000)
permanova## Permutation test for adonis under reduced model
## Terms added sequentially (first to last)
## Permutation: free
## Number of permutations: 10000
## 
## adonis2(formula = unifrac_dist ~ Treatment, data = metadata, permutations = 10000)
##           Df SumOfSqs      R2      F Pr(>F)   
## Treatment  3 0.067035 0.24478 2.0528 0.0035 **
## Residual  19 0.206821 0.75522                 
## Total     22 0.273856 1.00000                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1The low p-value suggests that within the S3 timepoint (post-termination), there is a significant treatment effect.
Here, we will use DESeq2 to identify differentially abundant taxa between time points and treatments. DESeq2 was designed to analyze RNAseq datasets, which are similar to OTU/ASV data sets in that both handle large, sparse contingency tables generated from Illumina sequencing data. For more details, see McMurdie and Holmes 2014.
First, we’ll convert our non-normalized count data to a DESeq object.
## class: DESeqDataSet 
## dim: 17600 71 
## metadata(1): version
## assays(1): counts
## rownames(17600): ASV_4 ASV_5 ... ASV_18037 ASV_18038
## rowData names(0):
## colnames(71): 101-S1 101-S3 ... 604-S3 604-S4
## colData names(7): SampleID Experiment ... Plot PCR_prod_V4.V5## DataFrame with 71 rows and 7 columns
##        SampleID Experiment Treatment Timepoint    Block     Plot PCR_prod_V4.V5
##        <factor>   <factor>  <factor>  <factor> <factor> <factor>      <numeric>
## 101-S1   101-S1        HCC        T1        S1       R1     P101       0.464003
## 101-S3   101-S3        HCC        T1        S3       R1     P101       1.000000
## 101-S4   101-S4        HCC        T1        S4       R1     P101       0.629419
## 102-S3   102-S3        HCC        T4        S3       R1     P102       0.925081
## 102-S4   102-S4        HCC        T4        S4       R1     P102       0.746320
## ...         ...        ...       ...       ...      ...      ...            ...
## 603-S4   603-S4        HCC        T3        S4       R6     P603       0.796571
## 604-S1   604-S1        HCC        T2        S1       R6     P604       0.779915
## 604-S2   604-S2        HCC        T2        S2       R6     P604       0.730354
## 604-S3   604-S3        HCC        T2        S3       R6     P604       0.926025
## 604-S4   604-S4        HCC        T2        S4       R6     P604       0.817147There are two ways to analyze interaction effects using DESeq2. The first is to fit a multivariate model (e.g. ~A+B+A:B) and explore the model coefficients. The second is to fit a univariate model and explore pairwise contrasts. Here, we will group our two factors (Treatment and Timepoint) and use the latter approach.
# Prepare the data:
dds$Group <- factor(paste0(dds$Treatment, dds$Timepoint))
design(dds) <- formula(~Group)
dds <- dds[, -50]                    # Remove tech_rep sample
dds$Group <- droplevels(dds$Group)   # Drop tech_rep level from design matrix
# Perform the differential abundance analysis:
dds <- DESeq(dds)
results(dds)## log2 fold change (MLE): Group T4S4 vs T1S1 
## Wald test p-value: Group T4S4 vs T1S1 
## DataFrame with 17600 rows and 6 columns
##            baseMean log2FoldChange     lfcSE       stat          pvalue
##           <numeric>      <numeric> <numeric>  <numeric>       <numeric>
## ASV_4       163.314     -0.5518849  0.325626 -1.6948453 0.0901047681701
## ASV_5       161.981      1.6123820  0.282434  5.7088821 0.0000000113721
## ASV_7       115.697      0.7160631  0.322469  2.2205629 0.0263805813492
## ASV_8       104.576      0.0111895  0.371903  0.0300873 0.9759974233061
## ASV_9       107.540      0.7551121  0.235185  3.2107216 0.0013240215163
## ...             ...            ...       ...        ...             ...
## ASV_18034 0.0339025       0.862942   7.06641  0.1221189        0.902805
## ASV_18035 0.0339025       0.862942   7.06641  0.1221189        0.902805
## ASV_18036 0.0124023       0.382212   7.06641  0.0540885        0.956865
## ASV_18037 0.0243719       0.862942   7.06641  0.1221189        0.902805
## ASV_18038 0.0243719       0.862942   7.06641  0.1221189        0.902805
##                    padj
##               <numeric>
## ASV_4     0.20919428694
## ASV_5     0.00000586761
## ASV_7     0.09404214332
## ASV_8     0.98632468985
## ASV_9     0.01382801753
## ...                 ...
## ASV_18034            NA
## ASV_18035            NA
## ASV_18036            NA
## ASV_18037            NA
## ASV_18038            NALet’s explore the contrast, “T1S4 : T4S4” (no cc, seedling stage : late termination, seedling stage):
## 
## out of 17572 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 0, 0%
## LFC < 0 (down)     : 1, 0.0057%
## outliers [1]       : 46, 0.26%
## low counts [2]     : 0, 0%
## (mean count < 0)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results# Number of taxa with significantly higher abundance in T4S4:
sum(res$padj < 0.1 & res$log2FoldChange > 0, na.rm = TRUE)## [1] 0# Number of taxa with significantly lower abundance in T4S4:
sum(res$padj < 0.1 & res$log2FoldChange < 0, na.rm = TRUE)## [1] 1We can sort results by p-value and look at the top ASVs:
## log2 fold change (MLE): Group T1S4 vs T4S4 
##  
## DataFrame with 6 rows and 3 columns
##           baseMean log2FoldChange        padj
##          <numeric>      <numeric>   <numeric>
## ASV_3479   3.27868      -36.83123 4.44874e-12
## ASV_1740   2.85123       -3.53555 1.00000e+00
## ASV_1623   3.90335        1.91786 1.00000e+00
## ASV_336    9.93278        1.83339 1.00000e+00
## ASV_1950   3.67406       -2.23211 1.00000e+00
## ASV_487    9.63303       -1.24601 1.00000e+00And plot counts for the first in the ordered list ASV:
We can also plot counts by sample_ID:
The plot can also be saved and customized with ggplot:
d <- plotCounts(dds, gene = "ASV_48", intgroup = "Group", returnData = TRUE)
ggplot(d, aes(x = Group, y = count)) +
  geom_point(position = position_jitter(w = 0.1, h = 0), size = 2) +
  scale_y_log10() +
  ggtitle(expression(paste("ASV 48, log"[10], "-scale"))) +
  theme_bw()We can also subset our differentially abundant taxa to keep those with an adjusted p-value < 0.05:
## log2 fold change (MLE): Group T1S4 vs T4S4 
## Wald test p-value: Group T1S4 vs T4S4 
## DataFrame with 1 row and 6 columns
##           baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##          <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## ASV_3479   3.27868       -36.8312    4.4952  -8.19345 2.53837e-16 4.44874e-12What taxa are associated with significantly differentially abundant ASVs?
## Taxonomy Table:     [1 taxa by 6 taxonomic ranks]:
##          Kingdom    Phylum       Class        Order          
## ASV_3479 "Bacteria" "Firmicutes" "Clostridia" "Clostridiales"
##          Family             Genus                        
## ASV_3479 "Clostridiaceae_1" "Clostridium_sensu_stricto_9"Report that we are done!
## [1] "Done with ASV inference."Articles on data normalization: