Process metabarcoding reads (post-adapter removal).
This document was adapted from Callahan et al. 2016 by Matthew Willman, with further edits by Soledad Benitez Ponce and Jelmer Poelstra.
To convert an Rmd (R Markdown) file to an R script, type the following in an R console: knitr::purl(input='<filename>.Rmd'). (You can download this Rmd file by clicking the Code button in the top-right of this page – but we will open it at OSC.)
For more detailed instructions for these first steps, see this section from our intro to R session.
Log in to OSC at https://ondemand.osc.edu.
Click on Interactive Apps (top bar) > RStudio Server (Owens and Pitzer).
Fill out the form as shown here.
Once your job has started, click Connect to RStudio Server.
You should automatically be in your personal directory inside the directory /fs/project/PAS0471/workshops/2020-12_micro, with your RStudio Project open. You can see whether a Project is open, and which one, in the top-right of your screen:
Here, the project jelmer is open. Your project name is also your username.
If your Project isn’t open, click on the icon to open it:
Now, click on the markdown directory in the Files pane, and open the file 07-ASV-inference.Rmd. That's this file!
Print some information to the screen, which is useful when running this script as a job:
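The code that produces the output below is not echoed in this document; a minimal sketch of equivalent code (base R only) would be:

print('Starting ASV inference script...')
Sys.time()
cat('Working directory:', getwd(), '\n')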
## [1] "Starting ASV inference script..."
## [1] "2020-12-15 17:36:50 EST"
## Working directory: /home/jelmer/Dropbox/mcic/teach/wsh/2020-12_microbiom/OSC/master
Most dada2 functions can use multiple cores. Because we are on a cluster and we have reserved only part of a node, auto-detection of the number of cores will not be appropriate (the program will detect more cores than we have available). Therefore, we should specify the appropriate number of cores in our function call below.
We will set the number of cores here; this assumes you requested 4 cores in your job submission (change the value if needed):
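For example (the chunk itself is not shown here; use the number of cores you requested):

n_cores <- 4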
To save time, we have already installed all the necessary R packages at OSC into a custom library. To add this library for the current R session:
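The code for this step is not shown here. A minimal sketch uses .libPaths(); the library path below is a placeholder, not the actual workshop path:

# Prepend the custom library so its packages are found first
# (hypothetical path; use the path provided during the workshop):
.libPaths(c('/fs/project/PAS0471/path/to/R_library', .libPaths()))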
Then, load the packages:
## [1] "Loading packages..."
packages <- c('tidyverse', 'gridExtra', 'dada2',
'phyloseq', 'DECIPHER', 'phangorn')
pacman::p_load(char = packages)
# If you wanted to install and load these packages yourself,
# just make sure you have the pacman package installed:
## install.packages('pacman')
# Then, the code above would work, as it will install packages as needed.
We’ll set most of the file paths upfront, which will make it easier to change things or troubleshoot.
# Dir with input fastq files:
indir <- 'data/processed/fastq_trimmed'
# Dirs for output:
filter_dir <- 'data/processed/fastq_filtered'
outdir <- 'analysis/ASV_inference'
dir.create(filter_dir, showWarnings = FALSE, recursive = TRUE)
dir.create(outdir, showWarnings = FALSE, recursive = TRUE)
# Fasta file with training data:
# (Check for an up-to-date version at <https://benjjneb.github.io/dada2/training.html>)
tax_key <- 'data/ref/silva_nr99_v138_train_set.fa'
# File with sample metadata:
metadata_file <- 'metadata/sample_meta.txt'
We will assign the fastq files that we processed with cutadapt to two vectors: one with files with forward reads, and one with files with reverse reads. These files can be distinguished by having “R1” (forward) and “R2” (reverse) in their names.
fastqs_raw_F <- sort(list.files(indir, pattern = '_R1_001.fastq.gz', full.names = TRUE))
fastqs_raw_R <- sort(list.files(indir, pattern = '_R2_001.fastq.gz', full.names = TRUE))
print('First fastq files:')
## [1] "First fastq files:"
## [1] "data/processed/fastq_trimmed/101-S1-V4-V5_S1_L001_R1_001.fastq.gz"
## [2] "data/processed/fastq_trimmed/101-S4-V4-V5_S49_L001_R1_001.fastq.gz"
## [3] "data/processed/fastq_trimmed/102-S4-V4-V5_S50_L001_R1_001.fastq.gz"
## [4] "data/processed/fastq_trimmed/103-S4-V4-V5_S51_L001_R1_001.fastq.gz"
## [5] "data/processed/fastq_trimmed/104-S4-V4-V5_S52_L001_R1_001.fastq.gz"
## [6] "data/processed/fastq_trimmed/201-S4-V4-V5_S53_L001_R1_001.fastq.gz"
We’ll get the sample IDs from the fastq file names and from a file with metadata, and will check if they are the same. First we’ll prepare the metadata:
## [1] "Load and prepare sample metadata..."
metadata_df <- read.table(file = metadata_file, sep = "\t", header = TRUE)
colnames(metadata_df)[1] <- 'SampleID'
rownames(metadata_df) <- metadata_df$SampleID
print('First rows of the metadata:')
head(metadata_df)
## [1] "First rows of the metadata:"
## SampleID Experiment Treatment Timepoint Block Plot
## blankM blankM control soil_blank <NA> <NA> <NA>
## H20 H20 control water <NA> <NA> <NA>
## Zymo Zymo control Zymo_mix <NA> <NA> <NA>
## 101-S1 101-S1 HCC T1 S1 R1 P101
## 101-S4 101-S4 HCC T1 S4 R1 P101
## 102-S4 102-S4 HCC T4 S4 R1 P102
## read_ct_ITS7ngs.ITS4ngsuni read_ct_V4.V5 PCR_prod_ITS7ngs.ITS4ngsuni
## blankM 1721 29042 0.038999
## H20 29601 1463 0.036260
## Zymo 103820 93064 1.272282
## 101-S1 127441 140005 0.394874
## 101-S4 133809 180836 0.541573
## 102-S4 102013 126495 0.649498
## PCR_prod_V4.V5
## blankM 0.160996
## H20 0.010000
## Zymo 1.083145
## 101-S1 0.464003
## 101-S4 0.629419
## 102-S4 0.746320
Let’s compare the sample IDs from the metadata with the fastq filenames:
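The code for this comparison is not echoed; a sketch that would produce output like the one below, using the objects defined above:

print('IDs from metadata:')
metadata_df$SampleID
print('Fastq file names:')
head(basename(fastqs_raw_F))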
## [1] "IDs from metadata:"
## [1] "blankM" "H20" "Zymo" "101-S1" "101-S4" "102-S4" "103-S1" "103-S4"
## [9] "104-S1" "104-S4" "201-S4" "202-S1" "202-S4" "203-S1" "203-S4" "204-S1"
## [17] "204-S4" "301-S1" "301-S4" "302-S4" "303-S1" "303-S4" "304-S1" "304-S4"
## [25] "401-S1" "401-S4" "402-S1" "402-S4" "403-S1" "403-S4" "404-S4" "501-S4"
## [33] "501S4" "502-S1" "502-S4" "503-S1" "503-S4" "504-S1" "504-S4" "601-S1"
## [41] "601-S4" "602-S4" "603-S1" "603-S4" "604-S1" "604-S4"
## [1] "Fastq file names:"
## [1] "101-S1-V4-V5_S1_L001_R1_001.fastq.gz"
## [2] "101-S4-V4-V5_S49_L001_R1_001.fastq.gz"
## [3] "102-S4-V4-V5_S50_L001_R1_001.fastq.gz"
## [4] "103-S4-V4-V5_S51_L001_R1_001.fastq.gz"
## [5] "104-S4-V4-V5_S52_L001_R1_001.fastq.gz"
## [6] "201-S4-V4-V5_S53_L001_R1_001.fastq.gz"
To extract the sample IDs from the fastq file names, we remove “-V4-V5” and everything after it from the file names using the sub() function:
# sub() arguments: sub(pattern, replacement, vector)
sampleIDs <- sub("-V4-V5_.*", "", basename(fastqs_raw_F))
We can check whether the IDs from the fastq files and the metadata dataframe are the same:
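The check itself is not echoed; a sketch using identical() and setdiff() on the objects defined earlier:

print('Are the sample IDs from the metadata and the fastq files the same?')
identical(sort(metadata_df$SampleID), sort(sampleIDs))
print('Are any samples missing from the fastq files?')
setdiff(metadata_df$SampleID, sampleIDs)
print('Are any samples missing from the metadata?')
setdiff(sampleIDs, metadata_df$SampleID)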
## [1] "Are the sample IDs from the metadata and the fastq files the same?"
## [1] FALSE
## [1] "Are any samples missing from the fastq files?"
## [1] "103-S1" "104-S1" "501S4"
## [1] "Are any samples missing from the metadata?"
## character(0)
As it turns out, we don’t have sequences for three samples in the metadata.
DADA2 provides a function to plot the average base quality across sequence reads, plotQualityProfile(). You can generate and evaluate plots for each sample, e.g. for the forward and reverse reads side-by-side like so:
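For instance, to plot the first sample's forward and reverse reads in one figure (a minimal example; any sample index will work):

plotQualityProfile(c(fastqs_raw_F[1], fastqs_raw_R[1]))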
This code will generate a pdf file with plots for each sample:
pdf(file.path(outdir, 'error_profiles.pdf')) # Open a pdf file
for (sample_idx in 1:length(fastqs_raw_F)) { # Loop through samples
print(plotQualityProfile( # Print plots into pdf
c(fastqs_raw_F[sample_idx], fastqs_raw_R[sample_idx])) # F and R together
)
}
dev.off() # Close the pdf file
## png
## 2
You can also run the fastqc program on your fastq files for more extensive QC. This is a stand-alone program that is easy to run from the command line. When you have many samples, as is often the case, fastqc's results can moreover be nicely summarized using [multiqc](https://multiqc.info/). In the interest of time, we skipped these steps during this workshop.
We will now perform quality filtering (removing poor-quality reads) and trimming (removing poor-quality bases) on the fastq files using DADA2's filterAndTrim() function.
The filterAndTrim() function will write the filtered and trimmed reads to new fastq files. Therefore, we first define the file names for the new files:
fastqs_filt_F <- file.path(filter_dir, paste0(sampleIDs, '_F_filt.fastq'))
fastqs_filt_R <- file.path(filter_dir, paste0(sampleIDs, '_R_filt.fastq'))
The truncLen argument of filterAndTrim() defines the read lengths (for forward and reverse reads, respectively) beyond which additional bases should be removed; these values should be based on the sequence quality visualized above. The trimming length can thus be different for forward and reverse reads, which is good because reverse reads are often of worse quality.
It is also suggested to trim the first 10 nucleotides of each read (the trimLeft argument), since these positions are likely to contain errors.
maxEE is an important argument that will let DADA2 filter out reads that exceed the maximum number of Expected Errors (EE), computed from the quality scores of the reads' bases.
## [1] "Filtering and Trimming..."
## [1] "2020-12-15 17:37:44 EST"
filter_results <-
filterAndTrim(fastqs_raw_F, fastqs_filt_F,
fastqs_raw_R, fastqs_filt_R,
truncLen = c(250,210),
trimLeft = 10,
maxN = 0,
maxEE = c(2,2),
truncQ = 2,
rm.phix = FALSE,
multithread = n_cores,
compress = FALSE, verbose = TRUE)
print('...Done!')
## [1] "...Done!"
## [1] "2020-12-15 17:37:56 EST"
## reads.in reads.out
## 101-S1-V4-V5_S1_L001_R1_001.fastq.gz 22968 20587
## 101-S4-V4-V5_S49_L001_R1_001.fastq.gz 29552 26664
## 102-S4-V4-V5_S50_L001_R1_001.fastq.gz 20668 18832
## 103-S4-V4-V5_S51_L001_R1_001.fastq.gz 18894 17027
## 104-S4-V4-V5_S52_L001_R1_001.fastq.gz 20526 18400
## 201-S4-V4-V5_S53_L001_R1_001.fastq.gz 15411 13890
Next, we want to “dereplicate” the filtered fastq files. During dereplication, we condense the data by collapsing together all reads that encode the same sequence, which significantly reduces later computation times.
fastqs_derep_F <- derepFastq(fastqs_filt_F, verbose = FALSE)
fastqs_derep_R <- derepFastq(fastqs_filt_R, verbose = FALSE)
names(fastqs_derep_F) <- sampleIDs
names(fastqs_derep_R) <- sampleIDs
The DADA2 algorithm makes use of a parametric error model (err), and every amplicon dataset has a different set of error rates. The learnErrors method learns this error model from the data, by alternating estimation of the error rates and inference of sample composition until they converge on a jointly consistent solution.
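The learnErrors() calls are not echoed above; a minimal sketch (the object names err_F and err_R are assumptions, reused in the sketches further down):

print('Learning errors...')
Sys.time()
err_F <- learnErrors(fastqs_derep_F, multithread = n_cores, verbose = TRUE)
err_R <- learnErrors(fastqs_derep_R, multithread = n_cores, verbose = TRUE)
print('...Done!')
Sys.time()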
## [1] "Learning errors..."
## [1] "2020-12-15 17:39:43 EST"
## 101772240 total bases in 424051 reads from 30 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 ..............................
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## Convergence after 6 rounds.
## 101413800 total bases in 507069 reads from 37 samples will be used for learning the error rates.
## Initializing error rates to maximum possible estimate.
## selfConsist step 1 .....................................
## selfConsist step 2
## selfConsist step 3
## selfConsist step 4
## selfConsist step 5
## selfConsist step 6
## selfConsist step 7
## Convergence after 7 rounds.
## [1] "...Done!"
## [1] "2020-12-15 17:58:15 EST"
We'll plot the errors to verify that the error rates have been reasonably well estimated. Pay attention to the fit between the observed error rates (points) and the fitted error rates (lines):
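A sketch of the plotting code, assuming the error models were stored as err_F and err_R as in the sketch above:

plotErrors(err_F, nominalQ = TRUE)
plotErrors(err_R, nominalQ = TRUE)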
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Transformation introduced infinite values in continuous y-axis
We will now run the core dada algorithm, which infers Amplicon Sequence Variants (ASVs) from the sequences.
This step is quite computationally intensive; for this tutorial, we will therefore perform independent inference for each sample (pool = FALSE), which keeps the computation time down.
Pooling increases computation time, especially if you have many samples, but improves the detection of rare variants that are seen only once or twice in an individual sample yet many times across all samples. Therefore, for your own analysis, you will likely want to use pooling, though “pseudo-pooling” is also an option.
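A sketch of the dada() calls (again assuming err_F and err_R; the result names dada_Fs and dada_Rs match their use later in this script):

print('Inferring ASVs (running the dada algorithm)...')
Sys.time()
dada_Fs <- dada(fastqs_derep_F, err = err_F, pool = FALSE, multithread = n_cores)
dada_Rs <- dada(fastqs_derep_R, err = err_R, pool = FALSE, multithread = n_cores)
print('...Done.')
Sys.time()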
## [1] "Inferring ASVs (running the dada algorithm)..."
## [1] "2020-12-15 17:58:17 EST"
## Sample 1 - 20587 reads in 11912 unique sequences.
## Sample 2 - 26664 reads in 18924 unique sequences.
## Sample 3 - 18832 reads in 14124 unique sequences.
## Sample 4 - 17027 reads in 12938 unique sequences.
## Sample 5 - 18400 reads in 13709 unique sequences.
## Sample 6 - 13890 reads in 10701 unique sequences.
## Sample 7 - 10653 reads in 7754 unique sequences.
## Sample 8 - 15501 reads in 12495 unique sequences.
## Sample 9 - 11463 reads in 7914 unique sequences.
## Sample 10 - 16660 reads in 12549 unique sequences.
## Sample 11 - 6452 reads in 5207 unique sequences.
## Sample 12 - 13391 reads in 9915 unique sequences.
## Sample 13 - 9108 reads in 4464 unique sequences.
## Sample 14 - 14732 reads in 11046 unique sequences.
## Sample 15 - 17704 reads in 13544 unique sequences.
## Sample 16 - 10184 reads in 7074 unique sequences.
## Sample 17 - 16977 reads in 12989 unique sequences.
## Sample 18 - 17038 reads in 11455 unique sequences.
## Sample 19 - 15461 reads in 12148 unique sequences.
## Sample 20 - 13600 reads in 9418 unique sequences.
## Sample 21 - 13785 reads in 10744 unique sequences.
## Sample 22 - 10007 reads in 7397 unique sequences.
## Sample 23 - 13170 reads in 10212 unique sequences.
## Sample 24 - 9354 reads in 7054 unique sequences.
## Sample 25 - 13632 reads in 11245 unique sequences.
## Sample 26 - 11997 reads in 9402 unique sequences.
## Sample 27 - 12762 reads in 7088 unique sequences.
## Sample 28 - 9438 reads in 6332 unique sequences.
## Sample 29 - 13708 reads in 9840 unique sequences.
## Sample 30 - 11874 reads in 7934 unique sequences.
## Sample 31 - 7404 reads in 6148 unique sequences.
## Sample 32 - 11950 reads in 7843 unique sequences.
## Sample 33 - 12215 reads in 9898 unique sequences.
## Sample 34 - 11357 reads in 7668 unique sequences.
## Sample 35 - 12819 reads in 10280 unique sequences.
## Sample 36 - 17115 reads in 13034 unique sequences.
## Sample 37 - 10158 reads in 6924 unique sequences.
## Sample 38 - 15814 reads in 12189 unique sequences.
## Sample 39 - 13279 reads in 8208 unique sequences.
## Sample 40 - 11744 reads in 9143 unique sequences.
## Sample 41 - 4368 reads in 545 unique sequences.
## Sample 42 - 150 reads in 54 unique sequences.
## Sample 43 - 13995 reads in 6218 unique sequences.
## Sample 1 - 20587 reads in 11751 unique sequences.
## Sample 2 - 26664 reads in 17610 unique sequences.
## Sample 3 - 18832 reads in 12994 unique sequences.
## Sample 4 - 17027 reads in 12182 unique sequences.
## Sample 5 - 18400 reads in 12910 unique sequences.
## Sample 6 - 13890 reads in 10029 unique sequences.
## Sample 7 - 10653 reads in 7480 unique sequences.
## Sample 8 - 15501 reads in 11594 unique sequences.
## Sample 9 - 11463 reads in 7622 unique sequences.
## Sample 10 - 16660 reads in 11811 unique sequences.
## Sample 11 - 6452 reads in 4882 unique sequences.
## Sample 12 - 13391 reads in 9454 unique sequences.
## Sample 13 - 9108 reads in 4426 unique sequences.
## Sample 14 - 14732 reads in 10156 unique sequences.
## Sample 15 - 17704 reads in 12526 unique sequences.
## Sample 16 - 10184 reads in 6987 unique sequences.
## Sample 17 - 16977 reads in 12281 unique sequences.
## Sample 18 - 17038 reads in 11144 unique sequences.
## Sample 19 - 15461 reads in 11403 unique sequences.
## Sample 20 - 13600 reads in 9092 unique sequences.
## Sample 21 - 13785 reads in 10225 unique sequences.
## Sample 22 - 10007 reads in 7311 unique sequences.
## Sample 23 - 13170 reads in 9521 unique sequences.
## Sample 24 - 9354 reads in 6909 unique sequences.
## Sample 25 - 13632 reads in 10603 unique sequences.
## Sample 26 - 11997 reads in 8846 unique sequences.
## Sample 27 - 12762 reads in 6770 unique sequences.
## Sample 28 - 9438 reads in 6366 unique sequences.
## Sample 29 - 13708 reads in 9332 unique sequences.
## Sample 30 - 11874 reads in 7681 unique sequences.
## Sample 31 - 7404 reads in 5879 unique sequences.
## Sample 32 - 11950 reads in 7686 unique sequences.
## Sample 33 - 12215 reads in 9301 unique sequences.
## Sample 34 - 11357 reads in 7563 unique sequences.
## Sample 35 - 12819 reads in 9578 unique sequences.
## Sample 36 - 17115 reads in 12018 unique sequences.
## Sample 37 - 10158 reads in 6869 unique sequences.
## Sample 38 - 15814 reads in 11416 unique sequences.
## Sample 39 - 13279 reads in 8221 unique sequences.
## Sample 40 - 11744 reads in 8657 unique sequences.
## Sample 41 - 4368 reads in 514 unique sequences.
## Sample 42 - 150 reads in 49 unique sequences.
## Sample 43 - 13995 reads in 5068 unique sequences.
## [1] "...Done."
## [1] "2020-12-15 18:02:00 EST"
Let’s inspect one of the resulting objects:
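For example, the forward-read results for the first sample (dada_Fs is the object name used later in this script):

dada_Fs[[1]]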
## dada-class: object describing DADA2 denoising results
## 804 sequence variants were inferred from 11912 input unique sequences.
## Key parameters: OMEGA_A = 1e-40, OMEGA_C = 1e-40, BAND_SIZE = 16
In this step, we will first merge the forward and reverse read pairs: the fragment that we amplified with our primers was short enough to generate a lot of overlap between the sequences from the two directions.
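The merging code is not echoed; a minimal sketch using DADA2's mergePairs() on the denoised and dereplicated objects (the name mergers matches its use later in this script):

mergers <- mergePairs(dada_Fs, fastqs_derep_F,
                      dada_Rs, fastqs_derep_R,
                      verbose = TRUE)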
## 15174 paired-reads (in 657 unique pairings) successfully merged out of 17170 (in 1182 pairings) input.
## 18822 paired-reads (in 617 unique pairings) successfully merged out of 21702 (in 1334 pairings) input.
## 12203 paired-reads (in 438 unique pairings) successfully merged out of 14599 (in 990 pairings) input.
## 11228 paired-reads (in 349 unique pairings) successfully merged out of 13089 (in 804 pairings) input.
## 11977 paired-reads (in 415 unique pairings) successfully merged out of 14228 (in 939 pairings) input.
## 8733 paired-reads (in 302 unique pairings) successfully merged out of 10403 (in 652 pairings) input.
## 6067 paired-reads (in 292 unique pairings) successfully merged out of 7911 (in 594 pairings) input.
## 9224 paired-reads (in 314 unique pairings) successfully merged out of 11520 (in 731 pairings) input.
## 7866 paired-reads (in 371 unique pairings) successfully merged out of 9112 (in 660 pairings) input.
## 10839 paired-reads (in 422 unique pairings) successfully merged out of 12749 (in 869 pairings) input.
## 3045 paired-reads (in 155 unique pairings) successfully merged out of 4265 (in 306 pairings) input.
## 8843 paired-reads (in 347 unique pairings) successfully merged out of 10120 (in 695 pairings) input.
## 6519 paired-reads (in 181 unique pairings) successfully merged out of 7587 (in 347 pairings) input.
## 9875 paired-reads (in 322 unique pairings) successfully merged out of 11766 (in 717 pairings) input.
## 11080 paired-reads (in 418 unique pairings) successfully merged out of 13585 (in 960 pairings) input.
## 6362 paired-reads (in 364 unique pairings) successfully merged out of 7619 (in 637 pairings) input.
## 10856 paired-reads (in 381 unique pairings) successfully merged out of 13017 (in 906 pairings) input.
## 11444 paired-reads (in 494 unique pairings) successfully merged out of 13383 (in 987 pairings) input.
## 9218 paired-reads (in 317 unique pairings) successfully merged out of 11473 (in 764 pairings) input.
## 9098 paired-reads (in 383 unique pairings) successfully merged out of 10735 (in 722 pairings) input.
## 8734 paired-reads (in 329 unique pairings) successfully merged out of 10264 (in 685 pairings) input.
## 6138 paired-reads (in 287 unique pairings) successfully merged out of 7375 (in 541 pairings) input.
## 7976 paired-reads (in 320 unique pairings) successfully merged out of 9760 (in 707 pairings) input.
## 5425 paired-reads (in 259 unique pairings) successfully merged out of 6807 (in 502 pairings) input.
## 7460 paired-reads (in 268 unique pairings) successfully merged out of 9643 (in 635 pairings) input.
## 7248 paired-reads (in 269 unique pairings) successfully merged out of 8848 (in 617 pairings) input.
## 9421 paired-reads (in 472 unique pairings) successfully merged out of 10506 (in 775 pairings) input.
## 6196 paired-reads (in 313 unique pairings) successfully merged out of 7375 (in 574 pairings) input.
## 8947 paired-reads (in 327 unique pairings) successfully merged out of 10610 (in 729 pairings) input.
## 7764 paired-reads (in 394 unique pairings) successfully merged out of 9293 (in 693 pairings) input.
## 3899 paired-reads (in 162 unique pairings) successfully merged out of 4966 (in 354 pairings) input.
## 8331 paired-reads (in 403 unique pairings) successfully merged out of 9554 (in 689 pairings) input.
## 6657 paired-reads (in 264 unique pairings) successfully merged out of 8738 (in 645 pairings) input.
## 7524 paired-reads (in 404 unique pairings) successfully merged out of 8689 (in 688 pairings) input.
## 7419 paired-reads (in 276 unique pairings) successfully merged out of 9371 (in 649 pairings) input.
## 10806 paired-reads (in 372 unique pairings) successfully merged out of 13282 (in 895 pairings) input.
## 6429 paired-reads (in 334 unique pairings) successfully merged out of 7670 (in 600 pairings) input.
## 9668 paired-reads (in 343 unique pairings) successfully merged out of 12138 (in 826 pairings) input.
## 9358 paired-reads (in 450 unique pairings) successfully merged out of 10747 (in 782 pairings) input.
## 6864 paired-reads (in 271 unique pairings) successfully merged out of 8791 (in 651 pairings) input.
## 4257 paired-reads (in 29 unique pairings) successfully merged out of 4342 (in 34 pairings) input.
## 142 paired-reads (in 23 unique pairings) successfully merged out of 142 (in 23 pairings) input.
## 11423 paired-reads (in 187 unique pairings) successfully merged out of 13340 (in 890 pairings) input.
Just like tables can be saved in R using write.table or write.csv, R objects can be saved using saveRDS. The resulting rds file can then be loaded into an R environment using readRDS. This is a convenient way to save R objects that require a lot of computation time.
We should not need the very large dereplicated sequence objects anymore, but to be able to quickly restart the analysis from a new R session if necessary, we will now save these objects to rds files. After that, we can safely remove them from our environment.
saveRDS(fastqs_derep_F, file = file.path(outdir, 'fastqs_derep_F.rds'))
saveRDS(fastqs_derep_R, file = file.path(outdir, 'fastqs_derep_R.rds'))
rm(fastqs_derep_F, fastqs_derep_R) # Remove objects from environment
Next, we construct an amplicon sequence variant (ASV) table:
seqtab_all <- makeSequenceTable(mergers)
# The dimensions of the object are the nr of samples (rows) and the nr of ASVs (columns):
dim(seqtab_all)
## [1] 43 4076
Let’s inspect the distribution of sequence lengths:
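A sketch of the code that produces the table below, using DADA2's getSequences() helper:

table(nchar(getSequences(seqtab_all)))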
##
## 240 241 251 252 253 254 259 262 264 268 283
## 4048 1 7 2 10 1 1 1 1 3 1
# If you need to remove sequences of a particular length (e.g. too long):
# seqtab2 <- seqtab_all[, nchar(colnames(seqtab_all)) %in% seq(250, 256)]
Now, we will remove chimeras. The dada algorithm models and removes substitution errors, but chimeras are another important source of spurious sequences in amplicon sequencing. Chimeras are formed during PCR amplification: when one sequence is incompletely amplified, the incomplete amplicon primes the next amplification step, yielding a spurious amplicon. The result is a sequence read that is half of one sample sequence and half of another.
Fortunately, the accuracy of the sequence variants after denoising makes identifying chimeras simpler than it is when dealing with fuzzy OTUs. Chimeric sequences are identified if they can be exactly reconstructed by combining a left-segment and a right-segment from two more abundant “parent” sequences.
seqtab <- removeBimeraDenovo(seqtab_all,
method = 'consensus',
multithread = n_cores,
verbose = TRUE)
## Identified 199 bimeras out of 4076 input sequences.
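The two numbers below could be produced like this (a sketch: the number of remaining ASVs and the proportion of reads retained):

# Number of ASVs left after chimera removal:
ncol(seqtab)
# Proportion of reads retained after chimera removal:
sum(seqtab) / sum(seqtab_all)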
## [1] 3877
## [1] 0.9767404
We will save the seqtab object as an rds file:
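A minimal sketch; the file name seqtab_V4.rds matches the name referenced in the bonus tree section below, but writing it to outdir is an assumption:

saveRDS(seqtab, file = file.path(outdir, 'seqtab_V4.rds'))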
In this step, we will generate a summary table of the number of sequences processed and outputs of different steps of the pipeline.
This information is generally used to further evaluate characteristics and quality of the run, sample-to-sample variation, and resulting sequencing depth for each sample.
To get started, we will define a function getN() that returns the number of reads for a sample (summed over its unique sequences). Then, we apply getN() to each element of the dada_Fs, dada_Rs, and mergers objects, which gives us vectors with the number of reads for each sample at each of these steps:
getN <- function(x) {
sum(getUniques(x))
}
denoised_F <- sapply(dada_Fs, getN)
denoised_R <- sapply(dada_Rs, getN)
merged <- sapply(mergers, getN)
We’ll join these vectors together with the “filter_results” dataframe, and the number of nonchimeric reads:
nreads_summary <- data.frame(filter_results,
denoised_F,
denoised_R,
nonchim = rowSums(seqtab),
row.names = sampleIDs)
colnames(nreads_summary)[1:2] <- c('input', 'filtered')
# Have a look at the first few rows:
head(nreads_summary)
## input filtered denoised_F denoised_R nonchim
## 101-S1 22968 20587 17960 18234 15120
## 101-S4 29552 26664 22429 23511 18796
## 102-S4 20668 18832 15243 16034 12194
## 103-S4 18894 17027 13660 14402 11184
## 104-S4 20526 18400 14780 15689 11958
## 201-S4 15411 13890 10862 11564 8717
Finally, we’ll write this table to file:
write.table(nreads_summary, file = file.path(outdir, 'nreads_summary.txt'),
sep = "\t", quote = FALSE, row.names = TRUE)
Now, we will assign taxonomy to our ASVs.
Depending on the marker gene and the data, you will have to choose the appropriate reference file for this step. Several files have been formatted for taxonomy assignment in the DADA2 pipeline and are available at the DADA2 website.
## [1] "Assigning taxa to ASVs..."
## [1] "2020-12-15 18:03:31 EST"
taxa <- assignTaxonomy(seqtab, tax_key, multithread = n_cores)
colnames(taxa) <- c('Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus')
print('...Done.')
## [1] "...Done."
## [1] "2020-12-15 18:09:54 EST"
In this last step, we will generate output files from the DADA2 outputs that are formatted for downstream analysis in phyloseq. First, we will write a fasta file with the final ASV sequences. (This fasta file can also be used for phylogenetic tree inference with different R packages.)
# Prepare sequences and headers:
asv_seqs <- colnames(seqtab)
asv_headers <- paste('>ASV', 1:ncol(seqtab), sep = '_')
# Interleave headers and sequences:
asv_fasta <- c(rbind(asv_headers, asv_seqs))
# Write fasta file:
write(asv_fasta, file = file.path(outdir, 'ASVs.fa'))
Now, we build the final phyloseq object. Notes:
While we will not add a phylogenetic tree now, a tree can also be added to a phyloseq object.
Our metadata dataframe contains three samples that we don’t have sequences for. However, this is not a problem: phyloseq will match the sample IDs in the metadata with those in the OTU table, and disregard IDs not present in the OTU table.
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
sample_data(metadata_df),
tax_table(taxa))
# Saves the phyloseq object as an .rds file (which can be imported directly by phyloseq):
saveRDS(ps, file = file.path(outdir, 'ps_V4.rds'))
Report that we are done!
## [1] "Done with ASV inference."
- dada2 documentation and tutorials
- cutadapt documentation and tutorials

A phylogenetic tree can be estimated for the sequence data you generated. Depending on the number of ASVs recovered and the phylogenetic tree algorithm of choice, this step could take several days. Simpler trees will be less computationally intensive. Depending on the marker gene you are working on, you may or may not choose to perform this step.
This step can be conducted after Step 10, and then the phylogeny can be included in the phyloseq object in Step 11.
#seqtab<- readRDS('seqtab_V4.rds')
seqs <- getSequences(seqtab)
# At this stage, the ASV labels are the full ASV sequences;
# this propagates to the tip labels of the tree:
names(seqs) <- seqs
alignment <- AlignSeqs(DNAStringSet(seqs),
anchor = NA,
iterations = 5,
refinements = 5)
print('Computing pairwise distances from ASVs...')
Sys.time()
phang.align <- phyDat(as(alignment, 'matrix'), type = 'DNA')
dm <- dist.ml(phang.align)
treeNJ <- NJ(dm) # Note, tip order is not sequence order
fit <- pml(treeNJ, data = phang.align)
print('...Done.')
Sys.time()
print('Fit GTR model...')
Sys.time()
fitGTR <- update(fit, k = 4, inv = 0.2)
print('...Done.')
Sys.time()
print('Computing likelihood of tree...')
Sys.time()
fitGTR <- optim.pml(fitGTR,
model = 'GTR',
optInv = TRUE,
optGamma = TRUE,
rearrangement = 'stochastic',
control = pml.control(trace = 0))
print('...Done.')
Sys.time()
Extract R code from this document:
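A sketch using knitr::purl(); the input and output paths are assumptions based on file names used elsewhere in this document:

knitr::purl(input = 'markdown/07-ASV-inference.Rmd',
            output = 'scripts/02-ASV-inference.R')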
Our script 02-ASV-inference.sh, with SLURM directives, will run the R script from the shell using Rscript:
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --time=2:00:00
#SBATCH --account=PAS0471
module load gnu/9.1.0
module load mkl/2019.0.5
module load R/4.0.2
Rscript scripts/02-ASV-inference.R
Submit the script:
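For example, from a shell at OSC (the path to the submission script is an assumption):

sbatch scripts/02-ASV-inference.sh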