Genome assembly and assembly QC
Introduction
We will now create a genome assembly from the preprocessed reads with the program SPAdes.
After the assembly, we will run several assembly QC and filtering steps to:
- Get some basic assembly summary stats with BBTools
- Run Busco to check for genome completeness
- Identify and remove contaminant contigs with Kraken2
Setting up
You should have an active VS Code session with an open terminal. In that terminal, you should be in your dir /fs/scratch/PAS2250/cabana/$USER/bact/bact.
1 The FASTA format
The genome assembly that we will create will be in the FASTA format. FASTA files contain one or more DNA or amino acid sequences, with no limits on the number of sequences or the sequence lengths.
The following example FASTA file contains two entries:
>unique_sequence_ID Optional description
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAAAA
>unique_sequence_ID2
ATTCATTAAAGCAGTTTATTGGCTTAATGTACATCAGTGAAATCATAAATGCTAAATG
Each entry consists of a header line and the sequence itself. Header lines start with a > (greater-than sign) and are otherwise “free form”, though the idea is that they provide an identifier for the sequence that follows.1
Common FASTA file extensions: .fa, .fasta, .fna, .faa (Click to expand)
- “Generic” extensions are .fasta and .fa (e.g. my_assembly.fasta)
- Also used are extensions that explicitly indicate whether sequences are nucleotides (.fna) or amino acids (.faa)
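To make the format concrete, here is a small sketch (using the hypothetical file my_assembly.fasta) that prints the ID and total length of each FASTA entry, correctly handling sequences that span multiple lines:

# Print the ID and total length of each FASTA entry (handles multi-line sequences)
awk '/^>/ {if (seq != "") print id, length(seq); id = $1; seq = ""}
     !/^>/ {seq = seq $0}
     END {if (seq != "") print id, length(seq)}' my_assembly.fasta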
2 “Batch jobs” with sbatch
So far, we have been running programs by directly typing/pasting the commands in the terminal. Because SPAdes needs more computing resources, we will run it differently.
We will submit it as a so-called “batch” (non-interactive) job using the sbatch command. This job will then be run on a compute node that we ourselves never move to (!).
To first see a simple example, say that we just wanted to run echo Hello there $USER as a batch job, where we’ll use these options:
- -A for the OSC Project we want to use
- -t for the time in minutes that we need
- -c for the number of cores that we need
- -o for the name of the log file (where “Hello there” will be printed)
- The command that we want to run in the batch job is wrapped in --wrap="<command>"
sbatch -A PAS2250 -t 1 -c 1 -o slurm-hello.out \
--wrap="echo Hello there $USER"
Submitted batch job 25928455
# [You will get a different number, each job has a unique ID]
Now, the output of our command (echo Hello there $USER) is not printed to the screen, but will end up in a file in our working directory: the log file that we specified with -o. (Without -o, Slurm would name the log file slurm-<jobID>.out, i.e. starting with slurm and containing the job ID number.)
ls
data README.md results slurm-hello.out
Note that the Slurm log file will only show up once the job has started running, which can take up to a minute or so.
Let’s take a look at that “Slurm log file”:
cat slurm-hello.out
Hello there jelmer
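Tip: while a job is pending or running, you can check on it with Slurm’s squeue command (a small sketch; the job ID below is just the example one from above):

# List your own pending and running jobs (ST column: PD = pending, R = running)
squeue -u $USER
# If needed, cancel a job using its ID:
# scancel 25928455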
3 Assembly with SPAdes
SPAdes is a well-performing and very flexible assembler that can be used to do many kinds of assemblies, including metagenomes and transcriptomes. It has a special “mode” for bacterial isolate assembly, which can be activated with the --isolate flag.
3.1 Our SPAdes command
We will run SPAdes with the following options:
- -1 and -2 for the R1 and R2 FASTQ files
- -o for the output dir, which should be sample-specific, and should not yet exist
- --isolate to activate the bacterial isolate mode
- -k 21,33,55,77,99,127 to assemble with a variety of k-mer sizes
- --threads 20 and --memory 80 to use 20 threads and 80 GB of memory
# (Don't run this yet)
spades.py \
-1 results/trimgalore/SM04_R1_val_1.fq.gz \
-2 results/trimgalore/SM04_R2_val_2.fq.gz \
-o results/spades/SM04 \
-k 21,33,55,77,99,127 \
--isolate \
--threads 20 \
--memory 80
3.2 Running SPAdes
First, we’ll have to switch back to the cabana Conda environment, since we activated a different Conda environment for TrimGalore earlier.
source activate /fs/ess/PAS0471/jelmer/conda/cabana
Submitting the SPAdes sbatch job
Now we’re ready to submit our SPAdes job, with 30 minutes (-t 30) and 20 cores (-c 20):
sbatch -A PAS2250 -t 30 -c 20 -o slurm-spades.out --wrap="
spades.py \
-1 results/trimgalore/SM04_R1_val_1.fq.gz \
-2 results/trimgalore/SM04_R2_val_2.fq.gz \
-o results/spades/SM04 \
-k 21,33,55,77,99,127 \
--isolate \
--threads 20 \
--memory 80
"
Exercise: Monitor the SPAdes run
- Use less to check the slurm-spades.out file, which will have the SPAdes “log”. In less, press G (capital G!) to jump to the end of the file.
- When you first check the slurm-spades.out file, SPAdes should still be running. You will know it is done when you see the following line at the end:
Thank you for using SPAdes!
- To monitor the progress of the SPAdes run, you can also use tail -f slurm-spades.out, which will “follow” the file and add any new text in real time! To exit this, press Ctrl+C.
3.3 SPAdes output files
Let’s check the files in the output dir:
ls -lh results/spades/SM04
Click to show the output
total 47M
-rw-r--r-- 1 jelmer PAS0471 5.8M Feb 4 21:49 assembly_graph_after_simplification.gfa
-rw-r--r-- 1 jelmer PAS0471 12M Feb 4 21:49 assembly_graph.fastg
-rw-r--r-- 1 jelmer PAS0471 5.8M Feb 4 21:49 assembly_graph_with_scaffolds.gfa
-rw-r--r-- 1 jelmer PAS0471 5.9M Feb 4 21:49 before_rr.fasta
-rw-r--r-- 1 jelmer PAS0471 5.8M Feb 4 21:49 contigs.fasta
-rw-r--r-- 1 jelmer PAS0471 9.8K Feb 4 21:49 contigs.paths
-rw-r--r-- 1 jelmer PAS0471 79 Feb 4 21:42 dataset.info
-rw-r--r-- 1 jelmer PAS0471 278 Feb 4 21:42 input_dataset.yaml
drwxr-xr-x 4 jelmer PAS0471 4.0K Feb 4 21:49 K127
drwxr-xr-x 4 jelmer PAS0471 4.0K Feb 4 21:43 K21
drwxr-xr-x 4 jelmer PAS0471 4.0K Feb 4 21:44 K33
drwxr-xr-x 4 jelmer PAS0471 4.0K Feb 4 21:45 K55
drwxr-xr-x 4 jelmer PAS0471 4.0K Feb 4 21:47 K77
drwxr-xr-x 4 jelmer PAS0471 4.0K Feb 4 21:48 K99
drwxr-xr-x 2 jelmer PAS0471 4.0K Feb 4 21:49 logs
drwxr-xr-x 2 jelmer PAS0471 4.0K Feb 4 21:49 misc
-rw-r--r-- 1 jelmer PAS0471 1.5K Feb 4 21:42 params.txt
drwxr-xr-x 2 jelmer PAS0471 4.0K Feb 4 21:49 pipeline_state
-rw-r--r-- 1 jelmer PAS0471 3.4K Feb 4 21:42 run_spades.sh
-rw-r--r-- 1 jelmer PAS0471 4.9K Feb 4 21:42 run_spades.yaml
-rw-r--r-- 1 jelmer PAS0471 5.8M Feb 4 21:49 scaffolds.fasta
-rw-r--r-- 1 jelmer PAS0471 8.9K Feb 4 21:49 scaffolds.paths
-rw-r--r-- 1 jelmer PAS0471 5.8M Feb 4 21:49 SM04
-rw-r--r-- 1 jelmer PAS0471 194K Feb 4 21:49 spades.log
There are quite a few files, as well as subdirs for the different k-mer sizes, but we’re really only interested in the assembly FASTA file, which is contigs.fasta. Let’s take a look at that file:
less results/spades/SM04/contigs.fasta
>NODE_1_length_1267796_cov_33.239498
ACCTTGAGTTCCCTAAAGGGCCGTCGAAGACTACGACGTTGATAGGTTGGGTGTGTAAGC
GCTGTGAGGCGTTGAGCTAACCAATACTAATTGCCCGTGAGGCTTGACCATATAACACCC
AAGCAATTTGCGTTGAATGAGCAGATTGCGGTGACTGTGAAGATGACACGAACCGAAAGT
TTGCGTCACGAACGACACCTGAACCAGCTTGCTATCACATACCCGATTTGCTGAAGCGCG
CCGCAAGGCACGATTCGGTACCCGAATTTCTTGACGACCATAGAGCATTGGAACCACCTG
ATCCCATCCCGAACTCAGTAGTGAAACGATGTATCGCCGATGGTAGTGTGGGGTTTCCCC
ATGTGAGAGTAGGTCATCGTCAAGATTAAATTCCAGAAACCCTCATCGCTTACGCGTTGA
GGGTTTTTGTTTGTCTGGGGTTCCAGAAACCTCTGCATTCTCTATCTGGCTCATCTCATT
GCAATGCAGCCGCATTGGCGCCAGAGACCCCCAAGGTTTAGTGAAACGCCCCCATCCCTG
In this file, each contig is one FASTA entry. The contig headers have some metadata, such as the contig’s length in base pairs and its depth of coverage (cov_; i.e. how many reads, on average, cover each base).
We can see a few more headers by using the grep command, which will print lines matching a search pattern (in our case, the > from the header), as follows:
# (Do NOT omit the quotes around the ">"!)
grep ">" results/spades/SM04/contigs.fasta | head
>NODE_1_length_1267796_cov_33.239498
>NODE_2_length_902255_cov_32.000245
>NODE_3_length_697265_cov_34.901625
>NODE_4_length_534491_cov_32.088021
>NODE_5_length_350317_cov_33.463137
>NODE_6_length_339735_cov_31.812540
>NODE_7_length_291220_cov_35.951730
>NODE_8_length_274792_cov_32.455031
>NODE_9_length_167931_cov_33.795917
>NODE_10_length_164349_cov_34.581646
If you don’t pipe (|) the grep output to head, you will see all headers.
If we use grep’s -c option, it will count the number of matching lines, which will give us a count of the number of contigs:
grep -c ">" results/spades/SM04/contigs.fasta
86
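Since the headers encode both the length and the depth of coverage, you can also pull those numbers out directly; a minimal sketch that splits the headers on underscores with awk:

# Split headers on "_" to extract contig number, length, and coverage
grep ">" results/spades/SM04/contigs.fasta |
    awk -F'_' '{print "contig " $2 ": " $4 " bp, " $6 "x coverage"}' |
    head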
4 Basic assembly stats
We can use the tool stats.sh from the BBTools suite of genomics tools to get some (more) basic statistics for our genome assembly:
stats.sh results/spades/SM04/contigs.fasta
A C G T N IUPAC Other GC GC_stdev
0.2048 0.2969 0.2945 0.2038 0.0000 0.0000 0.0000 0.5914 0.0875
Main genome scaffold total: 86
Main genome contig total: 86
Main genome scaffold sequence total: 5.968 MB
Main genome contig sequence total: 5.968 MB 0.000% gap
Main genome scaffold N/L50: 4/534.491 KB
Main genome contig N/L50: 4/534.491 KB
Main genome scaffold N/L90: 14/97.943 KB
Main genome contig N/L90: 14/97.943 KB
Max scaffold length: 1.268 MB
Max contig length: 1.268 MB
Number of scaffolds > 50 KB: 18
% main genome in scaffolds > 50 KB: 96.64%
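A note on interpreting the N/L50 lines: “N/L50: 4/534.491 KB” means that the 4 largest contigs together contain at least half of the assembly, and that the shortest of those is ~534 kb (beware that stats.sh labels these the opposite way from the common convention, in which “N50” refers to the length). As a sanity check, here is a sketch that re-derives these numbers from the length_ field in the SPAdes contig headers:

# Re-derive N50/L50 from the contig lengths in the SPAdes headers
grep ">" results/spades/SM04/contigs.fasta |
    sed -E 's/.*length_([0-9]+)_.*/\1/' |
    sort -rn |
    awk '{sum += $1; len[NR] = $1}
         END {half = sum / 2; run = 0
              for (i = 1; i <= NR; i++) {
                  run += len[i]
                  if (run >= half) {print "N50:", i, "  L50:", len[i]; exit}
              }}'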
5 Assembly completeness check with Busco
We will use the program Busco (Manni et al. 2021, documentation) to check how complete our genome assembly is. Busco checks for the presence of genes that are expected to be universally present in a single copy in a specific taxonomic lineage.
Another commonly used program to check assembly completeness and also contamination is CheckM (Parks et al. 2015, documentation), but in the interest of time, we will only run Busco.
Because Busco will always output a number of files in your working directory, we will move into our desired output dir in advance:
cd results/busco
We will also have to load a different Conda environment, again:
source activate /fs/ess/PAS0471/jelmer/conda/busco
Using the --lineage_dataset option, we have to tell Busco which lineage’s reference database it should use: there is a list of lineages on Busco’s website.
Exercise: Busco database
Take a look at the website linked to above. Which lineage dataset should we pick?
Click for the solution
There is a database for the order that Pseudomonas is in: pseudomonadales. Alternatively, we could use the database for all bacteria: bacteria.
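You can also list the available lineage datasets from the command line (with the busco Conda environment active):

# Print all lineage datasets that Busco can download
busco --list-datasets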
Otherwise, we will use the following options:
- --in — input file
- --out — output ID (not the full filename)
- --mode genome — our input file is a genome assembly, not a transcriptome assembly or proteome
# Run Busco
busco \
--in ../spades/SM04/contigs.fasta \
--out SM04 \
--lineage_dataset pseudomonadales \
--mode genome
# (Only showing the key results bit of the output)
---------------------------------------------------
|Results from dataset pseudomonadales_odb10 |
---------------------------------------------------
|C:99.6%[S:99.6%,D:0.0%],F:0.1%,M:0.3%,n:782 |
|779 Complete BUSCOs (C) |
|779 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|1 Fragmented BUSCOs (F) |
|2 Missing BUSCOs (M) |
|782 Total BUSCO groups searched |
---------------------------------------------------
Click to see the full expected Busco output
2024-02-05 17:31:00 INFO: ***** Start a BUSCO v5.5.0 analysis, current time: 02/05/2024 17:31:00 *****
2024-02-05 17:31:00 INFO: Configuring BUSCO with local environment
2024-02-05 17:31:00 INFO: Mode is genome
2024-02-05 17:31:00 INFO: Downloading information on latest versions of BUSCO data...
2024-02-05 17:31:03 INFO: Input file is /fs/scratch/PAS2250/cabana/jelmer_prep/results/spades/SM04/contigs.fasta
2024-02-05 17:31:03 INFO: Downloading file 'https://busco-data.ezlab.org/v5/data/lineages/pseudomonadales_odb10.2024-01-08.tar.gz'
2024-02-05 17:31:06 INFO: Decompressing file '/fs/scratch/PAS2250/cabana/jelmer_prep/busco_downloads/lineages/pseudomonadales_odb10.tar.gz'
2024-02-05 17:31:24 INFO: Running BUSCO using lineage dataset pseudomonadales_odb10 (prokaryota, 2024-01-08)
2024-02-05 17:31:24 INFO: Running 1 job(s) on bbtools, starting at 02/05/2024 17:31:24
2024-02-05 17:31:26 INFO: [bbtools] 1 of 1 task(s) completed
2024-02-05 17:31:26 INFO: ***** Run Prodigal on input to predict and extract genes *****
2024-02-05 17:31:26 INFO: Running Prodigal with genetic code 11 in single mode
2024-02-05 17:31:26 INFO: Running 1 job(s) on prodigal, starting at 02/05/2024 17:31:26
2024-02-05 17:31:44 INFO: [prodigal] 1 of 1 task(s) completed
2024-02-05 17:31:45 INFO: Genetic code 11 selected as optimal
2024-02-05 17:31:45 INFO: ***** Run HMMER on gene sequences *****
2024-02-05 17:31:45 INFO: Running 782 job(s) on hmmsearch, starting at 02/05/2024 17:31:45
2024-02-05 17:31:55 INFO: [hmmsearch] 79 of 782 task(s) completed
2024-02-05 17:32:04 INFO: [hmmsearch] 157 of 782 task(s) completed
2024-02-05 17:32:13 INFO: [hmmsearch] 235 of 782 task(s) completed
2024-02-05 17:32:21 INFO: [hmmsearch] 313 of 782 task(s) completed
2024-02-05 17:32:28 INFO: [hmmsearch] 392 of 782 task(s) completed
2024-02-05 17:32:34 INFO: [hmmsearch] 470 of 782 task(s) completed
2024-02-05 17:32:42 INFO: [hmmsearch] 548 of 782 task(s) completed
2024-02-05 17:32:47 INFO: [hmmsearch] 626 of 782 task(s) completed
2024-02-05 17:32:53 INFO: [hmmsearch] 704 of 782 task(s) completed
2024-02-05 17:33:01 INFO: [hmmsearch] 782 of 782 task(s) completed
2024-02-05 17:33:19 INFO: Results: C:99.6%[S:99.6%,D:0.0%],F:0.1%,M:0.3%,n:782
2024-02-05 17:33:21 INFO:
---------------------------------------------------
|Results from dataset pseudomonadales_odb10 |
---------------------------------------------------
|C:99.6%[S:99.6%,D:0.0%],F:0.1%,M:0.3%,n:782 |
|779 Complete BUSCOs (C) |
|779 Complete and single-copy BUSCOs (S) |
|0 Complete and duplicated BUSCOs (D) |
|1 Fragmented BUSCOs (F) |
|2 Missing BUSCOs (M) |
|782 Total BUSCO groups searched |
---------------------------------------------------
2024-02-05 17:33:21 INFO: BUSCO analysis done. Total running time: 138 seconds
2024-02-05 17:33:21 INFO: Results written in /fs/scratch/PAS2250/cabana/jelmer_prep/SM04
2024-02-05 17:33:21 INFO: For assistance with interpreting the results, please consult the userguide: https://busco.ezlab.org/busco_userguide.html
2024-02-05 17:33:21 INFO: Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO
That is looking pretty good: 99.6% (n=779) of the expected genes are present complete and as a single copy. Only 0.1% (n=1) of genes are fragmented and 0.3% (n=2) are missing.
Finally, we should move back to our main project dir and reload the main Conda environment:
cd ../..
source activate /fs/ess/PAS0471/jelmer/conda/cabana
6 Bonus: Assembly filtering with Kraken2
You’ll probably want to filter your assembly, based on:
- Minimum contig size
- Minimum contig depth of coverage
- Inferred contamination from other organisms
Here, we will perform the third and most complex of these.
200 bp is the minimum contig size when you upload a genome assembly to NCBI. But you may want to be more stringent, e.g. using a 300 or 500 bp threshold.
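For the minimum-size filter, a minimal sketch with seqkit is shown below (assuming seqkit is available in your environment; the 500 bp threshold and the output file name are just examples):

# Keep only contigs of at least 500 bp (seqkit's -m / --min-len option)
seqkit seq -m 500 results/spades/SM04/contigs.fasta \
    > results/spades/SM04/contigs_min500.fasta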
6.1 Kraken2
To identify contaminant contigs, we will run the program Kraken2 (Wood et al. 2019, manual). This is a general purpose lowest-common ancestor (LCA) taxonomic classifier of sequences (can be reads, contigs, etc).
(It is also possible to run Kraken2 or a similar program on the reads rather than on the assembly, but this can be more error-prone due to sequencing errors in the reads and their shorter lengths compared to (most) contigs.)
Kraken requires a reference database. Ready-made databases can be downloaded from this site — in this case, I already downloaded one for you, which we can use.
Check that you are back in your bact dir, and have the cabana Conda environment active:
pwd
# Should be:
/fs/scratch/PAS2250/cabana/$USER/bact/bact
We will start by creating an output dir for Kraken:
# (If we don't create the output dir, Kraken will not produce output files!)
mkdir results/kraken
6.2 Running Kraken
We’ll run Kraken2 with the following options:
- --db /fs/scratch/PAS2250/cabana/databases/kraken_std — the database for Kraken2
- --minimum-hit-groups 3 (following recommendations from Lu et al. 2022)
- --confidence 0.15 (following recommendations from Wright et al. 2023)
- --report and --output to indicate where the output files should go
- --threads 20 to use 20 threads
- And finally, the input file (our assembly) is passed as an argument at the end of the command
# Run Kraken2 -- like with SPAdes, we will run it as a batch job
sbatch -A PAS2250 -t 5 -c 20 -o slurm-kraken.out --wrap="
kraken2 \
--db /fs/scratch/PAS2250/cabana/databases/kraken_std \
--minimum-hit-groups 3 \
--confidence 0.15 \
--threads 20 \
--report results/kraken/SM04_report.txt \
--output results/kraken/SM04_main.txt \
results/spades/SM04/contigs.fasta
"
Once it’s done (this should only take 1-2 minutes), the Slurm log file should contain the following:
cat slurm-kraken.out
Loading database information... done.
86 sequences (5.97 Mbp) processed in 0.619s (8.3 Kseq/m, 578.49 Mbp/m).
84 sequences classified (97.67%)
2 sequences unclassified (2.33%)
6.3 Interpreting the Kraken output
Let’s take a look at the output files:
ls -lh results/kraken
-rw-r--r-- 1 jelmer PAS0471 3.0M Feb 5 15:41 SM04_main.txt
-rw-r--r-- 1 jelmer PAS0471 3.5K Feb 5 15:41 SM04_report.txt
The report file (report.txt) has a summary of taxonomic assignments, whereas the main output file (main.txt) has one line for each contig with its taxonomic assignment.
We’ll first take a look at the report file, which has the following columns:
- Percentage of fragments covered by the clade rooted at this taxon
- Number of fragments covered by the clade rooted at this taxon
- Number of fragments assigned directly to this taxon
- A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies.
- NCBI taxonomic ID number
- Indented scientific name
less -S results/kraken/SM04_report.txt
Click to show the contents of the file
2.33 2 2 U 0 unclassified
97.67 84 0 R 1 root
97.67 84 0 R1 131567 cellular organisms
73.26 63 0 D 2 Bacteria
72.09 62 3 P 1224 Pseudomonadota
68.60 59 2 C 1236 Gammaproteobacteria
58.14 50 0 O 72274 Pseudomonadales
58.14 50 1 F 135621 Pseudomonadaceae
56.98 49 12 G 286 Pseudomonas
43.02 37 1 G1 136849 Pseudomonas syringae group
41.86 36 0 G2 251695 Pseudomonas syringae group genomosp. 1
41.86 36 36 S 317 Pseudomonas syringae
8.14 7 0 O 91347 Enterobacterales
8.14 7 2 F 543 Enterobacteriaceae
5.81 5 5 G 590 Salmonella
1.16 1 0 D1 1783272 Terrabacteria group
1.16 1 0 P 201174 Actinomycetota
1.16 1 0 C 1760 Actinomycetes
1.16 1 0 O 85009 Propionibacteriales
1.16 1 0 F 31957 Propionibacteriaceae
1.16 1 0 G 1912216 Cutibacterium
1.16 1 1 S 1747 Cutibacterium acnes
24.42 21 0 D 2759 Eukaryota
24.42 21 0 D1 33154 Opisthokonta
24.42 21 0 K 33208 Metazoa
24.42 21 0 K1 6072 Eumetazoa
24.42 21 0 K2 33213 Bilateria
24.42 21 0 K3 33511 Deuterostomia
24.42 21 0 P 7711 Chordata
24.42 21 0 P1 89593 Craniata
24.42 21 0 P2 7742 Vertebrata
24.42 21 0 P3 7776 Gnathostomata
24.42 21 0 P4 117570 Teleostomi
24.42 21 0 P5 117571 Euteleostomi
24.42 21 0 P6 8287 Sarcopterygii
24.42 21 0 P7 1338369 Dipnotetrapodomorpha
24.42 21 0 P8 32523 Tetrapoda
24.42 21 0 P9 32524 Amniota
24.42 21 0 C 40674 Mammalia
24.42 21 0 C1 32525 Theria
24.42 21 0 C2 9347 Eutheria
24.42 21 0 C3 1437010 Boreoeutheria
24.42 21 0 C4 314146 Euarchontoglires
24.42 21 0 O 9443 Primates
24.42 21 0 O1 376913 Haplorrhini
24.42 21 0 O2 314293 Simiiformes
24.42 21 0 O3 9526 Catarrhini
24.42 21 0 O4 314295 Hominoidea
24.42 21 0 F 9604 Hominidae
24.42 21 0 F1 207598 Homininae
24.42 21 0 G 9605 Homo
24.42 21 21 S 9606 Homo sapiens
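Because the report is tab-separated, you can also filter it by column with awk; for example, a small sketch that keeps only genus- and species-level lines (rank code in the 4th column):

# Show only genus- and species-level assignments (rank codes G and S)
awk -F'\t' '$4 == "G" || $4 == "S"' results/kraken/SM04_report.txt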
Exercise: Contamination?
Try to interpret the Kraken report — are there any contaminant contigs?
Click for the solution
Ouch! While the majority of our contigs have been classified as Pseudomonas syringae, we also have a few other bacteria (including the human skin bacterium Cutibacterium acnes), and no fewer than 21 human contigs! The 5th column of the Kraken report has the NCBI taxonomic IDs, and that of human (on the last line) is 9606.
The main.txt output file is tab-separated, with the classification flag (C or U) in the 1st column, the contig ID in the 2nd, the assigned taxonomic ID in the 3rd, and the contig length in the 4th. So, we can use the following command to print just the lines where the 3rd column is 9606:
awk '$3 == 9606' results/kraken/SM04_main.txt
C NODE_28_length_766_cov_1.003130 9606 766 9606:4 131567:5 9606:1 131567:36 9606:11 131567:1 9606:20 131567:47 9606:16 131567:2 9606:5 131567:1 9606:1 131567:1 9606:6 131567:32 9606:68 131567:18 9606:95 131567:146 9606:12 131567:4 9606:11 131567:96 9606:28 131567:5 9606:11 131567:2 9606:29 131567:2 9606:1 131567:5 9606:10
C NODE_33_length_621_cov_0.706478 9606 621 9606:180 0:1 9606:3 0:17 9606:1 0:2 9606:12 0:32 9606:199 131567:5 9606:21 131567:2 1:1 9606:111
C NODE_36_length_567_cov_0.784091 9606 567 0:6 9606:527
C NODE_37_length_563_cov_0.779817 9606 563 9606:271 0:29 9606:229
C NODE_39_length_553_cov_0.809859 9606 553 0:1 9606:7 0:3 9606:261 0:3 9606:1 0:15 9606:3 0:5 9606:220
C NODE_40_length_552_cov_0.809412 9606 552 9606:41 131567:1 9606:1 131567:17 9606:10 131567:5 9606:56 131567:5 9606:89 131567:6 9606:63 131567:3 9606:84 131567:1 9606:1 131567:17 9606:10 131567:5 9606:103
C NODE_42_length_521_cov_0.746193 9606 521 9606:349 0:9 9606:1 0:21 9606:107
C NODE_44_length_510_cov_0.906005 9606 510 9606:476
C NODE_45_length_497_cov_0.772973 9606 497 9606:218 0:47 9606:1 0:28 9606:1 0:15 9606:1 0:4 9606:140 0:8
C NODE_47_length_480_cov_0.866856 9606 480 9606:446
C NODE_48_length_479_cov_0.849432 9606 479 9606:445
C NODE_50_length_470_cov_1.011662 9606 470 9606:117 0:1 9606:5 0:1 9606:3 0:24 9606:285
C NODE_51_length_470_cov_0.915452 9606 470 9606:436
C NODE_52_length_466_cov_1.017699 9606 466 9606:20 0:17 9606:6 0:9 9606:135 0:30 9606:215
C NODE_54_length_458_cov_1.000000 9606 458 9606:213 0:4 9606:5 0:15 9606:1 0:5 9606:125 0:20 9606:2 0:5 9606:3 0:5 9606:21
C NODE_55_length_455_cov_1.179878 9606 455 9606:2 131567:3 9606:218 131567:3 9606:195
C NODE_60_length_444_cov_0.933754 9606 444 9606:410
C NODE_61_length_442_cov_0.907937 9606 442 9606:21 131567:1 9606:386
C NODE_65_length_438_cov_0.405145 9606 438 0:24 9606:2 0:8 9606:80 131567:2 9606:31 131567:19 9606:10 131567:3 9606:13 131567:7 9606:205
C NODE_66_length_433_cov_1.133987 9606 433 9606:399
C NODE_67_length_432_cov_0.718033 9606 432 9606:398
Exercise: Which contigs are contaminants?
In the output above, the contig IDs are in the second column. Do you notice anything about these? Does that provide some independent support for the idea that they are contaminants?
Click for the solution
All of these contigs are small (<600 bp) and have very low coverage (<1.2x, versus the >30x we saw for the contig IDs that we printed earlier).
Note that Kraken doesn’t use this kind of information at all, so this provides independent evidence that this is contamination.
6.4 Removing contaminant contigs
We will use Kraken’s companion program KrakenTools (paper, documentation) to remove the contaminant contigs, with options:
- -k — main Kraken output file
- -s — input sequence file to be filtered
- -o — output sequence file
- -t — NCBI taxonomic ID (9606 = human)
- --exclude — exclude (rather than extract) contigs with the specified taxonomic ID
mkdir results/decontam
extract_kraken_reads.py \
-k results/kraken/SM04_main.txt \
-s results/spades/SM04/contigs.fasta \
-o results/decontam/SM04.fasta \
-t 9606 \
--exclude
PROGRAM START TIME: 02-05-2024 22:16:21
1 taxonomy IDs to parse
>> STEP 1: PARSING KRAKEN FILE FOR READIDS results/kraken/SM04_main.txt
0.00 million reads processed
65 read IDs saved
>> STEP 2: READING SEQUENCE FILES AND WRITING READS
65 read IDs found (0.00 mill reads processed)
65 reads printed to file
Generated file: results/decontam/SM04.fasta
PROGRAM END TIME: 02-05-2024 22:16:21
Let’s check that we indeed have 65 contigs left:
grep -c ">" results/decontam/SM04.fasta
65
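As a cross-check, counting the human-classified contigs in the main Kraken output should account for the difference (86 - 21 = 65):

# Count contigs classified as human (taxid 9606) -- should print 21
awk '$3 == 9606' results/kraken/SM04_main.txt | wc -l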
Footnotes
Note that because individual sequences are commonly spread across multiple lines, FASTA entries do not necessarily cover exactly 2 lines (cf. FASTQ, where each entry is always exactly 4 lines).↩︎