class:inverse middle center # *Week 11 - Python: Scientific computing* ---- # III: BioPython <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/03/25 --- ## What is BioPython? BioPython is a project with a set of related modules to enable the use of Python for various **bioinformatics tasks**, such as parsing and processing bioinformatics file formats, access to online services like NCBI Entrez and BLAST, and so on. <br> Today, we'll work through 2 or 3 small sample applications: 1. Finding and retrieving sequences from NCBI (CSB 6.4.1) 2. Reading and writing sequence data (CSB 6.4.2) 3. Programmatic BLAST search (CSB 6.4.3) --- class: center middle inverse # BioPython (CSB 6.4) ----- # I: Retrieving sequences from NCBI (6.4.1) <br> <br> <br> <br> --- ## Using BioPython to inferface with NCBI databases In the exercises for week 7 (Bioinformatics Data), you manually located some data on the NCBI's website, and then used the address you found to download it using `wget`. If we would regularly need to do download NCBI data, or would need to download data from many organisms, it would be much better to do this entire process programmatically. To do so, we can use NCBI's **"Entrez Programming Utilities"** via the `Entrez` module in BioPython. --- ## Setting up Entres - We start by importing the module – note that BioPython modules are imported using `from Bio import <focal-module>`: ```python from Bio import Entrez ``` - As another setup step, we should provide our e-mail address to let NCBI know who we are: ```python Entrez.email = "me.999@osu.edu" # Replace by your email address ``` --- ## The inquisitive shrew mole Let's say we would like to find DNA sequences for the Inquisitive Shrew Mole, *Uropsilus investigator*, that are available in GenBank and similar databases. <br> <figure> <p align="center"> <img src=img/inquisitive-shrew-mole.jpg width="55%"> <figcaption>The Inquisitive Shrew Mole is endemic to Yunnan, China. (<a href="https://theverybesttop10.com/rarest-species-of-moles-voles-and-shrews/">Image source</a>)</figcaption> </p> </figure> --- ## Perform the search We can perform the search using the **`Entrez.esearch()`** function: ```python search_handle = Entrez.esearch( db = "nuccore", term = ("Uropsilus investigator[Organism]"), retmax = 200 ) ``` - NCBI makes several databases available through Entrez, and we searched the **"Nucleotide" (or `nuccore`) database**, which includes GenBank and RefSeq data, using the `db` argument. - We created a "handle" for the results (similar to *file handles* earlier). - For NCBI searches, we can narrow down results by adding keywords between square brackets, like `[Organism]`, after our search term.<sup>[1]</sup> - We set `retmax = 200` to get all the sequence IDs (default: 20). .footnote[<sup>[1]</sup>See the Nucleotide ["Advanced Search Builder"](https://www.ncbi.nlm.nih.gov/nuccore/advanced) for all possible keywords.] --- ## Save and check the search results - Read the results and close the search handle: ```python search_results = Entrez.read(search_handle) search_handle.close() ``` <br> - Our results are returned in the form of a *dictionary*, so let's look at the keys: ```python search_results.keys() #> dict_keys(['Count', 'RetMax', 'RetStart', 'IdList', #> 'TranslationSet', 'TranslationStack', #> 'QueryTranslation']) ``` --- ## Looking at the results - How many sequences did we find? This is in the `Count` key: ```python search_results["Count"] '126' # Has increased from '71' in the book! ``` - What is the list of GenBank identifiers? This is in the `Idlist` key: ```python id_list = search_results["IdList"] print(id_list) #> ['524853022', '555947199', '555947198', ... , '555946814'] len(id_list) # Making sure we got all 126 IDs #> '126' ``` --- ## Downloading the sequences and <br> writing them to a FASTA file - We'll now use the retrieved GenBank IDs (`id = id_list`) to download the sequences in FASTA format (`rettype = "fasta"`): ```python search_handle = Entrez.efetch(db = "nuccore", rettype = "fasta", id = id_list) ``` -- - While we could parse these sequences directly from the `search_handle`, we will here take the opportunity to practice working with FASTA files. Therefore, we will start by simply writing the sequences to file, which will "automatically" be in FASTA format: ```python with open("Uropsilus_seq.fasta", "w") as fhandle_out: for line in search_handle: fhandle_out.write(line) search_handle.close() ``` -- - **The next step is to read & parse the FASTA file with the `SeqIO` module.** --- class: center middle inverse # BioPython (CSB 6.4) ----- # II: Reading and writing sequence data <br> using *SeqIO* (6.4.2) <br> <br> <br> <br> --- ## Our FASTA file - First, let's take a look at the FASTA file we saved: ```python !head Uropsilus_seq.fasta #> >KC516837.1 Uropsilus investigator isolate A11 apolipoprotein B (ApoB) gene, #> partial cds #> TTAAGTGTAAAGACTCAGTATAAGAAAAACAAAGACAAGCATTCCATCCCAATTCCTTTGGATGCATTTT #> ATGTGTTTATCAATCATAACATCAATTCTTTCATCAGGCAATTTGAGAATGGCAGACAGAAGGCTTTAGT #> TTTTGTTACCAATGTGTACAATGAAACAAAAAATAAGTTTGATCAGCACAAGGCTGAAAAGTCTCTCGAC #> AAGCAACCTAGGATCTTTCAAATCCCTGGATACTCCATTCCTGTCCTCAACATTGAAGTGTCTCCATTCA #> CAGTAAAGATGCTACCATTTGGCTATGTGATACCGGAAGAGATCAGCACCCCTAGCTTCACCATCTGGGA #> TTCTGACTTATATGTGCCCTCCTATACATTAGCCCTGCCATCCCTAGAGCTGCCAGTCCTCTCTATCCCT #> ACAACCCCTCTCAAGTTTTCTCTTCCAGAATGCAAGATGTTGAGTAACTCACAGAATATCTTAATTCCTG #> CCCTGGGCAATATTACCTATGATTTTTCCTTTAAATCAAGTGTCATCACACTGAATACCAATGTCGGCCT #> TTATAACCAATCA !wc -l Uropsilus_seq.fasta #> 1853 ``` --- ## Reading a FASTA file with SeqIO - We start by importing the BioPython `SeqIO` module: ```python from Bio import SeqIO ``` <br> - With `SeqIO.parse()`, we can create an *iterator* called a `SeqRecord`, which we will loop over to retrieve our sequences. ```python records = SeqIO.parse("Uropsilus_seq.fasta", "fasta") ``` The principle is similar to the *file handles* we worked with earlier and is particularly useful for very large FASTA files, so we don't have to load them into memory. --- ## Side note: iterators .content-box-info[ **Iterators** are object that you can, obviously iterate over – most commonly with a `for` loop. What makes iterators different from other objects that you can iterate over, such as lists or strings, is that the **full object is never created.** This way you can, for example, easily iterate over a trillion values without having to initiate a list of a trillion values, which would crash your computer. The `range()` function returns an iterator, for instance: ```python my_range = range(10 ** 12) my_range #> range(0, 1000000000000) for i in my_range: print(i, end = ',') if i > 9: break #> 0,1,2,3,4,5,6,7,8,9,10, ``` ] --- ## Reading a FASTA file with SeqIO (cont.) - Each FASTA record has several attributes, including: - `description` – everything after the `>` - `name` – first word after the `>` - `seq` – the sequence <br> <p align="center"> <img src=img/FASTA.png width="100%"> </p> --- ## Reading a FASTA file with SeqIO (cont.) - For each record, let's print the `description` and length (in bp) of `seq`: ```python for record in records: print(record.description, '\n', len(record.seq), '\n') #> KC516837.1 Uropsilus investigator isolate A11 apolipoprotein B (ApoB) gene, partial cds #> 573 #> #> KC516819.1 Uropsilus investigator voucher mlxs331 cytochrome c oxidase subunit I (COI) gene, partial cds, alternatively spliced; mitochondrial #> 912 #> #> [...] ``` -- .content-box-info[ We can only iterate through the SeqRecord once: ```python for record in records: print(record.description, '\n', len(record.seq), '\n') #> ``` ] --- ## Selecting only sequences for the BMI1 genes Next, say that we want to: - Select only records originating from a specific gene - Only take the first 100 bp of sequence for each record <br> .content-box-info[ This is quite an apt example because **parsing and subsetting FASTA files with *shell tools* is not as easy as you may expect.** This is because each FASTA record spans multiple lines – and the *sequence* for an individual records may or may not itself also be spread across multiple lines. ] --- ## Selecting only sequences for the BMI1 genes Next, say that we want to: - Select only records originating from a specific gene - Only take the first 100 bp of sequence for each record ```python output_handle = open("Uropsilus_BMI1.fasta", "w") for record in SeqIO.parse("Uropsilus_seq.fasta", "fasta"): if record.description.find("BMI1") != -1: short_seq = record[:100] # Take the first 100 bases SeqIO.write(short_seq, output_handle, "fasta") output_handle.close() ``` We used the string method `find()` to match "*BMI1*", making use of the fact that it returns `-1` when no match is found. -- .content-box-info[ Next week, we'll learn how to use the regular expression module `re` to do this more gracefully and flexibly. ] --- class: center middle inverse # BioPython (CSB 6.4) ----- # III: Programmatic BLAST search (6.4.3) <br> <br> <br> <br> --- ## NCBI BLAST The **Basic Local Alignment Search Tool (BLAST)** finds regions of similarity between biological sequences, and is one of the most widely used bioinformatics tools. You may know it by its web interface, where you can paste in or upload some sequences, and then "BLAST them" against one of the NCBI databases: <figure> <p align="center"> <img src=img/blast_screenshot.png width="80%"> <figcaption>Screenshot from <a href="https://blast.ncbi.nlm.nih.gov/Blast.cgi?PROGRAM=blastn&PAGE_TYPE=BlastSearch&LINK_LOC=blasthome">NCBI BLAST</a>.</figcaption> </p> </figure> --- ## Getting set up for BLAST with the Blast module Using the web interface may be convenient when you just have one or a few sequences, but it is **not very reproducible, tedious to repeat, and unsuitable for automation.** Luckily, BioPython provides the `Bio.Blast` package with which we can programmatically run BLAST searches against the NCBI's online databases.<sup>[1]</sup> .footnote[ <sup>[1]</sup> Note, there is also a *command-line* BLAST tool to do these things in the shell. ] -- ```python # from Bio import SeqIO # Should already be loaded # from Bio import Entrez # Should already be loaded from Bio.Blast import NCBIWWW, NCBIXML ``` We imported two modules from the `Bio.Blast` package: - `NCBIWWW` for access to NCBI's BLAST server – note that the BLAST search is still being performed there. - `NCBIXML` to parse the XML format in which our results will be returned. --- ## Storing all FASTA records in a list We will start by parsing our FASTA file again with `SeqIO`. This time, however, we will read the entire file into memory, by converting the `SeqRecord` iterator to a list: ```python records = list(SeqIO.parse("Uropsilus_BMI1.fasta", "fasta")) ``` A list can be more convenient than an iterator –e.g., we can index it and extract things in a specific order– and is not a problem performance-wise when working with a relatively small FASTA file. --- ## Our list of FASTA records What does this list look like? Let's print the first 2 items: ```python records[0:2] #> [SeqRecord(seq=Seq('TATTATGCTGTTTTGTGAACCTGTAGAAAACAAGTGCTTTTTATCTTGAAATTC...CCA', SingleLetterAlphabet()), id='KF778086.1', name='KF778086.1', description="KF778086.1 Uropsilus investigator voucher KIZ:020539 polycomb ring finger oncoprotein (BMI1) gene, 3' UTR", dbxrefs=[]), #> SeqRecord(seq=Seq('TATTATGCTGTTTTGTGAACCTGTAGAAAACAAGTGCTTTTTATCTTGAAATTC...CCA', SingleLetterAlphabet()), id='KF778085.1', name='KF778085.1', description="KF778085.1 Uropsilus investigator voucher KIZ:020527 polycomb ring finger oncoprotein (BMI1) gene, 3' UTR", dbxrefs=[])] ``` <br> From this list, we can still extract record attributes – for instance, we can print the `id` and `seq` for the 4th record: ```python print(records[3].id, " ", records[3].seq) #> KF778083.1 TATTATGCTGTTTTGTGAACCTGTAGAAAACAAGTGCTTTTTATCTTGAAATTCAACAAATGGAAAGAATATGCATAGAATAATGCATTCTATGTAGCCA ``` --- ## Run the BLAST search For our BLAST search, we will use the nucleotide database (`nt`) with the `blastn` algorithm, the BLAST variant for standard nucleotide to nucleotide searches. The function we use to do this is `qblast()`, which requires three arguments: - The program – `blastn` for us - The database – `nt` for us - The query sequence – the 4th record from our FASTA file ```python Entrez.email = "me.999@osu.edu" result_handle = NCBIWWW.qblast("blastn", "nt", records[3].seq) ``` --- ## Write the results to a file and then parse the file Our results are in XML format. Again, we could parse this directly, but it can be useful to save them to file so you have the raw results from the BLAST search (which can be slow) safely stored: ```python with open("my_blast.xml", "w") as fhandle: fhandle.write(result_handle.read()) result_handle.close() ``` -- <br> We will parse our XML file with BLAST results using the `NCBIXML` parser:<sup>[1]</sup> .footnote[<sup>[1]</sup> Because we our *query* consisted only of a single sequence, we could use `NCBIXML.read()`. For multiple sequences, you should use `NCBIXML.parse`.] ```python blast_records = NCBIXML.read(open("my_blast.xml")) ``` --- ## Processing the BLAST results Next, we loop through the individual alignment hits and then for each hit through the "High-Scoring Pairs" (HSPs, subsequences with a good match). For each High-Scoring Pair, we check whether the match is good enough according to arbitrary E-value (`E_VALUE_THRESH`) and length (`LEN_THRESH`) thresholds that we set: ```python E_VALUE_THRESH = 0.04 LEN_THRESH = 3000 for align in blast_records.alignments: # Loop through the alignments for hsp in align.hsps: # For each alignment, loop through HSPs if hsp.expect < E_VALUE_THRESH and align.length > LEN_THRESH: print("sequence:", align.title) print("length:", align.length) print("E value:", hsp.expect, '\n') #> sequence: gi|1835213484|ref|XM_019933391.2| PREDICTED: Tursiops truncatus BMI1 proto-oncogene, polycomb ring finger (BMI1), transcript variant X6, mRNA #> length: 3027 #> E value: 5.55456e-42 #> #> sequence: gi|1835213483|ref|XM_019933388.2| PREDICTED: Tursiops truncatus BMI1 proto-oncogene, polycomb ring finger (BMI1), transcript variant X5, mRNA #> length: 3141 #> E value: 5.55456e-42 ``` --- ## <i class="fa fa-user-edit"></i> BioPython exercises for homework I have included the CSB section on querying PubMed with BioPython's `Entrez` module as a "tutorial exercise" for this week, and the next exercise also works with `Entrez` and PubMed. --- class: center middle inverse # Questions? ----- <br> <br> <br> <br>