Exercises for week 11: Scientific computing


The first and last of these exercises are short tutorials, working through two of the examples in the book.

The first two exercises use BioPython, the third uses Pandas, and the last uses NumPy. As we’re getting into more advanced and specialized material, you may be much more interested in some approaches/packages than others, and it is fine if you direct your attention accordingly.

For example, if you are interested in one of these three but not so much in the other two, feel free to skip the other exercises. And if you have time to spare, I would rather recommend digging a bit deeper into your topic of interest.


1: Tutorial exercise
Querying PubMed (CSB 6.4.4)

Using the Entrez.esearch() function we used in class, we can search any NCBI database, including PubMed, which is a comprehensive scientific literature database.

By way of example, we will search for any papers on Drosophila that mention the gene “spaetzle” anywhere in the title or abstract.1

Bonus using regular expressions in Python

Alternatively, we could use regular expressions in Python to nicely retrieve each sentence (rather than each line) that contains “Spaetzle”, and also print the PubMedID (“PMID”) for the publication. You will learn how to use regular expressions like this in Python next week!

import re   # re is Python's regular expression module

with open("Spaetzle_abstracts.txt") as datafile:
    pubmed_input = datafile.read()

# To get titles and abstracts on one line: delete newlines + 6 spaces:
pubmed_input = re.sub(r"\n\s{6}", " ", pubmed_input)

for line in pubmed_input.split("\n"):

    # re.match()'s output will be interpreted as True only if a match is found
    # (here we just use a literal search for "PMID"):
    if re.match("PMID", line):

        # We find and extract the PubMed ID by matching one or more digits;
        # group() will return the match:
        PMID = re.search(r"\d+", line).group()

    if re.match("AB", line):

        # We look for *all* sentences that contain "Spaetzle" using findall():
        spaetzle = re.findall(r"([^.]*?Spaetzle[^.]*\.)", line)
        if spaetzle:
            print("PubMedID: ", PMID, " ", spaetzle)

#> PubMedID:  32591083   [' These events promote the ventral processing of Spaetzle, a ligand for Toll,   which ultimately sets up the embryonic dorsal-ventral axis.']
#> PubMedID:  31719046   ['  The dynamics indicate that a sharp extracellular gradient is formed through   diffusion-based shuttling of the Spaetzle (Spz) morphogen that progresses through several nuclear   divisions.']
#> PubMedID:  27314646   [' While cytokines activating immune responses,  such as Spaetzle or Unpaired-3,   have been identified and


2: Exercise CSB-1:
Lord of the fruit flies

(You will need to have run through the previous tutorial exercise to be able to do this one.)

Suppose you need information on how to breed Drosophila virilis in your laboratory and you would like to contact an expert. Conduct a PubMed query to find out who has published the most contributions on D. virilis. This person might be a good researcher to contact.

  1. Identify how many papers in the PubMed database have the words Drosophila virilis in their title or abstract.
Hints

Pseudocode:

import the Entrez module
Entrez.email = "your email here"

# Create a handle:
handle = Entrez.esearch(your code here)

# Retrieve the records
record = Entrez.read(handle)
close the handle

save WebEnv from your record
save QueryKey from your record

Solutions

# Import the Entrez module:
from Bio import Entrez

# Tell NCBI who you are:
Entrez.email = "me.1@osu.edu" # EDIT TO YOUR ADDRESS

# Perform the search:
handle = Entrez.esearch(db = "pubmed",
                        term = "Drosophila virilis[Title/Abstract]",
                        usehistory = "y")
# Save the results from the search handle in "record" and close the handle:                 
record = Entrez.read(handle)
handle.close()

# Print the number of matching papers:
print(record["Count"])

# Save the WebEnv and QueryKey for later use:
webenv = record["WebEnv"]
query_key = record["QueryKey"]

For the similar CSB solutions to all steps in this exercise, see here.


  2. Retrieve the PubMed entries that were identified in step 1, and write them to a new file called D_virilis_pubs.txt.
Hints

Pseudocode:

handle = fetch records from "pubmed" db with "medline" rettype, "text" retmode,
              600 retmax, and the saved webenv and query_key
data = read from handle
close the handle

with open output file as handle
    write the data

Solutions

# Fetch the records:
shandle = Entrez.efetch(db = "pubmed",
                        rettype = "medline",
                        retmode = "text",
                        retstart = 0, retmax = 600,
                        webenv = webenv, query_key = query_key)

# Save results from the search handle in "data" and close the handle:
data = shandle.read()
shandle.close()

with open("D_virilis_pubs.txt", "w") as fhandle:
    fhandle.write(data)


  3. Count the number of contributions per author.

    Start by taking a look at your D_virilis_pubs.txt file to see how you can match lines that contain author names.

Hints

Pseudocode:

with open D_virilis_pubs.txt
     initialize empty dict
     for line in fhandle
          if line contains "AU -"
              split the line by "-" and take the 2nd element
              strip whitespace
              initialize author if not in dict and add 1 to keep count

Solutions

# Initialize an empty dictionary:
author_dict = {}

with open("D_virilis_pubs.txt") as fhandle:

    # Loop through each line:
    for line in fhandle:

        # Search for lines with "AU  -", which contain authors:
        if line.find("AU  -") != -1:

            # The author name is after the "-",
            # so we split by "-" and take the 2nd element:
            author = line.split("-", 1)[1]

            # Then, we remove any leading and trailing whitespace:
            author = author.strip()

            # If key (= author) is present, add 1; otherwise, initialize at 1:
            author_dict[author] = author_dict.get(author, 0) + 1
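As an aside, Python's collections.Counter can do the same tallying more compactly. A minimal sketch on a few made-up MEDLINE-style lines (the author names here are invented for illustration):

```python
from collections import Counter

# A few made-up lines in the MEDLINE "AU  - " format (illustrative only):
lines = [
    "AU  - Smith A",
    "TI  - Some title",
    "AU  - Jones B",
    "AU  - Smith A",
]

# Keep only author lines, extract the name after the "-", and tally:
author_counts = Counter(
    line.split("-", 1)[1].strip()
    for line in lines
    if "AU  -" in line
)

print(author_counts)   # Counter({'Smith A': 2, 'Jones B': 1})
```

A Counter also has a most_common() method, which comes in handy for the next step.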


  4. For the five authors with the most papers on D. virilis, print each name and the corresponding number of papers.

    You'll have to use the sorted() function in a way we have not seen yet; see the Hints below for more details.

Hints

To get the top authors, you can use the sorted() function on your dictionary. You'll have to provide two additional arguments: a key (so the sorting is done by the dictionary's values rather than its keys) and reverse = True (so the sorting is in descending order).

Then, take the first 5 items in the resulting list, which will be the top 5 authors.

Finally, loop through your list of top authors and print their names (keys) and number of papers (values).

Pseudocode:

sorted_authors = sort dict by value in reverse order
top_authors = take top 5 from sorted_authors
for author in top authors
    print the name and the number of papers

Solutions

sorted_authors = sorted(author_dict, key = author_dict.get, reverse = True)
top_authors = sorted_authors[:5]

for author in top_authors:
    print(author, ":", author_dict[author])

#> Gruntenko NE : 36
#> Evgen'ev MB : 31
#> Hoikkala A : 24
#> Raushenbakh IIu : 24
#> Korochkin LI : 22


3: Exercise CSB-2: Rejection rates

Fox et al. (2016) studied the effects of the genders of the handling editors and reviewers on the outcome of papers. For the study, they compiled a database including all the submissions to the journal Functional Ecology from 2004 to 2014. Their data are reported in CSB/scientific/data/Fox2015_data.csv.2

Besides the effects of gender and bias in journals, the data can be used to investigate whether manuscripts having more reviewers are more likely to be rejected. Note that this hypothesis should be tested for reviewed manuscripts, that is, excluding “desk rejections” without review.

  1. Import the data using Pandas, and count the number of reviewers (by summing ReviewerAgreed) for each manuscript (i.e., unique MsID). The column FinalDecision contains 1 for rejection, and 0 for acceptance.

    Compile a table measuring the probability of rejection given the number of reviewers. Does the probability of being rejected seem to increase with the number of reviewers?

Hints

With an eye on the next step, where you have to do the same thing for each year, it is convenient to write a function that takes the data and a calendar year as input, and prints the probability of rejection given the number of reviewers for that given year.

We can set the function to return the general rejection rate if “all” instead of a year is specified as an argument.

Pseudocode:

import pandas
import numpy as np

# read the data using pandas
# (assuming you are in 'CSB/scientific/sandbox')
fox = pandas.read_csv("../data/Fox2015_data.csv")

use a combination of list and set() to extract the unique `MsID`

now go through each manuscript and store:
i) the final decision (reject/accept) in the np.array final_decision
ii) the number of reviewers in the np.array num_reviewers
iii) the submission year in the np.array year

def get_prob_rejection(my_year = "all"):
    if my_year == "all":
        do not subset the data
    else:
        subset the data to use only the specified year
    for each number of reviewers:
        compute probability of rejection and produce output
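The pseudocode above might translate into something like the following sketch. It uses small made-up arrays in place of the real Fox2015 data, and it additionally returns the per-reviewer-count rates so they can be reused; treat the details as assumptions, not as the official solution:

```python
import numpy as np

# Made-up stand-ins for the arrays built from the real data (one entry per manuscript):
final_decision = np.array([1, 0, 1, 1, 0, 1])       # 1 = rejected, 0 = accepted
num_reviewers = np.array([0, 0, 2, 2, 3, 3])
year = np.array([2009, 2009, 2009, 2010, 2010, 2010])

def get_prob_rejection(my_year="all"):
    # Subset the data unless all years are requested:
    if my_year == "all":
        keep = np.ones(len(year), dtype=bool)
    else:
        keep = year == my_year
    decisions = final_decision[keep]
    reviewers = num_reviewers[keep]
    print("Year:", my_year)
    print("Submissions:", len(decisions))
    print("Overall rejection rate:", round(float(decisions.mean()), 3))
    # Rejection rate for each observed number of reviewers:
    rates = {}
    for n in np.unique(reviewers):
        subset = decisions[reviewers == n]
        rates[int(n)] = (len(subset), round(float(subset.mean()), 3))
        print(n, len(subset), rates[int(n)][1])
    return rates

rates_2009 = get_prob_rejection(2009)
```

With the real data, the same function should reproduce output in the format shown for step 2 below.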


  2. Write a function to repeat the analysis above for each year represented in the database.

    For example, for the year 2009, your function should return:

    Year: 2009
    Submissions: 626
    Overall rejection rate: 0.827
    NumRev    NumMs   rejection rate
    0   306   0.977
    1   2     0.5
    2   228   0.68
    3   86    0.698
    4   4     0.75
Hints

If your function doesn’t already take a calendar year as an argument, modify your function to do so.

Then loop through the years of 2004-2014 (inclusive) and run your function for each year.


Solutions for both steps

See the CSB notebook with the solutions.

The notebooks sometimes don’t manage to load on GitHub; if that happens, try refreshing, and otherwise go here and look at the PDF version instead.

Note: The solution uses some “plain Python” approaches where specialized Pandas functions are also available to do the same thing more succinctly. That is fine, and makes sense given that we have only had a quick introduction to Pandas and did not learn about these more advanced functions — but it may be good to be aware of this.

If you want to try out specialized Pandas approaches for this exercise, look into the groupby method in particular.
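For instance, a groupby-based version of step 1 might look roughly like this. The toy DataFrame below just mimics the relevant Fox2015 columns (one row per reviewer invitation), so the numbers are made up:

```python
import pandas as pd

# Toy data mimicking the relevant columns (one row per reviewer invitation):
fox = pd.DataFrame({
    "MsID": [1, 1, 2, 3, 3, 3],
    "ReviewerAgreed": [1, 1, 0, 1, 1, 1],
    "FinalDecision": [1, 1, 0, 0, 0, 0],
})

# Collapse to one row per manuscript: total reviewers and its (single) final decision:
per_ms = fox.groupby("MsID").agg(
    num_reviewers=("ReviewerAgreed", "sum"),
    rejected=("FinalDecision", "first"),
)

# The mean of the 0/1 decisions is the probability of rejection per reviewer count:
prob_rejection = per_ms.groupby("num_reviewers")["rejected"].mean()
print(prob_rejection)
```

The two groupby calls replace the explicit loops over manuscripts and reviewer counts in the plain-Python approach.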


4: Tutorial exercise
Image processing with NumPy

Plotting works out of the box in the VS Code interactive window.
If you’re using a Jupyter Notebook, you would also need to make Matplotlib image plotting available using the following IPython “magic function”:

%matplotlib inline
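Since images are just NumPy arrays (height × width for grayscale, height × width × 3 for RGB), basic image manipulations are plain array operations. A small self-contained sketch on a synthetic grayscale “image”, so no image file is needed:

```python
import numpy as np

# A synthetic 4x4 grayscale "image": each row is a gradient from 0 to 255
img = np.tile(np.linspace(0, 255, 4), (4, 1))

# Flip left-right by reversing the column axis:
flipped = img[:, ::-1]

# Threshold into a black-and-white version: bright pixels become 255, the rest 0:
bw = np.where(img > 127, 255, 0)

# Invert the grayscale values:
inverted = 255 - img

print(img[0])
print(bw[0])
```

With a real image loaded (e.g. via matplotlib.pyplot.imread()), the same slicing and arithmetic apply, and matplotlib.pyplot.imshow() will display the result.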

  1. See here for PubMed search options.↩︎

  2. If necessary, download the CSB repository again using git clone https://github.com/CSB-book/CSB.git↩︎

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".