Exercises: Week 12 – Regular expressions


Exercise CSB-1 (5.9.1): Bee checklist

Michael Ruggiero of the Integrated Taxonomic Information System has led the World Bee Checklist project, aiming to collect taxonomic information on all the bee species in the world.

In the file CSB/regex/data/bee_list.txt (see footnote 1), you can find a list of about 20,000 species (!), along with their TSN (the identifier in the ITIS database), and a column detailing the authors and year of publication of documents describing the species.

  1. What is the name of the author with the most entries in the database? To find out, you’ll need to parse the citations in the file.

    Note that you need to account for different citation formats that occur in the file, such as:

    (Morawitz, 1877)
    Cockerell, 1901
    (W. F. Kirby, 1900)
    Meade-Waldo, 1914
    Eardley & Brooks, 1989
    Lepeletier & Audinet-Serville, 1828
    Michener, LaBerge & Moure, 1955
Hints
   open the file
   extract species name
   extract author/date string
   use re.match to extract i) author list, ii) date
   now go through each author list, and split the authors
   create a dictionary for authors to count the number of occurrences for each author
   get the author with the highest number of occurrences
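
As a minimal sketch of the matching and counting steps in the hints, the code below works on the seven example citations listed above rather than on the full file. The names `citation_re`, `split_authors`, and `author_counts` are illustrative choices, not part of the exercise, and the pattern is only one of several that would handle these formats. A `Counter` is used here as a convenient dictionary subclass; a plain dictionary works just as well.

```python
import re
from collections import Counter

# Example citations copied from the formats listed in the exercise.
citations = [
    "(Morawitz, 1877)",
    "Cockerell, 1901",
    "(W. F. Kirby, 1900)",
    "Meade-Waldo, 1914",
    "Eardley & Brooks, 1989",
    "Lepeletier & Audinet-Serville, 1828",
    "Michener, LaBerge & Moure, 1955",
]

# Optional "(", the author list (group 1), a comma, a 4-digit year (group 2),
# and an optional ")" at the end of the string.
citation_re = re.compile(r"\(?(.+?),\s*(\d{4})\)?$")


def split_authors(author_list):
    """Split 'A, B & C' into ['A', 'B', 'C'] (hypothetical helper)."""
    return [a.strip() for a in re.split(r",|&", author_list) if a.strip()]


author_counts = Counter()
for citation in citations:
    m = citation_re.match(citation)
    if m is None:
        continue  # skip anything that does not look like a citation
    author_counts.update(split_authors(m.group(1)))

# Author with the highest count (ties are broken by first appearance).
top_author, top_count = author_counts.most_common(1)[0]
```

In this small sample every author appears exactly once, so the "winner" is not meaningful; the point is only that each format ends up split into individual author names.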


  2. Which year of publication occurs most often in the database?
Hints

You should be able to extract the year for each publication with group() just like you did above for the authors, provided you created a group with () that captures the year in your regular expression.

Then, you should build a dictionary with counts for each year, just like you did for authors above – though here, you don’t need to preprocess the regular expression match, since there is only one year in each citation.

The last step is also very similar to what you did above: you’ll need to extract the year with the highest number of occurrences from your dictionary.
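
The three steps described in these hints could look roughly like the sketch below. It uses a handful of citations in the formats shown earlier; the repeated "Cockerell, 1901" entry is artificial and is there only so that one year occurs more often than the others. The names `year_re` and `year_counts` are illustrative.

```python
import re

# A few citations in the formats from the exercise; the duplicate entry
# is made up so that one year clearly has the highest count.
citations = [
    "(Morawitz, 1877)",
    "Cockerell, 1901",
    "(W. F. Kirby, 1900)",
    "Meade-Waldo, 1914",
    "Cockerell, 1901",
]

# The parentheses create a group that captures the 4-digit year
# at the end of the citation (before an optional closing ")").
year_re = re.compile(r"(\d{4})\)?$")

year_counts = {}
for citation in citations:
    m = year_re.search(citation)
    if m is None:
        continue
    year = m.group(1)
    year_counts[year] = year_counts.get(year, 0) + 1

# Key with the largest value in the dictionary.
most_common_year = max(year_counts, key=year_counts.get)
```

Here `max` with `key=year_counts.get` compares the dictionary keys by their counts, which avoids sorting the whole dictionary.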


Solutions for both steps

See the CSB notebook with the solutions.


Bonus – Exercise CSB-2 (5.9.2): A map of science

Where does science come from? This question has fascinated researchers for decades, and has even led to the birth of the field of the “science of science,” where researchers use the same tools they invented to investigate nature to gain insights into the development of science itself. In this exercise, you will build a “map of Science,” showing where articles published in the magazine Science have originated.

You will find two files in the directory CSB/regex/data/MapOfScience (see footnote 2). The first, pubmed_results.txt, is the output of a query to PubMed, listing all the papers published in Science in 2015. You will extract the US ZIP codes from this file, and then use the file zipcodes_coordinates.txt to extract the geographic coordinates for each ZIP code.

  1. Read the file pubmed_results.txt, and extract all the US ZIP codes (5-digit numbers).
Hints
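
One possible approach is `re.findall` with a pattern for five digits. The sketch below runs on a made-up text snippet, not on pubmed_results.txt, and the affiliation lines in it are hypothetical. It assumes that a ZIP code follows a two-letter state abbreviation, which is a design choice to avoid matching other five-digit numbers; whether that assumption suits the real file is something to check against the data.

```python
import re

# Hypothetical affiliation lines, invented for illustration only.
sample = (
    "AD  - Department of Biology, Example University, Stanford, CA 94305, USA.\n"
    "AD  - Institute of Examples, Cambridge, MA 02139, USA.\n"
    "AD  - Example Lab, 12345678 Some Road, Stanford, CA 94305, USA.\n"
)

# Two capital letters (state), one whitespace character, then exactly five
# digits; only the digits are captured by the group.
zip_re = re.compile(r"\b[A-Z]{2}\s(\d{5})\b")

# findall returns the captured group for every match, as strings,
# so leading zeros (e.g. "02139") are preserved.
zipcodes = zip_re.findall(sample)
```

The longer number in the third line is ignored because it is neither preceded by a state abbreviation nor exactly five digits long.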


  2. Create the lists zip_code, zip_long, zip_lat, and zip_count, containing the unique ZIP codes, their longitudes, latitudes, and counts (number of occurrences in Science), respectively.
Hints
# list of distinct ZIP codes
unique_zipcodes = list(set(zipcodes))
for zipcode in unique_zipcodes:
    # number of occurrences of this ZIP code
    count = zipcodes.count(zipcode)
    # next: look up its latitude and longitude in zipcodes_coordinates.txt
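
The loop above can be sketched as follows, under loud assumptions: the layout of zipcodes_coordinates.txt is not shown here, so the lookup table `zip_coordinates` is a hypothetical in-memory dictionary that you would build from that file yourself. Its two entries reuse the approximate city coordinates from the plot annotations and are not real ZIP code centroids.

```python
from collections import Counter

# Hypothetical result of the previous step.
zipcodes = ["94305", "02139", "94305"]

# Hypothetical lookup table: ZIP code -> (latitude, longitude).
# In the exercise this would be filled from zipcodes_coordinates.txt.
zip_coordinates = {
    "94305": (37.4292, -122.1381),
    "02139": (42.3736, -71.1106),
}

counts = Counter(zipcodes)  # occurrences of each distinct ZIP code

zip_code, zip_lat, zip_long, zip_count = [], [], [], []
for code, n in counts.items():
    if code not in zip_coordinates:
        continue  # no coordinates known for this ZIP code
    lat, lon = zip_coordinates[code]
    zip_code.append(code)
    zip_lat.append(lat)
    zip_long.append(lon)
    zip_count.append(n)
```

Building the four lists in a single loop keeps them aligned, so that position i in each list refers to the same ZIP code.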


  3. To visualize the data you’ve generated, use the code below.
Code for the plot
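import matplotlib.pyplot as plt
# Jupyter magic: only works in a notebook; remove this line in a plain script
%matplotlib inline

# Marker size and color both encode the number of occurrences
plt.scatter(zip_long, zip_lat, s=zip_count, c=zip_count)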
plt.colorbar()

# Only plot the continental US without Alaska:
plt.xlim(-125,-65)
plt.ylim(23, 50)

# Add a few cities for reference (optional):
ard = dict(arrowstyle="->")
plt.annotate('Los Angeles', xy = (-118.25, 34.05), 
               xytext = (-108.25, 34.05), arrowprops = ard)
plt.annotate('Palo Alto', xy = (-122.1381, 37.4292), 
               xytext = (-112.1381, 37.4292), arrowprops= ard)
plt.annotate('Cambridge', xy = (-71.1106, 42.3736), 
               xytext = (-73.1106, 48.3736), arrowprops= ard)
plt.annotate('Chicago', xy = (-87.6847, 41.8369), 
               xytext = (-87.6847, 46.8369), arrowprops= ard)
plt.annotate('Seattle', xy = (-122.33, 47.61), 
               xytext = (-116.33, 47.61), arrowprops= ard)
plt.annotate('Miami', xy = (-80.21, 25.7753), 
               xytext = (-80.21, 30.7753), arrowprops= ard)

params = plt.gcf()
plSize = params.get_size_inches()
params.set_size_inches( (plSize[0] * 3, plSize[1] * 3) )
plt.show()
Solutions for all steps

See the CSB notebook with the solutions.



  1. If necessary, download the CSB repository again using git clone https://github.com/CSB-book/CSB.git

  2. If necessary, download the CSB repository again using git clone https://github.com/CSB-book/CSB.git

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".