Michael Ruggiero of the Integrated Taxonomic Information System has led the World Bee Checklist project, aiming to collect taxonomic information on all the bee species in the world.
In the file CSB/regex/data/bee_list.txt, you can find a list of about 20,000 species (!), along with their TSN (the identifier in the ITIS database), and a column detailing the authors and year of publication of documents describing the species.
What is the name of the author with the most entries in the database? To find out, you’ll need to parse the citations in the file.
Note that you need to account for different citation formats that occur in the file, such as:
(Morawitz, 1877)
Cockerell, 1901
(W. F. Kirby, 1900)
Meade-Waldo, 1914
Eardley & Brooks, 1989
Lepeletier & Audinet-Serville, 1828
Michener, LaBerge & Moure, 1955
The difficult part of the exercise is to write a regular expression capable of capturing a wide variety of styles used to store the authors’ names: there may or may not be a parenthesis at the beginning/end of the string; the initials for the authors may or may not be reported; author names can be separated by commas or ampersands; some names are hyphenated.
Alternatively, it is also possible to match the author names quite generically (non-specifically), and instead make use of the fact that author names are consistently followed by a comma and a 4-digit year.
In your regular expression, capture the author name(s) and the year as separate groups (with ()). Then, you can retrieve the author names and year from your match using the group() method.
Start by testing your regular expression on one or a few citation strings.
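For example, the generic approach could look something like the following sketch (the pattern here is one reasonable choice, not the official solution):

import re

# Optionally an opening parenthesis, then everything up to the last
# ", <4-digit year>", captured as two groups; the closing parenthesis
# is optional as well.
for citation in ["(Morawitz, 1877)", "Michener, LaBerge & Moure, 1955"]:
    match = re.match(r'\(?(.+),\s(\d{4})\)?', citation)
    if match:
        print(match.group(1), '->', match.group(2))
# Prints:
# Morawitz -> 1877
# Michener, LaBerge & Moure -> 1955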
To make sure you’ve captured all authors, check that the number of names matches the number of records.
Because we want to count publications for individual authors and not for combinations of authors, you will need to write a second regular expression that splits the list of authors using commas or &. Also be sure to include any necessary spaces in the splitting pattern!
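For instance, a splitting pattern along these lines should work (a sketch; the exact pattern is up to you):

import re

authors = "Michener, LaBerge & Moure"
# Split on a comma or an ampersand, including the surrounding spaces
# so that they do not end up in the author names.
print(re.split(r',\s|\s&\s', authors))
# Prints: ['Michener', 'LaBerge', 'Moure']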
Next, you want to create a dictionary and count the number of occurrences of each author.
To get the author with the most occurrences, the CSB solution finds the maximum value with max_value_author = max(dict_authors.values()), then turns the dictionary values into a list and gets the index of the maximum value, and finally turns the dictionary keys into a list and extracts the key associated with the maximum value.
However, in last week’s exercises, we saw that we can use the sorted() function to do this a little more easily: get a sorted list of keys using sorted(author_dict, key=author_dict.get, reverse=True) and then print the first value in this sorted list. (Note that in the sorted function, the key argument expects a function to be applied: author_dict.get gets the values, and then sorted() will sort the keys by those values.)
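Putting the counting and sorting together might look like this (a sketch; all_authors stands for a list with one entry per author occurrence, which you will have built from your matches):

# Toy example; in the exercise, all_authors comes from your matches.
all_authors = ['Cockerell', 'Smith', 'Cockerell']
author_dict = {}
for author in all_authors:
    author_dict[author] = author_dict.get(author, 0) + 1
sorted_authors = sorted(author_dict, key=author_dict.get, reverse=True)
print(sorted_authors[0])  # Cockerell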
Pseudocode:
open the file
extract species name
extract author/date string
use re.match to extract i) author list, ii) date
now go through each author list, and split the authors
create a dictionary for authors to count the number of occurrences for each author
get the author with the highest number of occurrences
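A sketch of how these steps might fit together (note the assumptions: the file path, the presence of a header line, and the citation sitting in the last tab-separated column all depend on the actual layout of bee_list.txt):

import re

author_counts = {}
with open('CSB/regex/data/bee_list.txt') as f:  # adjust the path as needed
    next(f)  # skip the header line, assuming there is one
    for line in f:
        citation = line.strip().split('\t')[-1]  # assumed: citation in last column
        match = re.match(r'\(?(.+),\s(\d{4})\)?', citation)
        if match:
            for author in re.split(r',\s|\s&\s', match.group(1)):
                author_counts[author] = author_counts.get(author, 0) + 1

top_author = sorted(author_counts, key=author_counts.get, reverse=True)[0]
print(top_author, author_counts[top_author])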
You should be able to extract the year for each publication with group(), just like you did above for the authors, provided you created a group with () that captures the year in your regular expression.
Then, you should build a dictionary with counts for each year, just like you did for authors above – though here, you don’t need to preprocess the regular expression match, since there is only one year in each citation.
The last step is also very similar to what you did above: you’ll need to extract the year with the highest number of occurrences from your dictionary.
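In code, the year tally could look like this (a sketch, reusing the pattern from above on a few toy citations):

import re

year_counts = {}
for citation in ["(Morawitz, 1877)", "Cockerell, 1901", "Friese, 1901"]:
    match = re.match(r'\(?(.+),\s(\d{4})\)?', citation)
    if match:
        year = match.group(2)
        year_counts[year] = year_counts.get(year, 0) + 1
print(sorted(year_counts, key=year_counts.get, reverse=True)[0])  # 1901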
See the CSB notebook with the solutions.
Where does science come from? This question has fascinated researchers for decades, and has even led to the birth of the field of the “science of science,” where researchers use the same tools they invented to investigate nature to gain insights into the development of science itself. In this exercise, you will build a “map of Science,” showing where articles published in the magazine Science have originated.
You will find two files in the directory CSB/regex/data/MapOfScience. The first, pubmed_results.txt, is the output of a query to PubMed, listing all the papers published in Science in 2015. You will extract the US ZIP codes from this file, and then use the file zipcodes_coordinates.txt to extract the geographic coordinates for each ZIP code.
First, read the file pubmed_results.txt and extract all the US ZIP codes (5-digit numbers). While you may be able to match the ZIP codes alone by creating a pattern that just matches 5 digits, it will be safer to match them using a more extensive pattern: two-letter state code => space => 5 digits => comma and space => USA.
To be able to match the pattern above, you need to account for the fact that the pattern may be split across two lines. Therefore, it’s best to read the entire file into a single string using the read() function (<my_file_handle>.read()), and then replace newlines (\n) followed by six spaces with a single space using the re.sub() function.
Next, you will need a re function that returns all matches and not just a single match, since the entire file is now a single string. Recall that re.findall() will simply return a list of matches and not a match object. Note also that with re.findall(), if you use parentheses to group part of a match (i.e., the 5 digits of the ZIP code), it will return only the group(s), which is exactly what you want!
Pseudocode:
open the file and read all of the text using f.read()
my_text = re.sub(your regex here, ' ', my_text)
use re.findall to extract all ZIP codes
zipcodes = re.findall(another regex here, my_text)
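Filled in with concrete patterns, the pseudocode above might become the following (both patterns are reasonable guesses; verify them against the actual contents of pubmed_results.txt):

import re

with open('CSB/regex/data/MapOfScience/pubmed_results.txt') as f:  # adjust path
    my_text = f.read()

# A newline followed by six spaces marks a wrapped line; re-join them.
my_text = re.sub(r'\n {6}', ' ', my_text)

# Two capital letters (state code), a space, five digits (captured),
# then ", USA".
zipcodes = re.findall(r'[A-Z]{2}\s(\d{5}),\sUSA', my_text)
print(len(zipcodes))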
Next, create four lists, zip_code, zip_long, zip_lat, and zip_count, containing the unique ZIP codes, their longitudes, latitudes, and counts (number of occurrences in Science), respectively. You need to first save and count the unique ZIP codes, and then extract the corresponding longitude and latitude for each ZIP code from the file zipcodes_coordinates.txt.
Pseudocode:
# list of distinct zipcodes
unique_zipcodes = list(set(zipcodes))
for each zipcode:
extract number of occurrences
extract latitude and longitude from zipcodes_coordinates.txt
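One possible translation of this pseudocode (a sketch; the column order in zipcodes_coordinates.txt is an assumption here, so check the file’s header and adjust the indices):

# zipcodes is the list returned by re.findall() above.
unique_zipcodes = list(set(zipcodes))
zip_code, zip_long, zip_lat, zip_count = [], [], [], []

with open('CSB/regex/data/MapOfScience/zipcodes_coordinates.txt') as f:  # adjust path
    next(f)  # skip the header, assuming there is one
    for line in f:
        fields = line.strip().split('\t')
        code = fields[0]  # assumed: ZIP code in the first column
        if code in unique_zipcodes:
            zip_code.append(code)
            zip_lat.append(float(fields[1]))   # assumed: latitude column
            zip_long.append(float(fields[2]))  # assumed: longitude column
            zip_count.append(zipcodes.count(code))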
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(zip_long, zip_lat, s = zip_count, c = zip_count)
plt.colorbar()
# Only plot the continental US without Alaska:
plt.xlim(-125, -65)
plt.ylim(23, 50)
# Add a few cities for reference (optional):
ard = dict(arrowstyle = "->")
plt.annotate('Los Angeles', xy = (-118.25, 34.05),
             xytext = (-108.25, 34.05), arrowprops = ard)
plt.annotate('Palo Alto', xy = (-122.1381, 37.4292),
             xytext = (-112.1381, 37.4292), arrowprops = ard)
plt.annotate('Cambridge', xy = (-71.1106, 42.3736),
             xytext = (-73.1106, 48.3736), arrowprops = ard)
plt.annotate('Chicago', xy = (-87.6847, 41.8369),
             xytext = (-87.6847, 46.8369), arrowprops = ard)
plt.annotate('Seattle', xy = (-122.33, 47.61),
             xytext = (-116.33, 47.61), arrowprops = ard)
plt.annotate('Miami', xy = (-80.21, 25.7753),
             xytext = (-80.21, 30.7753), arrowprops = ard)
params = plt.gcf()
plSize = params.get_size_inches()
params.set_size_inches((plSize[0] * 3, plSize[1] * 3))
plt.show()
See the CSB notebook with the solutions.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".