Open a new file and save it as week09_exercises.py (or something along those lines).
Type your commands in the script and send them to the prompt in the Python interactive window by pressing Shift+Enter.
If this doesn’t work, check your keyboard shortcut by right-clicking in the script and looking for “Run Selection/Line In Python Interactive Window”.
Also, you can open the Command Palette (Ctrl+Shift+P) and look for that shortcut there, and change it if you want.
From the CSB Chapter 3 preface to the exercises:
Here are some practical tips on how to approach the Python exercises (or any programming task):
- Think through the problem before starting to write code: Which data structure would be more convenient to use (e.g., sets, dictionaries, lists)?
- Break the task down into small steps (e.g., read file input, create and fill data structure, output).
- For each step, describe in plain English what you are trying to do; leave these notes as comments within your program to document your code.
- When working with large files, initially use only a small subset of the data; once you have tested your code thoroughly you can run it on the whole data set.
- Consider using specific modules (e.g., use the csv module to parse each line into a dictionary or a list).
- Skim through appropriate sections above to refresh your memory on data-type-specific methods.
- Use the documentation and help forums.
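As a rough illustration of the "small subset first" tip, here is a minimal sketch that reads only the first few rows of a CSV file while you develop your code. The in-memory toy file, its column names, and its values are made up for the example; `islice` comes from the standard `itertools` module.

```python
import csv
import io
from itertools import islice

# Toy stand-in for a large CSV file (made-up values, not real data).
toy_file = io.StringIO(
    "loc,cases\n"
    "CITY_A,10\n"
    "CITY_B,3\n"
    "CITY_A,7\n"
    "CITY_C,0\n"
)

reader = csv.DictReader(toy_file)

# Only consume the first two data rows while testing the code.
first_rows = list(islice(reader, 2))
for row in first_rows:
    print(row["loc"], row["cases"])
```

Once the code behaves as expected on the subset, you can drop the `islice` call and loop over the reader directly.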
In their article, Dalziel et al. (2016) provide a long time series reporting the numbers of cases of measles before mass vaccination, for many US cities. The data consist of cases in a given US city for a given year, and a given “biweek” of the year (i.e., first two weeks, second two weeks, etc.). The time series is contained in the file Dalziel2016_data.csv.
While you could try to parse the file from scratch (you have learnt the building blocks to do so), using the DictReader from the csv module, as we did in class, will make this easier.
The city name is in the column loc.
Because each city is reported multiple times, the main task here is to remove duplicates. Using a set will be the easiest way to do so, since sets cannot contain duplicates.
You don’t need to write to a new file here; just print the set after you are done processing the file.
Pseudocode:
import csv
cities = an empty set
open data for reading
    create dictionary reader
    for each row in the file
        add the city to the set
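One possible way to turn this outline into Python is sketched below. To keep it runnable on its own, it reads from a small in-memory toy file with made-up city names instead of Dalziel2016_data.csv; only the column name loc is taken from the exercise.

```python
import csv
import io

# Toy stand-in for the data file (made-up rows, not the real data).
toy_file = io.StringIO(
    "loc,cases\n"
    "CITY_A,10\n"
    "CITY_B,3\n"
    "CITY_A,7\n"
)

cities = set()
reader = csv.DictReader(toy_file)
for row in reader:
    # Adding a city that is already in the set has no effect.
    cities.add(row["loc"])

print(cities)
```

With the real data, you would replace the toy file with something like `with open("Dalziel2016_data.csv") as f:` and pass `f` to `csv.DictReader`, adjusting the path to wherever you saved the file.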
Initialize an empty dictionary before you start looping over the lines.
For every line, extract the city name and add 1 to the value for that city in your dictionary, since you are counting rows.
You don’t need to prepopulate the dictionary with all cities: when a key is not yet present, the get() method returns the default value you provide, and assigning the result back to that key adds it to the dictionary.
For example, we can build up a dictionary using get() like so:
dd = {} # empty dictionary
my_list = ['a', 'b', 'a', 'c', 'd', 'b', 'a']
for element in my_list:
    dd[element] = dd.get(element, 0) + 1

print(dd)
{'a': 3, 'b': 2, 'c': 1, 'd': 1}
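A small check, using a throwaway dictionary, of what get() does and does not do: calling it on its own leaves the dictionary untouched, and the key only appears once the result is assigned back.

```python
dd = {}

# get() only returns the default; it does not store anything.
value = dd.get("a", 0)
print(value, dd)  # prints: 0 {}

# The assignment is what actually adds the key.
dd["a"] = dd.get("a", 0) + 1
print(dd)  # prints: {'a': 1}
```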
Pseudocode:
import csv library
citycount = an empty dictionary
open file for reading
    set up dictionary reader
    for each line in data
        my_city = extract the city
        citycount[my_city] = use get to update value
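The counting outline could look roughly like this in Python. This is a sketch on a made-up in-memory file rather than the real data set; the column name loc comes from the exercise, everything else is illustrative.

```python
import csv
import io

# Toy stand-in for the data file (made-up rows, not the real data).
toy_file = io.StringIO(
    "loc,cases\n"
    "CITY_A,10\n"
    "CITY_B,3\n"
    "CITY_A,7\n"
)

citycount = {}
reader = csv.DictReader(toy_file)
for row in reader:
    my_city = row["loc"]
    # Start from 0 the first time a city is seen, then add 1 per row.
    citycount[my_city] = citycount.get(my_city, 0) + 1

print(citycount)
```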
The population size is in the column pop.
Note that for some reason, the population sizes have decimal values.
Again, use a dictionary that you keep adding to for each row of the data set. This time, though, each value in the dictionary should be a list of two items: the total population, and the number of occurrences.
In your get() call, you can initialize the values to be a list of two items as follows (here assuming the dictionary is called citypop and the city’s name has been extracted as mycity):
citypop[mycity] = citypop.get(mycity, [0, 0])
Then, you can refer to each item in the dictionary’s values by chaining indices, e.g. citypop[mycity][0].
Pseudocode:
import csv
citypop = an empty dictionary
open data file for reading
    set up dictionary reader
    for each line in data
        my_city = extract the city
        my_pop = extract population
        if this is the first time you see this city, initialize:
            citypop[my_city] = [0.0, 0]
        citypop[my_city][0] = what it was before + my_pop
        citypop[my_city][1] = what it was before + 1
for each city
    divide the first element by the second to obtain the mean
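A minimal sketch of this running-total approach, again on a made-up in-memory file. The column names loc and pop follow the exercise; the rows are invented. Note that DictReader returns every field as a string, so the population has to be converted with float() before adding.

```python
import csv
import io

# Toy stand-in for the data file (made-up rows, not the real data).
toy_file = io.StringIO(
    "loc,pop\n"
    "CITY_A,100.5\n"
    "CITY_B,50.0\n"
    "CITY_A,101.5\n"
)

citypop = {}
reader = csv.DictReader(toy_file)
for row in reader:
    my_city = row["loc"]
    my_pop = float(row["pop"])  # fields are read as strings
    # [total population, number of rows]; starts at [0.0, 0] for a new city.
    citypop[my_city] = citypop.get(my_city, [0.0, 0])
    citypop[my_city][0] = citypop[my_city][0] + my_pop
    citypop[my_city][1] = citypop[my_city][1] + 1

# Divide the total by the count to get the mean for each city.
mean_pop = {}
for city, (total, count) in citypop.items():
    mean_pop[city] = total / count
    print(city, mean_pop[city])
```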
You can do this in (at least) two ways with a dictionary:
- By creating a nested dictionary: the outer dictionary has a key for each city, and each value is itself a dictionary with a key for each year.
- By using a (city, year) tuple as the keys for the dictionary.
Note that the worked-out solution in the link below uses the first strategy.
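To make the difference between the two layouts concrete, the sketch below fills both from the same made-up rows. The column name year is an assumption (check the header of the real file), and a simple row count stands in for whatever quantity you actually need to accumulate.

```python
import csv
import io

# Toy stand-in for the data file; the "year" column name is assumed.
toy_file = io.StringIO(
    "loc,year\n"
    "CITY_A,1920\n"
    "CITY_A,1920\n"
    "CITY_A,1921\n"
    "CITY_B,1920\n"
)

nested = {}  # strategy 1: nested[city][year]
flat = {}    # strategy 2: flat[(city, year)]

for row in csv.DictReader(toy_file):
    city = row["loc"]
    year = row["year"]

    # Strategy 1: create the inner dictionary the first time a city appears.
    if city not in nested:
        nested[city] = {}
    nested[city][year] = nested[city].get(year, 0) + 1

    # Strategy 2: a tuple is a valid dictionary key, so one level suffices.
    flat[(city, year)] = flat.get((city, year), 0) + 1

print(nested)
print(flat)
```

The nested layout makes it easy to loop over all years of one city, while the tuple layout needs less bookkeeping when filling the dictionary.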
Singh et al. (2015) show that, when infected with a parasite, the four genetic lines of D. melanogaster respond by increasing the production of recombinant offspring (arguably, trying to produce new recombinants able to escape the parasite). They show that the same outcome is not achieved by artificially wounding the flies. The data needed to replicate the main claim (figure 2 of the original article) is contained in the file Singh2015_data.csv.
Open the file, and compute the mean RecombinantFraction for each combination of Drosophila genetic line and InfectionStatus (W for wounded and I for infected).
Print the results in the following form:
Line 45 Average Recombination Rate:
W : 0.187
I : 0.191
For each Drosophila genetic line, you need to keep track of all the recombination rates for W (wounded) and I (infected).
For example, you could build a dictionary of dictionaries in which the first (outer) dictionary has a key for each line, and the inner dictionary has a key for each status (W or I) and a list of recombination rates as each value.
Then, you would calculate averages for each list at the end.
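A hedged sketch of that dictionary-of-dictionaries idea is shown below. It runs on a made-up in-memory file; the column name Line is an assumption (InfectionStatus and RecombinantFraction are named in the exercise), so check the header of Singh2015_data.csv before adapting it.

```python
import csv
import io

# Toy stand-in for the data file; values and the "Line" column name are made up.
toy_file = io.StringIO(
    "Line,InfectionStatus,RecombinantFraction\n"
    "45,W,0.25\n"
    "45,W,0.75\n"
    "45,I,0.5\n"
    "73,I,0.125\n"
)

# rates[line][status] is a list of recombination rates.
rates = {}
for row in csv.DictReader(toy_file):
    line = row["Line"]
    status = row["InfectionStatus"]
    if line not in rates:
        rates[line] = {}
    if status not in rates[line]:
        rates[line][status] = []
    rates[line][status].append(float(row["RecombinantFraction"]))

# Average each list at the end and print in the requested form.
means = {}
for line, by_status in rates.items():
    means[line] = {}
    print("Line", line, "Average Recombination Rate:")
    for status, values in by_status.items():
        means[line][status] = sum(values) / len(values)
        print(f"{status} : {means[line][status]:.3f}")
```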
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".