pracs-sp21: Exercises: Week 9

Setup

Open a new file and save it as week09_exercises.py or something along those lines.
Type your commands in the script and send them to the prompt in the Python interactive window by pressing Shift+Enter.

Problems with the keyboard shortcut?

If this doesn’t work, check your keyboard shortcut by right-clicking in the script and looking for “Run Selection/Line In Python Interactive Window”.

Also, you can open the Command Palette (Ctrl+Shift+P) and look for that shortcut there, and change it if you want.

Intro to CSB exercises

From the CSB Chapter 3 preface to the exercises:

Here are some practical tips on how to approach the Python exercises (or any programming task):

Think through the problem before starting to write code: Which data structure would be more convenient to use (e.g., sets, dictionaries, lists)?

Break the task down into small steps (e.g., read file input, create and fill data structure, output).

For each step, describe in plain English what you are trying to do— leave these notes as comments within your program to document your code.

When working with large files, initially use only a small subset of the data; once you have tested your code thoroughly you can run it on the whole data set.

Consider using specific modules (e.g., use the csv module to parse each line into a dictionary or a list).

Skim through appropriate sections above to refresh your memory on data-type-specific methods.

Use the documentation and help forums.

Exercise CSB-1: Measles time series

In their article, Dalziel et al. (2016) provide a long time series reporting the numbers of cases of measles before mass vaccination, for many US cities. The data consist of cases in a given US city for a given year, and a given “biweek” of the year (i.e., first two weeks, second two weeks, etc.). The time series is contained in the file Dalziel2016_data.csv.

Write a program that extracts the names of all the cities in the database (one entry per city).

Hints

While you could try to parse the file from scratch (you have learnt the building blocks to do so), using the DictReader from the csv module, as we did in class, will make this easier.
The city name is in the column loc.
Because each city is reported multiple times, the main task here is to remove duplicates. Using a set will be the easiest way to do so, since sets cannot contain duplicates.
You don’t need to write to a new file here, just print the set after you are done processing the file.

Pseudocode:

import csv
cities = an empty set
open data for reading
create dictionary reader
for each row in the file
    add the city to the set

Write a program that creates a dictionary where the keys are the cities and the values are the number of records (rows) for that city in the data.

Hints

Initialize an empty dictionary before you start looping over the lines.
For every line, extract the city name and add 1 to the value for that city in your dictionary, since you are counting rows.
You don’t need to prepopulate the dictionary with all cities: when you provide a default value with the get() method, a key that is not yet present will be added to the dictionary with said default value.

For example, we can build up a dictionary using get() like so:
```
dd = {} # empty dictionary
my_list = ['a', 'b', 'a', 'c', 'd', 'b', 'a']
for element in my_list:
    dd[element] = dd.get(element, 0) + 1

print(dd)
```
```
{'a': 3, 'b': 2, 'c': 1, 'd': 1}
```

Pseudocode:

import csv library
citycount = an empty dictionary
open file for reading
set up dictionary reader
  for each line in data
        my_city = extract the city
        citycount[my_city] = use get to update value

Write a program that calculates the mean population for each city, obtained by averaging the values of pop.

Hints

Note that for some reason, the population sizes have decimal values.
Again, use a dictionary that you keep adding to for each row of the data set. This time, though, each value in the dictionary should be a list of two items: the total population, and the number of occurences.
In your get() call, you can initialize the values to be a list of two items as follows (here assuming the dictionary is called citypop and the city’s name has been extracted as mycity):
```
citypop[mycity] = citypop.get(mycity, [0, 0])
```
Then, you can refer to each item in the dictionary’s values by chaining indices, e.g. citypop[mycity][0].
Pseudocode:

import csv
citypop = an empty dictionary
open data file reading
set up dictionary reader

for each line in data
  my_city = extract the city
  my_pop = extract population
  if this is the first time you see this city, initialize:
     citypop[my_city] = [0.0, 0]
  citypop[my_city][0] = what it was before + my_pop
  citypop[my_city][1] = what it was before + 1

for each city
  divide the first element by the second to obtain the mean

Write a program that calculates the mean population for each city and year.

Hints

You can do this in (at least) two ways with a dictionary:
- By creating a nested dictionary: each city is a dictionary, which itself contains a dictionary for each year.
- By using a (city, year) tuple as the keys for the dictionary.

Note that the worked-out solution in the link below uses the first strategy.

Bonus: Exercise CSB-2: Red queen in fruit flies

Singh et al. (2015) show that, when infected with a parasite, the four genetic lines of D. melanogaster respond by increasing the production of recombinant offspring (arguably, trying to produce new recombinants able to escape the parasite). They show that the same outcome is not achieved by artificially wounding the flies. The data needed to replicate the main claim (figure 2 of the original article) is contained in the file Singh2015_data.csv.

Open the file, and compute the mean RecombinantFraction for each Drosophila genetic line, and InfectionStatus (W for wounded and I for infected).

Print the results in the following form:

Line 45 Average Recombination Rate:
W : 0.187
I : 0.191

Hints

For each Dropsophila genetic line, you need to keep track of all the recombination rates for W (wounded) and I (infected).

For example, you could build a dictionary of dictionaries in which the first (outer) dictionary has a key for each line, and the inner dictionary has a key for each status (W or I) and a list of recombination rates as each value.

Then, you would calculate averages for each list at the end.

Exercises: Week 9

Setup

Intro to CSB exercises

Exercise CSB-1: Measles time series

Bonus: Exercise CSB-2: Red queen in fruit flies

CSB Solutions

Reuse