pracs-sp21: Exercises for week 10

Exercise 1: More lists

Start with a same list as you created in the exercises for week 8:

Sort diseases in place.

Solution

The sort() method sorts a list in place:

['canker', 'fruit_rot', 'leaf_blight', 'leaf_spots', 'root_knot', 'root_rot', 'stem_blight', 'wilt']

Instead of sorting in place with the sort() method like in the previous step, you can also use the sorted() function, which will not sort in place but return a new, sorted list.

Find out how to use sorted() to sort in reverse order, and apply this to diseases to create a new list diseases_sorted.

Solution

We can use the reverse argument to sorted() to sort in reverse order:

If you would run fewer_diseases = diseases.remove("root_rot"), what would fewer_diseases contain? Think about what the answer should be, and then check if you were right. Does simply running fewer_diseases versus running print(fewer_diseases) make a difference?

Solution

Because remove() operates in place, it doesn’t return anything:

Well, it actualy returns None, which you can see by explicitly calling the print() function:

None

If you would run:

Would the list diseases also contain crown_galls? Think about what the answer should be, and then check if you were right.

Solution

Yes, diseases will contain the item crown_galls that was added to more_diseases, because more_diseases in not an independent list but is merely a second pointer to the same list that diseases points to.

['canker', 'fruit_rot', 'leaf_blight', 'leaf_spots', 'root_knot', 'stem_blight', 'wilt', 'crown_galls']

Copy diseases to a new list with a name of your choice – the new list should not simply be a pointed to the old one, but a different object in memory. Then, remove all items from the new list. Check if diseases still contains its items – if not, you’ll have to try again!

Hints

To create a new list, use the copy() method or the [:] notation.

Solution

To create a copy, use the copy() method:

Or the [:] notation.

Then, to remove all elements in the copy of the list:

['canker', 'fruit_rot', 'leaf_blight', 'leaf_spots', 'root_knot', 'stem_blight', 'wilt', 'crown_galls']

What fundamental difference between lists and strings makes it so that newstring = oldstring creates a new string, whereas newlist = oldlist simply creates a new pointer to the same list?

Solution

The fact that strings are immutable, whereas lists are mutable.

Bonus: Get all unique characters (not items) present in diseases.

Hints

Remember how we can turn a list into a string with join()? If you specify "" as the separator, it will simply concatenate all the items in the list.

Next, note that applying set() to a string will extract the unique characters.

Solution

First, turn the list into a string using "".join. Then, call set() on the string to get a list of unique items (= characters).

{'c', '_', 'm', 'f', 'g', 'h', 'o', 'l', 'r', 'w', 'b', 'e', 'u', 'n', 'p', 'k', 's', 'a', 'i', 't'}

Exercise CSB-1: Assortative mating

Jiang et al. (2013) studied assortative mating in animals. They compiled a large database, reporting the results of many experiments on mating. In particular, for several taxa they provide the value of correlation among the sizes of the mates. A positive value of r stands for assortative mating (large animals tend to mate with large animals), and a negative value for disassortative mating.

You can find the data in CSB/good_code/data/Jiang2013_data.csv¹. Write a function that takes as input the desired Taxon and returns the mean value of r. Then, apply that function to all taxa in the file.

Hints

Have a look at the file in the shell before you start.
To parse the file, DictReader from the csv module is again a good option, but note that you’ll have the specify the delimiter argument since the file is tab-separated.
There are several ways of going about here, but the one in the solutions is to first read in the file and create two lists: one with taxa names and one with the corresponding values for r: (These lists will also be re-used in parts 2 and 3 of the exercise, so it is recommended to follow this approach.)
```
import csv 
taxa = []
r_values = []

open the file and set up dictionary reader
for each row:
    append to taxa
    append to r_values
```

Then, the actual function will take these two lists and a taxon name as input.

def compute_avg_r(taxa, r_values, target_taxon = "Fish"):
avg_taxon = 0.0
num_occurrences = 0
cycle through the values of taxa
    every time you find the right taxon, add its r value to avg_taxon
    and increment num_occurrences
at the end, divide avg_taxon by num_occurrences and return the average

To apply the function to all taxa, use a set to get the unique taxa, and loop over the taxa.

You should have seen that fish have a positive value of r, but that this is also true for other taxa. Is the mean value of r especially high for fish? To test this, compute a p-value by repeatedly sampling 37 values of r (37 experiments on fish are reported in the database) at random, and calculating the probability of observing a higher mean value of r. To get an accurate estimate of the p-value, use 50,000 randomizations.

Hints

In part 3 of this exercise, you will repeat this procedure for other taxa, so it will be a good idea to create a function here.

Your function should take as input the target taxon name, the lists with taxa and r-values that you created in the first step, and the number of repetitions (randomizations).

In the function:

First compute the mean r value for the target taxon.
Then, iterate over the output of range() to repeat the randomizations.
In every iteration, shuffle either the list with r values or the taxon names, using the function scipy.random.shuffle() (don’t forget to import scipy).

Then, you can simply call the same function you used before to calculate the mean value of r.
In every iteration, compare the randomized mean with the observed mean, and keep a tally of the number of times the observed mean is higher.
The p-value will simply be the proportion (~= probability) of times you got a higher mean value of r among randomized sets of 37.

Pseudocode:

def compute_pvalue(taxa, r_values, target_taxon = "Fish", num_rep = 1000):
    observed_r = compute the mean for the observed average r value
    count_random_is_higher = 0.0
    for i in range(num_rep):
        shuffle the r values
        random_r = compute the mean using the shuffled values
        if random_r >= observed_r:
              increment count_random_is_higher
    now divide count_random_is_higher by num_rep (= the p-value) and return

Repeat the procedure for all taxa.

Hints

Loop over all taxa, and call the function you created in the previous part of the exercise in every iteration.

Solutions for all steps

See this Jupyter Notebook by the authors of the CSB book.

Bonus: Exercise CSB-2: Human intestinal ecosystems

Lahti et al. (2014) studied the microbial communities living in the intestines of 1,000 human individuals. They found that bacterial strains tend to be either absent or abundant, and posit that this would reflect bistability in these bacterial assemblages.

The data used in this study are contained in the directory CSB/good_code/data/Lahti2014². The directory contains:

The file HITChip.tab containing estimates of microbial abundance for each sample, as obtained by HITChip signal.
The file Metadata.tab, providing metadata about each of the 1,006 human records.
README, a description of the data by the study’s authors.

Write a function that takes as input a dictionary of constraints (i.e., selecting a specific group of records) and returns a dictionary tabulating the values for the column BMI_group for all records matching the constraints.

For example, calling:
```
get_BMI_count({"Age": "28", "Sex": "female"})
```
should return:
```
{'NA': 3, 'lean': 8, 'overweight': 2, 'underweight': 1}
```

Hints

Several strategies are again possible, but the CSB solution linked to below creates a single function will all code, including reading in the file.
Once again, DictReader() from the csv module is a useful way to read in the data.
Loop over the lines (rows) and for each row, you’ll want to check whether all conditions are satisfied.
Your output dictionary will basically be a count table of the values found for the BMI_group, column, so for each matching row, add 1 to the value for the key that represents the group (lean, etc).

Pseudocode:

def get_BMI_count(dict_constr):
    open the file and set up the csv reader
   for each row:
        add_to_count = True
        for each constrain in dict_constr:
              if constraint is not met:
                  add_to_count = False
        if add_to_count:
              all the constraints are respected
              add to the tally
   return the result

Write a function that takes as input the constraints (as above) and a bacterial “genus.” The function returns the average abundance (in logarithm base 10) of the genus for each BMI group in the subpopulation.

For example, calling:

get_abundance_by_BMI({"Time": "0",
                      "Nationality": "US"},
                      "Clostridium difficile et rel.")

should return:

------------------------------------------------
Abundance of Clostridium difficile et rel.
In subpopulation:
------------------------------------------------
Nationality -> US
Time -> 0
------------------------------------------------
3.08
NA
3.31
underweight
3.84
lean
2.89
overweight
3.31
obese
3.45
severeobese
------------------------------------------------

Hints

To write the function, you need to:

Open the file Metadata.tab, and extract the SampleID corresponding to the constraints specified by the user (you can use a list to keep track of all IDs).
Open the file HITChip.tab to extract the abundances matching the genus specified by the user (and for the ID stored in step 1).

To calculate the log value, you can use the log10 function from the scipy module (though you may get a deprecation warning; this is now supposed to be called from the numpy module, but we haven’t installed that yet.)

Pseudocode:
```
def get_abundance_by_BMI(dict_constraints, genus = 'Aerococcus'):
    open the file Metadata.tab extract matching IDs using the same 
    approach as in exercise 1
    these IDs are stored in BMI_IDs

    Now open HITChip.tab, and keep track of the abundance
    of the genus for each BMI group
    Calculate means, and print results
```

Repeat this analysis for all genera, and for the records having Time = 0.

Hints

The genera are contained in the header of the file HITChip.tab. Extract them from the file and store them in a list.

Then, you can call the function get_abundance_by_BMI({'Time': '0'}, g), where g is the genus; cycle through all genera.

Solutions for all steps

See this Jupyter Notebook by the authors of the CSB book.

If necessary, download the CSB repository again using git clone https://github.com/CSB-book/CSB.git↩︎
If necessary, download the CSB repository again using git clone https://github.com/CSB-book/CSB.git↩︎

Exercises for week 10

Exercise 1: More lists

Exercise CSB-1: Assortative mating

Bonus: Exercise CSB-2: Human intestinal ecosystems

Reuse