Exercises for week 10

Writing good code


Exercise 1: More lists

Start with the same list that you created in the exercises for week 8:

  1. Sort diseases in place.
Solution

The sort() method sorts a list in place:

['canker', 'fruit_rot', 'leaf_blight', 'leaf_spots', 'root_knot', 'root_rot', 'stem_blight', 'wilt']
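
As a minimal sketch of this step: the items below are taken from the output above, but the starting order is an assumption, since the week 8 list is not repeated here.

```python
# Assumed starting order; only the items themselves come from the output above.
diseases = ["wilt", "canker", "root_rot", "leaf_spots",
            "fruit_rot", "stem_blight", "root_knot", "leaf_blight"]

# sort() reorders the existing list and returns None,
# so the result should not be assigned to a variable.
diseases.sort()
print(diseases)
```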

  1. Instead of sorting in place with the sort() method like in the previous step, you can also use the sorted() function, which will not sort in place but return a new, sorted list.

    Find out how to use sorted() to sort in reverse order, and apply this to diseases to create a new list diseases_sorted.

Solution

We can use the reverse argument to sorted() to sort in reverse order:
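
A minimal sketch, assuming diseases holds the sorted list from the previous step:

```python
diseases = ["canker", "fruit_rot", "leaf_blight", "leaf_spots",
            "root_knot", "root_rot", "stem_blight", "wilt"]

# sorted() returns a new list; reverse=True sorts in descending order.
diseases_sorted = sorted(diseases, reverse=True)

print(diseases_sorted)
print(diseases)  # the original list keeps its order
```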

  1. If you would run fewer_diseases = diseases.remove("root_rot"), what would fewer_diseases contain? Think about what the answer should be, and then check if you were right. Does simply running fewer_diseases versus running print(fewer_diseases) make a difference?
Solution

Because remove() operates in place, it doesn’t return anything:

Well, it actually returns None, which you can see by explicitly calling the print() function:

None
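
A minimal sketch of this behavior, assuming diseases holds the sorted list from the first step. In an interactive session, typing just fewer_diseases shows nothing, because None is not echoed; print() displays it explicitly.

```python
diseases = ["canker", "fruit_rot", "leaf_blight", "leaf_spots",
            "root_knot", "root_rot", "stem_blight", "wilt"]

# remove() changes diseases in place and returns None,
# so fewer_diseases ends up holding None, not a list.
fewer_diseases = diseases.remove("root_rot")

print(fewer_diseases)
print("root_rot" in diseases)
```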

  1. If you would run:

Would the list diseases also contain crown_galls? Think about what the answer should be, and then check if you were right.

Solution

Yes, diseases will contain the item crown_galls that was added to more_diseases, because more_diseases is not an independent list but merely a second pointer to the same list that diseases points to.

['canker', 'fruit_rot', 'leaf_blight', 'leaf_spots', 'root_knot', 'stem_blight', 'wilt', 'crown_galls']
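
The code for this step is not shown above; the sketch below is an assumed reconstruction in which more_diseases is created by plain assignment and crown_galls is then appended to it.

```python
diseases = ["canker", "fruit_rot", "leaf_blight", "leaf_spots",
            "root_knot", "stem_blight", "wilt"]

# Plain assignment does not copy: both names now refer to one list object.
more_diseases = diseases
more_diseases.append("crown_galls")

print(diseases)
print(more_diseases is diseases)  # True when both names point to the same object
```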
  1. Copy diseases to a new list with a name of your choice. The new list should not simply be a pointer to the old one, but a different object in memory. Then, remove all items from the new list. Check whether diseases still contains its items; if not, you’ll have to try again!
Hints

To create a new list, use the copy() method or the [:] notation.

Solution

To create a copy, use the copy() method:

Or the [:] notation.

Then, to remove all elements in the copy of the list:

['canker', 'fruit_rot', 'leaf_blight', 'leaf_spots', 'root_knot', 'stem_blight', 'wilt', 'crown_galls']
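
A minimal sketch of the whole step; the names diseases_copy and also_a_copy are arbitrary choices made for this illustration.

```python
diseases = ["canker", "fruit_rot", "leaf_blight", "leaf_spots",
            "root_knot", "stem_blight", "wilt", "crown_galls"]

# Two equivalent ways to make an independent (shallow) copy.
diseases_copy = diseases.copy()
also_a_copy = diseases[:]

# clear() empties the copy in place; the original is a different object.
diseases_copy.clear()

print(diseases_copy)
print(diseases)
```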

  1. What fundamental difference between lists and strings explains why, after newstring = oldstring, changing newstring never affects oldstring, whereas after newlist = oldlist, changes made through newlist also show up in oldlist?
Solution

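The fact that strings are immutable, whereas lists are mutable. Strictly speaking, newstring = oldstring also just creates a second name for the same object, but because a string can’t be changed in place, any "change" produces a new string and the original is never affected.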
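
The difference can be seen in a short sketch (the example values are made up for illustration). Appending to a list changes the shared object, while "changing" a string builds a new object and rebinds the name.

```python
oldlist = ["canker", "wilt"]
newlist = oldlist
newlist.append("root_rot")  # in-place change, visible through both names
print(oldlist)

oldstring = "canker"
newstring = oldstring
newstring += "_severe"      # creates a new string and rebinds newstring only
print(oldstring)
print(newstring)
```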

  1. Bonus: Get all unique characters (not items) present in diseases.
Hints

Remember how we can turn a list into a string with join()? If you specify "" as the separator, it will simply concatenate all the items in the list.

Next, note that applying set() to a string will extract the unique characters.

Solution

First, turn the list into a string using "".join. Then, call set() on the string to get a set of unique items (= characters).

{'c', '_', 'm', 'f', 'g', 'h', 'o', 'l', 'r', 'w', 'b', 'e', 'u', 'n', 'p', 'k', 's', 'a', 'i', 't'}
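
A minimal sketch of these two steps, assuming diseases is the list as it stood after crown_galls was added. Sets are unordered, so the order in which the characters are displayed can differ between runs; sorted() is used here only to get a stable display.

```python
diseases = ["canker", "fruit_rot", "leaf_blight", "leaf_spots",
            "root_knot", "stem_blight", "wilt", "crown_galls"]

# Concatenate all items into one long string, with no separator.
all_chars = "".join(diseases)

# A set keeps each character only once.
unique_chars = set(all_chars)

print(sorted(unique_chars))
```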


Exercise CSB-1: Assortative mating

Jiang et al. (2013) studied assortative mating in animals. They compiled a large database, reporting the results of many experiments on mating. In particular, for several taxa they provide the value of the correlation (r) between the sizes of the mates. A positive value of r stands for assortative mating (large animals tend to mate with large animals), and a negative value for disassortative mating.

  1. You can find the data in CSB/good_code/data/Jiang2013_data.csv (see footnote 1). Write a function that takes as input the desired Taxon and returns the mean value of r. Then, apply that function to all taxa in the file.
Hints
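
One possible shape for such a function is sketched below. It assumes that the file is tab-delimited and has columns named Taxon and r; check the file header before relying on this. The function name mean_r_for_taxon is hypothetical, and the small in-memory "file" holds made-up values that only serve to make the sketch runnable.

```python
import csv
import io

def mean_r_for_taxon(rows, target_taxon):
    """Return the mean of r over all rows whose Taxon equals target_taxon."""
    values = [float(row["r"]) for row in rows if row["Taxon"] == target_taxon]
    if not values:
        raise ValueError(f"no records for taxon {target_taxon!r}")
    return sum(values) / len(values)

# Made-up stand-in for the data file (assumed layout: tab-delimited, header row).
toy_file = io.StringIO("Taxon\tr\nFish\t0.5\nFish\t0.1\nBird\t-0.2\n")
rows = list(csv.DictReader(toy_file, delimiter="\t"))

# Apply the function to every taxon present in the data.
for taxon in sorted({row["Taxon"] for row in rows}):
    print(taxon, round(mean_r_for_taxon(rows, taxon), 3))
```

With the real data, rows would be read from the opened data file instead of from toy_file.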

  1. You should have seen that fish have a positive value of r, but that this is also true for other taxa. Is the mean value of r especially high for fish? To test this, compute a p-value by repeatedly sampling 37 values of r (37 experiments on fish are reported in the database) at random, and calculating the probability of observing a higher mean value of r. To get an accurate estimate of the p-value, use 50,000 randomizations.
Hints

In the function:

Pseudocode:

def compute_pvalue(taxa, r_values, target_taxon = "Fish", num_rep = 1000):
    observed_r = compute the mean for the observed average r value
    count_random_is_higher = 0.0
    for i in range(num_rep):
        shuffle the r values
        random_r = compute the mean using the shuffled values
        if random_r >= observed_r:
            increment count_random_is_higher
    now divide count_random_is_higher by num_rep (= the p-value) and return
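
The pseudocode above could be turned into runnable code roughly as follows. This is a sketch, not the book's solution: the helper mean_r is hypothetical, the function works on two parallel lists (taxa and r_values), and the toy values are made up so that the example runs without the data file.

```python
import random

def mean_r(taxa, r_values, target_taxon):
    """Mean of the r values at the positions labeled with target_taxon."""
    selected = [r for t, r in zip(taxa, r_values) if t == target_taxon]
    return sum(selected) / len(selected)

def compute_pvalue(taxa, r_values, target_taxon="Fish", num_rep=1000):
    observed_r = mean_r(taxa, r_values, target_taxon)
    shuffled = list(r_values)  # shuffle a copy so the caller's list is untouched
    count_random_is_higher = 0
    for _ in range(num_rep):
        # Shuffling the values while keeping the labels fixed assigns
        # a random subset of r values to the target taxon.
        random.shuffle(shuffled)
        random_r = mean_r(taxa, shuffled, target_taxon)
        if random_r >= observed_r:
            count_random_is_higher += 1
    return count_random_is_higher / num_rep

# Made-up toy data: "Fish" holds the three largest values.
taxa = ["Fish"] * 3 + ["Bird"] * 3
r_values = [0.75, 0.5, 0.25, 0.0, -0.25, -0.5]

random.seed(1)  # fixed seed so repeated runs give the same estimate
p = compute_pvalue(taxa, r_values, "Fish", num_rep=2000)
print(p)
```

For the exercise itself, num_rep would be set to 50000 to match the number of randomizations asked for above.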

  1. Repeat the procedure for all taxa.
Hints

Loop over all taxa, and call the function you created in the previous part of the exercise in every iteration.

Solutions for all steps

See this Jupyter Notebook by the authors of the CSB book.


Bonus: Exercise CSB-2: Human intestinal ecosystems

Lahti et al. (2014) studied the microbial communities living in the intestines of 1,000 human individuals. They found that bacterial strains tend to be either absent or abundant, and posit that this would reflect bistability in these bacterial assemblages.

The data used in this study are contained in the directory CSB/good_code/data/Lahti2014 (see footnote 2). The directory contains:

  1. Write a function that takes as input a dictionary of constraints (i.e., selecting a specific group of records) and returns a dictionary tabulating the values for the column BMI_group for all records matching the constraints.

    For example, calling:

    get_BMI_count({"Age": "28", "Sex": "female"})

    should return:

    {'NA': 3, 'lean': 8, 'overweight': 2, 'underweight': 1}
Hints

Pseudocode:

def get_BMI_count(dict_constr):
    open the file and set up the csv reader
    for each row:
        add_to_count = True
        for each constraint in dict_constr:
            if constraint is not met:
                add_to_count = False
        if add_to_count:
            all the constraints are respected
            add to the tally
    return the result
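
A runnable sketch along the lines of this pseudocode is shown below. To keep it self-contained, the function takes an extra argument, metadata_file (an assumption; the exercise asks for a function of the constraints only), which can be any open tab-delimited file with a header row. The toy metadata is made up, and the column layout is assumed to include BMI_group plus the constrained columns.

```python
import csv
import io

def get_BMI_count(dict_constr, metadata_file):
    """Tally BMI_group over the rows of metadata_file that meet every constraint."""
    bmi_count = {}
    for row in csv.DictReader(metadata_file, delimiter="\t"):
        # all() is True only if every constrained column has the wanted value.
        if all(row[col] == wanted for col, wanted in dict_constr.items()):
            group = row["BMI_group"]
            bmi_count[group] = bmi_count.get(group, 0) + 1
    return bmi_count

# Made-up stand-in for the metadata file.
toy_metadata = io.StringIO(
    "SampleID\tAge\tSex\tBMI_group\n"
    "s1\t28\tfemale\tlean\n"
    "s2\t28\tfemale\tlean\n"
    "s3\t28\tmale\tobese\n"
    "s4\t31\tfemale\tNA\n"
)
print(get_BMI_count({"Age": "28", "Sex": "female"}, toy_metadata))
```

Note that the constraint values are strings ("28", not 28), because the csv module reads every field as text.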

  1. Write a function that takes as input the constraints (as above) and a bacterial genus. The function returns the average abundance (on a log10 scale) of the genus for each BMI group in the subpopulation.

    For example, calling:

    get_abundance_by_BMI({"Time": "0",
                          "Nationality": "US"},
                          "Clostridium difficile et rel.")

    should return:

    ------------------------------------------------
    Abundance of Clostridium difficile et rel.
    In subpopulation:
    ------------------------------------------------
    Nationality -> US
    Time -> 0
    ------------------------------------------------
    3.08 NA
    3.31 underweight
    3.84 lean
    2.89 overweight
    3.31 obese
    3.45 severeobese
    ------------------------------------------------
Hints

To write the function, you need to:

  1. Open the file Metadata.tab, and extract the SampleID corresponding to the constraints specified by the user (you can use a list to keep track of all IDs).

  2. Open the file HITChip.tab to extract the abundances matching the genus specified by the user (and for the ID stored in step 1).

    To calculate the log value, you can use the log10 function from the scipy module, though you may get a deprecation warning: it is now supposed to be called from the numpy module, which we haven’t installed yet. The log10 function in the standard-library math module also works for single values.

    Pseudocode:

    def get_abundance_by_BMI(dict_constraints, genus = 'Aerococcus'):
        open the file Metadata.tab and extract matching IDs using the same
        approach as in exercise 1
        these IDs are stored in BMI_IDs
    
        Now open HITChip.tab, and keep track of the abundance
        of the genus for each BMI group
        Calculate means, and print results
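
The two steps above can be sketched as follows. Several things here are assumptions rather than facts about the data: both files are taken to be tab-delimited with a SampleID column, the mean is computed over log10-transformed abundances, and the function takes the two open files as extra arguments and returns a dictionary instead of printing the report. The toy files are made up, and math.log10 from the standard library is used in place of scipy.

```python
import csv
import io
import math

def get_abundance_by_BMI(dict_constraints, genus, metadata_file, hitchip_file):
    """Return {BMI_group: mean log10 abundance of genus} for matching samples."""
    # Step 1: map each SampleID that meets the constraints to its BMI group.
    bmi_ids = {}
    for row in csv.DictReader(metadata_file, delimiter="\t"):
        if all(row[col] == wanted for col, wanted in dict_constraints.items()):
            bmi_ids[row["SampleID"]] = row["BMI_group"]

    # Step 2: collect the log10 abundances of the genus per BMI group.
    log_abundances = {}
    for row in csv.DictReader(hitchip_file, delimiter="\t"):
        group = bmi_ids.get(row["SampleID"])
        if group is not None:
            value = math.log10(float(row[genus]))
            log_abundances.setdefault(group, []).append(value)

    # Step 3: average within each group.
    return {g: sum(v) / len(v) for g, v in log_abundances.items()}

# Made-up stand-ins for Metadata.tab and HITChip.tab.
toy_metadata = io.StringIO(
    "SampleID\tTime\tBMI_group\n"
    "s1\t0\tlean\ns2\t0\tlean\ns3\t0\tobese\ns4\t1\tlean\n"
)
toy_hitchip = io.StringIO(
    "SampleID\tGenus A\n"
    "s1\t100\ns2\t10000\ns3\t1000\ns4\t10\n"
)

result = get_abundance_by_BMI({"Time": "0"}, "Genus A", toy_metadata, toy_hitchip)
for group, mean_log in result.items():
    print(round(mean_log, 2), group)
```

Keeping the IDs in a dictionary (ID to BMI group) rather than a plain list makes the lookup in step 2 a single operation. An abundance of zero would make math.log10 raise an error, so real data may need a check for that case.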

  1. Repeat this analysis for all genera, and for the records having Time = 0.
Hints

The genera are contained in the header of the file HITChip.tab. Extract them from the file and store them in a list.

Then, you can call the function get_abundance_by_BMI({'Time': '0'}, g), where g is the genus; cycle through all genera.
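
Extracting the genera from the header could look like the sketch below, which assumes that the file is tab-delimited and that its first column holds the sample ID rather than a genus. The toy file content is made up.

```python
import csv
import io

# Made-up stand-in for HITChip.tab.
toy_hitchip = io.StringIO("SampleID\tGenus A\tGenus B\ns1\t100\t20\n")

reader = csv.reader(toy_hitchip, delimiter="\t")
header = next(reader)  # the first row holds the column names
genera = header[1:]    # assumption: the first column is the sample ID

print(genera)

# Each genus g in genera can then be passed to the function from the
# previous step, e.g. get_abundance_by_BMI({'Time': '0'}, g).
```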

Solutions for all steps

See this Jupyter Notebook by the authors of the CSB book.



  1. If necessary, download the CSB repository again using git clone https://github.com/CSB-book/CSB.git

  2. If necessary, download the CSB repository again using git clone https://github.com/CSB-book/CSB.git

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".