Writing good code
Start with a same list as you created in the exercises for week 8:
diseases
in place.
The sort()
method sorts a list in place:
['canker', 'fruit_rot', 'leaf_blight', 'leaf_spots', 'root_knot', 'root_rot', 'stem_blight', 'wilt']
Instead of sorting in place with the sort()
method like in the previous step, you can also use the sorted()
function, which will not sort in place but return a new, sorted list.
Find out how to use sorted()
to sort in reverse order, and apply this to diseases
to create a new list diseases_sorted
.
We can use the reverse
argument to sorted()
to sort in reverse order:
fewer_diseases = diseases.remove("root_rot")
, what would fewer_diseases
contain? Think about what the answer should be, and then check if you were right. Does simply running fewer_diseases
versus running print(fewer_diseases)
make a difference?
Because remove()
operates in place, it doesn’t return anything:
Well, it actualy returns None
, which you can see by explicitly calling the print()
function:
None
Would the list diseases
also contain crown_galls
? Think about what the answer should be, and then check if you were right.
Yes, diseases
will contain the item crown_galls
that was added to more_diseases
, because more_diseases
in not an independent list but is merely a second pointer to the same list that diseases
points to.
['canker', 'fruit_rot', 'leaf_blight', 'leaf_spots', 'root_knot', 'stem_blight', 'wilt', 'crown_galls']
diseases
to a new list with a name of your choice – the new list should not simply be a pointed to the old one, but a different object in memory. Then, remove all items from the new list. Check if diseases
still contains its items – if not, you’ll have to try again!
To create a new list, use the copy()
method or the [:]
notation.
To create a copy, use the copy()
method:
Or the [:]
notation.
Then, to remove all elements in the copy of the list:
['canker', 'fruit_rot', 'leaf_blight', 'leaf_spots', 'root_knot', 'stem_blight', 'wilt', 'crown_galls']
newstring = oldstring
creates a new string, whereas newlist = oldlist
simply creates a new pointer to the same list?
The fact that strings are immutable, whereas lists are mutable.
diseases
.
Remember how we can turn a list into a string with join()
? If you specify ""
as the separator, it will simply concatenate all the items in the list.
Next, note that applying set()
to a string will extract the unique characters.
First, turn the list into a string using "".join
. Then, call set()
on the string to get a list of unique items (= characters).
{'c', '_', 'm', 'f', 'g', 'h', 'o', 'l', 'r', 'w', 'b', 'e', 'u', 'n', 'p', 'k', 's', 'a', 'i', 't'}
Jiang et al. (2013) studied assortative mating in animals. They compiled a large database, reporting the results of many experiments on mating. In particular, for several taxa they provide the value of correlation among the sizes of the mates. A positive value of r
stands for assortative mating (large animals tend to mate with large animals), and a negative value for disassortative mating.
CSB/good_code/data/Jiang2013_data.csv
1. Write a function that takes as input the desired Taxon and returns the mean value of r
. Then, apply that function to all taxa in the file.
Have a look at the file in the shell before you start.
To parse the file, DictReader
from the csv
module is again a good option, but note that you’ll have the specify the delimiter
argument since the file is tab-separated.
There are several ways of going about here, but the one in the solutions is to first read in the file and create two lists: one with taxa names and one with the corresponding values for r
: (These lists will also be re-used in parts 2 and 3 of the exercise, so it is recommended to follow this approach.)
import csv
= []
taxa = []
r_values
open the file and set up dictionary reader
for each row:
append to taxa append to r_values
Then, the actual function will take these two lists and a taxon name as input.
def compute_avg_r(taxa, r_values, target_taxon = "Fish"):
= 0.0
avg_taxon = 0
num_occurrences
cycle through the values of taxa
every time you find the right taxon, add its r value to avg_taxonand increment num_occurrences
and return the average at the end, divide avg_taxon by num_occurrences
To apply the function to all taxa, use a set to get the unique taxa, and loop over the taxa.
r
, but that this is also true for other taxa. Is the mean value of r
especially high for fish? To test this, compute a p-value by repeatedly sampling 37 values of r
(37 experiments on fish are reported in the database) at random, and calculating the probability of observing a higher mean value of r
. To get an accurate estimate of the p-value, use 50,000 randomizations.
In part 3 of this exercise, you will repeat this procedure for other taxa, so it will be a good idea to create a function here.
Your function should take as input the target taxon name, the lists with taxa and r
-values that you created in the first step, and the number of repetitions (randomizations).
In the function:
First compute the mean r
value for the target taxon.
Then, iterate over the output of range()
to repeat the randomizations.
In every iteration, shuffle either the list with r
values or the taxon names, using the function scipy.random.shuffle()
(don’t forget to import scipy
).
Then, you can simply call the same function you used before to calculate the mean value of r
.
In every iteration, compare the randomized mean with the observed mean, and keep a tally of the number of times the observed mean is higher.
The p-value will simply be the proportion (~= probability) of times you got a higher mean value of r
among randomized sets of 37.
Pseudocode:
def compute_pvalue(taxa, r_values, target_taxon = "Fish", num_rep = 1000):
observed_r = compute the mean for the observed average r value
count_random_is_higher = 0.0
for i in range(num_rep):
shuffle the r values
random_r = compute the mean using the shuffled values
if random_r >= observed_r:
increment count_random_is_higher
now divide count_random_is_higher by num_rep (= the p-value) and return
Loop over all taxa, and call the function you created in the previous part of the exercise in every iteration.
Lahti et al. (2014) studied the microbial communities living in the intestines of 1,000 human individuals. They found that bacterial strains tend to be either absent or abundant, and posit that this would reflect bistability in these bacterial assemblages.
The data used in this study are contained in the directory CSB/good_code/data/Lahti2014
2. The directory contains:
The file HITChip.tab
containing estimates of microbial abundance for each sample, as obtained by HITChip signal.
The file Metadata.tab
, providing metadata about each of the 1,006 human records.
README
, a description of the data by the study’s authors.
Write a function that takes as input a dictionary of constraints (i.e., selecting a specific group of records) and returns a dictionary tabulating the values for the column BMI_group
for all records matching the constraints.
For example, calling:
"Age": "28", "Sex": "female"}) get_BMI_count({
should return:
'NA': 3, 'lean': 8, 'overweight': 2, 'underweight': 1} {
Several strategies are again possible, but the CSB solution linked to below creates a single function will all code, including reading in the file.
Once again, DictReader()
from the csv
module is a useful way to read in the data.
Loop over the lines (rows) and for each row, you’ll want to check whether all conditions are satisfied.
Your output dictionary will basically be a count table of the values found for the BMI_group
, column, so for each matching row, add 1 to the value for the key that represents the group (lean
, etc).
Pseudocode:
def get_BMI_count(dict_constr):
open the file and set up the csv reader
for each row:
add_to_count = True
for each constrain in dict_constr:
if constraint is not met:
add_to_count = False
if add_to_count:
all the constraints are respected
add to the tally
return the result
Write a function that takes as input the constraints (as above) and a bacterial “genus.” The function returns the average abundance (in logarithm base 10) of the genus for each BMI group in the subpopulation.
For example, calling:
"Time": "0",
get_abundance_by_BMI({"Nationality": "US"},
"Clostridium difficile et rel.")
should return:
------------------------------------------------
Abundance of Clostridium difficile et rel.
In subpopulation:------------------------------------------------
-> US
Nationality -> 0
Time ------------------------------------------------
3.08
NA3.31
underweight3.84
lean2.89
overweight3.31
obese3.45
severeobese------------------------------------------------
To write the function, you need to:
Open the file Metadata.tab
, and extract the SampleID
corresponding to the constraints specified by the user (you can use a list to keep track of all IDs).
Open the file HITChip.tab
to extract the abundances matching the genus specified by the user (and for the ID stored in step 1).
To calculate the log value, you can use the log10
function from the scipy
module (though you may get a deprecation warning; this is now supposed to be called from the numpy
module, but we haven’t installed that yet.)
Pseudocode:
def get_abundance_by_BMI(dict_constraints, genus = 'Aerococcus'):
open the file Metadata.tab extract matching IDs using the same
approach as in exercise 1
these IDs are stored in BMI_IDs
Now open HITChip.tab, and keep track of the abundance
of the genus for each BMI group
Calculate means, and print results
Time = 0
.The genera are contained in the header of the file HITChip.tab
. Extract them from the file and store them in a list.
Then, you can call the function get_abundance_by_BMI({'Time': '0'}, g)
, where g is the genus; cycle through all genera.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".