class:inverse middle center # *Week 10 - Writing good code* ---- # I: Modules, writing functions, <br> and program structure <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/03/16 --- ## Overview of this week .pull-left[ **This presentation - CSB 4.2:** - Modules - Writing your own functions - Program structure ] .pull-right[ **Thursday's presentation:** - Running Python scripts (4.4) - Errors and error handling (4.5) - Beyond the basics (4.9) ] <br> <br> .content-box-info[ **Not discussed - CSB sections on:** - Writing style (4.3) - Debugging (4.6), Unit testing (4.7), and Profiling (4.8) ] --- class: center middle inverse # Modules ----- <br> <br> <br> <br> --- ## Modules and packages - **Python modules** contain *collections of functions*. Technically, a Python module is a single `.py` file. - **Packages** are *collections of modules* packaged together (though the term module is in practice also used for packages). <br> Use the **`import` statement** to import modules and packages into your Python environment, be that in a script or an interactive environment: ```py import os ``` <br> .content-box-info[ The term **library** is not as well-defined, and can be used interchangeably with large packages, but is also used to describe a repository of modules. ] --- ## Modules from the Python standard library Modules from the **Python standard library** are part of any Python installation, but do need to be *loaded* before you can use their functionality. For example: `csv` (working with CSV files), `os` (interfacing with the OS), and `re` (using regular expressions – week after next). --- ## Installing modules/packages So-called "**third-party**" modules/packages do need to be installed. These are not installed from inside Python, but using stand-alone package managers that can be called in the shell: - Conda, which we have already worked with – for instance, to install the `numpy` ("NUMerical PYthon") library into an active Conda environment: ```sh $ conda install numpy # Shell ``` - Another common package manager is the Python-specific `pip`: ```sh $ pip install --user numpy # Shell ``` <br> .content-box-info[ At OSC, Conda is generally the preferred method of managing Python modules. ] --- ## Installing modules/packages .content-box-diy[ **Let's install a couple of packages, which we'll need this week and especially next week into our Conda environment `ipy-env`.** Open a terminal in VS Code, type `bash` to break out of the container, and: ```sh sinteractive -A PAS1855 -t 60 # Start an interactive job module load python/3.6-conda5.2 # Load the OSC Conda module source activate ipy-env # Activate your "ipy-env" # Finally, install the Python packages: conda install -y numpy pandas scipy biopython matplotlib ``` ] --- ## Loading modules In Python, it is customary **not** to load modules directly into your "namespace". This means that the module's functions will not be accessibly by just their function names: instead, you need to **prepend the module name**. -- - For example, simply using the module name as in `import mymodule` will load all of the module's functions, which can then be referred to using `mymodule.myfunction`: ```py import os os.getcwd() ``` -- - If a module's name is long(ish), it is common to use `as` to define an "alias" for the module, so the function calls can be shorter: ```py import numpy as np np.array([1, 2, 3, 4]) ``` *(While the alias name can be anything you like, most modules have commonly-used aliases, like `np` for `numpy`.)* --- ## Loading modules (cont.) - To load only certain functions from a module, or to load specific modules from multi-module packages: ```py # From the snakemake.io module, import two functions: from snakemake.io import expand, glob_wildcards # From the matplotlib package, import the pyplot module: from matplotlib import pyplot as plt # Or equivalently using <package>.<module>: import matplotlib.pyplot as plt ``` <br> -- .content-box-info[ Finally, it is *possible yet discouraged* to load all functions from a module directly into your namespace using an asterisk `*` wildcard: ```py from numpy import * array([1, 2, 3, 4]) # Now, no need for "numpy." / "np." prefix ``` ] --- ## Creating modules **In Python, it is exceedingly easy to create a module:** Just save a file with functions (e.g. `my_utils.py`) and load it using: ```python import my_utils ``` *(This works assuming the file is in the same directory or can otherwise be located by Python.)* We'll do this in a little bit with our popgen simulation! --- class: center middle inverse # Writing your own Python functions ----- <br> <br> <br> <br> --- ## Writing our first function A function to calculate the **GC content of a DNA sequence** provided as a string (the GC content is the proportion of bases that are either Gs or Cs): ```py def GC_content(dna): # Make sure the input string is all uppercase: dna = dna.upper() # Count the occurrences of each nucleotide: numG = dna.count("G") numC = dna.count("C") numA = dna.count("A") numT = dna.count("T") # Finally, calculate (G + C) / (A + T + G + C): GC = (numG + numC) / (numG + numC + numT + numA) return GC ``` .content-box-info[ Create a "cell" in your Python script using `# %%` – now, <kbd>Shift</kbd>+<kbd>Enter</kbd> will execute the entire cell, and create a new one below it. ] --- ## Writing our first function (cont.) ```py def GC_content(dna): # Arbitrary code... return GC ``` The anatomy of a Python function: - Start with the **keyword `def`** (for *define*) - Next, provide the **name of the function**. This can be whatever we like – though it is highly recommended to be descriptive! - Between parentheses, list the **argument(s)** (input) of the function. - Don't forget the **colon `:`** before you start the body of the function! - All the code belonging to the function is indented. - The function returns output as defined after the **`return` keyword**. --- ## Using our first function ```python GCcontent("AAATTTTG") #> 0.125 GCcontent("ATGCATGCATGC") #> 0.5 ``` Excellent! --- ## Why are custom functions useful? -- Custom functions are useful to break up your code in small(ish) units. This way, the main part of your program (script) can become a series of logically connected function calls, and the nuts and bolts are in the individual functions. -- <br> **Having modular code with functions:** - Makes it easier to see what your program does. - Makes it easier to test your code. - Makes it easier to repeat operations (like calculating GC content for multiple sequences). - Makes your programs more flexible – thanks to usage of function *arguments*. --- ## Another simple function A function to calculate the square for a series of numbers, and compile those squares in a list: ```python def squared(start = 1, end = 10): # We provide default values! # Create empty list to catch result of each iteration: results = [] # Loop through the numbers: for i in range(start, end): i_square = i ** 2 # Calculate square results.append(i_square) # Append current value to list return results ``` -- Let's try it: ```python squared(start = 3, end = 10) ``` -- ```python #> [9, 16, 25, 36, 49, 64, 81] ``` --- ## Arguments with default values ```python def squared(start = 1, end = 10): ... ``` Because we specified default values in the function definition, we can omit one or both! ```python # If we only specify "start", then "end" has default value 10: squared(5) #> [25, 36, 49, 64, 81] # If we only specify "end", then "start" has default value 1: squared(end = 3) #> [1, 4] # No arguments provided: start=1, end=10: squared() #> [1, 4, 9, 16, 25, 36, 49, 64, 81] ``` --- ## Variable scope The "scope" of a variable refers to the environment in which it can be accessed: - When a variable has **"global scope"**, it is available anywhere in the program unless/until set to another value. This is Python's behavior for variables set anywhere *outside of functions*. - When a variable has **"function scope"**, it is available only inside the function. This is the Python's default behavior *inside functions*. Let's see this in action: ```python results = [5, 10] results #> [5, 10] # Our squared() function collects results in the list "results" squared() #> [1, 4, 9, 16, 25, 36, 49, 64, 81] # However, after calling the function, results has been unchanged: results #> [5, 10] ``` --- ## Variable scope The behavior of limiting the scope of variables in functions is useful because: -- - It means you don't have to worry about **variable names overlapping** with the rest of the program. - It ensures functions don't have **"*side effects*"** apart from the single object they return: recall, a function exits after a statement with `return`. *(It is possible to set variables with a global scope in functions, using the `global` keyword, but this is not generally recommended.)* -- .content-box-info[ So, functions are supposed to be their "own little words": - They *should* only use objects that are explicitly passed to the function as arguments: **independence from the global environment**. - They *should* not modify anything in the global environment except the single object that they return: **no side effects to the global environment**. ] --- ## <i class="fa fa-user-edit"></i> Intermezzo 4.1: <br> Determine the function of the function ```python def foo2(x = 3, y = 5): if x > y: return x return y ``` -- `foo2()` returns the larger of two numbers. --- ## <i class="fa fa-user-edit"></i> Intermezzo 4.1: <br> Determine the function of the function (cont.) ```python def foo4(x = 6): result = 1 for i in range(1, x + 1): result = result * i return result ``` -- `foo4()` returns the factorial of a number (`6!` => `6 * 5 * 4 * 3 * 2 * 1`). <br> -- ```python def foo6(x = 25): if x == 1: return 1 return x * foo6(x - 1) ``` -- `foo6()` returns the factorial of a number via *recursion*. --- ## <i class="fa fa-user-edit"></i> A function to convert temperatures - Define a function that converts from temperatures in degrees Fahrenheit to degrees celcius. (To convert Fahrenheit to Celcius, subtract 32 and then multiply by 5/9.) - Test your function with any temperature. --- ## <i class="fa fa-user-edit"></i> A function to convert temperature: Solution ```python def convert_to_celcius(deg_f): deg_c = (deg_f - 32) * (5 / 9) return deg_c ``` ```python #> convert_to_celcius(100) 37.77777777777778 ``` --- class: center middle inverse # Program structure ----- ### Practicing coding a more complex program <br> with a population genetic simulation <br> <br> <br> <br> --- ## The outline of the program We will simulate genetic drift: random changes in allele frequency over time that happen in finite populations. **We'll ask how long it takes before one of the two alleles at a single locus (`A` or `a`) drifts to fixation.** -- <br> To do so, we'll divide our code into several functions: - A function that **initializes the population**. It takes as input the size of the population (`N`), and the probability of having an `A` allele (`p`), and returns an entire "population", i.e. a collection of diploid genotypes that represent individuals. - A function that **computes genotype frequencies**, which we will need to determine whether an allele has gone to fixation. The function takes a population as input and returns the count for each genotype. - A function that takes the current population and **produces the next generation**. <br> --- ## We'll put our functions in a separate script <br> Start by opening a new file and saving it as `drift.py`. --- ## The function to build the population ```python import numpy as np # for random numbers def build_population(N, p): """The population consists of N individuals. Each individual has two chromosomes, containing allele "A" or "a", with probability p and 1-p, respectively. The population is a list of tuples. """ population = [] # Start with empty list for i in range(N): # Loop over individuals allele1 = "A" # The "default" allele is A if np.random.random() > p: # Random number between 0 and 1 allele1 = "a" allele2 = "A" # Second chromosome (diploidy) if np.random.random() > p: allele2 = "a" population.append((allele1, allele2)) return population ``` --- ## Use Docstrings to document your functions! .content-box-info[ Note the description of the function provided in a special multiline comment (triple quotes), which is called a **"docstring"** (documentation string). ```python def build_population(N, p): """The population consists of N individuals. Each individual has two chromosomes, containing allele "A" or "a", with probability p and 1-p, respectively. The population is a list of tuples. """ ``` <br> We can see this documentation, for instance, if call the help for our function: ```python >>> build_population? >>> help(build_population) ``` ] --- ## Test the function (not in the script) With a `p` of 0.5, we expect roughly equal proportions of `A` and `a` and differences between two instantiations: ```python >>> build_population(N = 5, p = 0.5) #> [('a', 'a'), ('a', 'A'), ('A', 'A'), ('A', 'a'), ('A', 'A')] >>> build_population(N = 5, p = 0.5) #> [('A', 'a'), ('A', 'a'), ('a', 'a'), ('A', 'A'), ('A', 'a')] ``` -- <br> With a `p` other than 0.5, we should see uneven numbers of `a`s and `A`s: ```python >>> build_population(N = 5, p = 0.9) #> [('A', 'A'), ('A', 'A'), ('A', 'a'), ('A', 'A'), ('A', 'A')] >>> build_population(N = 5, p = 0) #> [('a', 'a'), ('a', 'a'), ('a', 'a'), ('a', 'a'), ('a', 'a')] ``` --- ## The function to calculate genotype frequencies There are four possible diploid genotypes, so our next function should simply count the number occurrences of each genotype: ```python def compute_frequencies(population): """ Count the genotypes. Returns a dictionary with counts for each genotype. """ AA = population.count(("A", "A")) Aa = population.count(("A", "a")) aA = population.count(("a", "A")) aa = population.count(("a", "a")) return({"AA": AA, "aa": aa, "Aa": Aa, "aA": aA}) ``` --- ## Test the function (not in the script) ```python >>> my_pop = build_population(6, 0.5) >>> compute_frequencies(my_pop) #> {'AA': 1, 'aa': 1, 'Aa': 2, 'aA': 2} ``` --- ## The function to create a new generation ```python def reproduce_population(population): """ Create a new generation through sexual reproduction For each of N new offspring: - Choose the parents at random - The offspring receives one chromosome from each parent """ new_generation = [] N = len(population) for i in range(N): dad = np.random.randint(N) # Random int between 0 & N-1 mom = np.random.randint(N) # (to pick indiv. from pop) chr_mom = np.random.randint(2) offspring = (population[mom][chr_mom], population[dad][1 - chr_mom]) new_generation.append(offspring) return(new_generation) ``` --- ## The function to create a new generation ```python def reproduce_population(population): """ Create a new generation through sexual reproduction For each of N new offspring: - Choose the parents at random - The offspring receives one chromosome from each parent """ new_generation = [] # Start with an empty list N = len(population) # Infer N (no need for an arg) for i in range(N): # Run through individuals 1 at a time dad = np.random.randint(N) # Random int between 0 & N-1 mom = np.random.randint(N) # (to pick indiv. from pop) chr_mom = np.random.randint(2) # Pick chromosome: 1 or 0 offspring = (population[mom][chr_mom], # 1 chr. from mom population[dad][1 - chr_mom]) # 1 chr. from dad new_generation.append(offspring) # Add to master list return(new_generation) ``` --- ## Test the function (not in the script) ```python >>> reproduce_population(my_pop) #> [('A', 'A'), ('A', 'A'), ('A', 'A'), ('A', 'a'), ('A', 'a'), ('a', 'A')] ``` --- ## A new script for the main program - Open a new file and save it as `simulate_drift.py`. - In it, we will first import our functions: ```python import drift ``` <br> <br> <br> .content-box-diy[ If your Conda installation of additional Python packages failed or is still running, use a variant with SciPy functions instead: ```sh # In the shell: cd /fs/ess/PAS1855/users/$USER/week10 cp /fs/ess/PAS1855/users/jelmer/week10/drift_scipy.py drift.py ``` ] --- ## Get help for our own module! ```python >>> help(drift) #> Help on module drift: #> #> NAME #> drift #> #> FUNCTIONS #> build_population(N, p) #> The population consists of N individuals. #> Each individual has two chromosomes, containing #> allele "A" or "a", with probability p and 1-p, #> respectively. #> #> The population is a list of tuples. #> #> compute_frequencies(population) #> Count the genotypes. #> Returns a dictionary with counts for each genotype. #> #> reproduce_population(population) #> Create a new generation through sexual reproduction #> For each of N new offspring: #> ... ``` --- ## The main program, which we will also put in a function ```python def simulate_drift(N, p): # Initialize the population: my_pop = drift.build_population(N, p) num_generations = 0 # To keep track of the generation number fixation = False # Stop when fixation == True while fixation == False: # Compute genotype counts: genotype_counts = drift.compute_frequencies(my_pop) # If one allele went to fixation, stop: if genotype_counts["AA"] == N or genotype_counts["aa"] == N: print("An allele reached fixation at generation", num_generations) print("The genotype counts are") print(genotype_counts) fixation == True break # If not, reproduce: my_pop = drift.reproduce_population(my_pop) num_generations = num_generations + 1 ``` --- ## Test the program ```python >>> simulate_drift(100, 0.5) #> An allele reached fixation at generation 66 #> The genotype counts are #> { ' aa ' : 100, ' aA ' : 0, ' AA ' : 0, ' Aa ' : 0} >>> simulate_drift(100, 0.9) #> An allele reached fixation at generation 20 #> The genotype counts are #> { ' aa ' : 0, ' aA ' : 0, ' AA ' : 100, ' Aa ' : 0} >>> simulate_drift(10, 0.5) #> An allele reached fixation at generation 7 #> The genotype counts are #> {'AA': 0, 'aa': 10, 'Aa': 0, 'aA': 0} ``` --- class: center middle inverse # Questions? ----- <br> <br> <br> <br> --- class: center middle inverse # Bonus material ----- <br> <br> <br> <br> --- background-color: #f2f5eb ## Saving and loading Python objects with `pickle` It is possible to save Python objects to file "as is". This can be handy if the object took a long time to compute, you need to use it again in a different Python session, and storage in a plain text file is not optimal. `pickle` will create binary files from Python objects that can later be loaded. - To save an object as a `pickle` file (the `.pickle` extension is not needed but makes it clear what the file is): ```python import pickle # "wb": Write in Binary mode pickle.dump(my_pop, open("population.pickle", "wb")) ``` - To load the object in the `pickle` file: ```python # "rb": Read in Binary mode population = pickle.load(open("population.pickle", "rb")) ``` --- background-color: #f2f5eb ## Removing some redundancy from the main function: ```python def simulate_drift(N, p): my_pop = drift.build_population(N, p) num_generations = 0 #fixation = False #while fixation == False: while True: genotype_counts = drift.compute_frequencies(my_pop) if genotype_counts["AA"] == N or genotype_counts["aa"] == N: print("An allele reached fixation at generation", num_generations) print("The genotype counts are") print(genotype_counts) #fixation == True break my_pop = drift.reproduce_population(my_pop) num_generations = num_generations + 1 ``` --- background-color: #f2f5eb ## Another simple function A function to nicely print a dictionary: ```python def print_dictionary(mydic): for k, v in mydic.items(): print("key: ", k, " value: ", v) ``` <br> -- Let's try it: ```python print_dictionary({"a": 3.4, "b": [1, 2, 3, 4], "c": "astring"}) #> key: a value: 3.4 #> key: b value: [1, 2, 3, 4] #> key: c value: astring ``` --- background-color: #f2f5eb ## Keyword arguments can be passed in any order ```python def squared(start = 1, end = 10): ... ``` With these so-called "keyword arguments", we can also reverse the order of the arguments, as long as we are calling them by name: ```python squared(end = 5, start = 2) #> [4, 9, 16] squared(5, 2) #> [] ```