class:inverse middle center # *Week 8 - First steps with Python* ---- # II: Data structures <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/03/04 --- class: inverse middle center # Overview ---- .left[ - ### [Lists](#lists) - ### [Dictionaries](#dicts) - ### [Tuples](#tuples) - ### [Sets](#sets) ] --- class: center middle inverse name: lists # Lists ----- <br> <br> <br> <br> --- ## Lists **A list stores a collection of items, similar to arrays in Bash.** A list is denoted with square brackets `[]`, and with individual items (elements) separated by commas. - Two ways to create an empty list: ```python new_list = [] another_list = list() ``` - Create a list with values – note that they don't have to be of the same type: ```python first_list = ["a", "e", "g"] second_list = [45, 33, 99] third_list = [3, 2.44, "green", True] ``` -- - We can also convert other object types, such as a string, into a list: ```python a = list("0123456789") a #> ['0' ,'1', '2', '3', '4', '5', '6', '7', '8', '9'] ``` --- ## Accessing list elements – and zero-indexing - We can access items within a list by indicating position(s) within square brackets. For instance: ```python my_list = ['a', 'b', 'c', 'd', 'e'] my_list[1] #> 'b' ``` But this returns the *second* item! -- - **Python starts counting at 0**, so the first element corresponds to index 0: ```python my_list[0] #> 'a' ``` --- ## Accessing list elements – and zero-indexing (cont.) - Moreover, when using *ranges*, such as with a colon `:`, **the item corresponding to the last index is not included:** ```python # my_list = ['a', 'b', 'c', 'd', 'e'] my_list[0:3] # Elements 1, 2 and 3 #> ['a', 'b', 'c'] my_list[1:3] # Elements 2 and 3 #> ['b', 'c'] ``` -- - Therefore, a range of two successive indices returns a single-item list: ```python my_list[0:1] # First item in single-element list # ['a'] # Use a single index to return a "str": my_list[0] # First item as a "str" #> 'a' ``` --- ## Zero-indexing with half-open intervals <br> <figure> <p align="center"> <img src=img/indexing1.svg width="100%"> </p> </figure> --- ## Zero-indexing with half-open intervals <br> <figure> <p align="center"> <img src=img/indexing2.svg width="100%"> </p> </figure> --- ## More indexing with a colon (i.e., "slicing") - When using a colon to indicate a range, as we did in the previous slide, numbers on either side of the colon are optional: ```python # my_list = ['a', 'b', 'c', 'd', 'e'] # From the first element until element 3 (noninclusive): my_list[:3] #> ['a', 'b', 'c'] # From element 4 to the last element: my_list[3:] #> ['d', 'e'] ``` -- - As it stands to reason, we can therefore return the whole list with `[:]`: ```python my_list[:] #> ['a', 'b', 'c', 'd', 'e'] ``` --- ## Indexing with negative numbers - You can use *negative numbers* to index starting from the end: ```python # my_list = ['a', 'b', 'c', 'd', 'e'] my_list[-1] # Last element #> 'e' my_list[-2] # Second-to-last element #> 'd' ``` - And this works with ranges as well - the default direction is still positive: ```python my_list[-3:-1] # Third-to-last to last element (noninclusive) #> ['c', 'd'] ``` --- ## Indexing with negative numbers <figure> <p align="center"> <img src=img/indexing3.svg width="100%"> </p> </figure> --- ## Indexing with negative numbers <figure> <p align="center"> <img src=img/indexing4.svg width="100%"> </p> </figure> --- ## Indexing with two colons When using **two colons**, the third element is the direction and stride. - The default stride is `1` (positive 1): ```python # my_list = ['a', 'b', 'c', 'd', 'e'] my_list[:3:1] #> ['a', 'b', 'c'] ``` - We can take every second element with a stride of `2`: ```python my_list[::2] #> ['a', 'c', 'e'] ``` - And we can reverse the list with a stride of `-1`: ```python my_list[::-1] # Go backward with stride 1 and take whole list #> ['e', 'd', 'c', 'b', 'a'] ``` --- ## Chaining indexing - Since we can also index strings: ```python seq = 'GTACAG' seq[3] #> C ``` - ...we can use index chaining to, for example, index a string inside a list: ```python seqlist = ['AACGT', 'GTACAG', 'CTCTA'] seqlist[1][3] #> C ``` --- ## Bad indexing - When your index attempt contains a contradiction between the *implicit* direction (to:from) and the *explicit* direction (stride), an empty list is returned: ```python my_list[-1:-3] # With no stride indicated, it defaults to 1 #> [] ``` -- - When your indexing range reaches beyond the last element, everything until the last element will simply be returned: ```python my_list[0:10] #> ['a', 'b', 'c', 'd', 'e'] ``` -- - But when you try to access an individual nonexistent element, you will get an error: ```python my_list[5] #> ... #> IndexError: list index out of range ``` --- ## Operating on lists with `+` and `*` - Like strings, we can concatenate two lists with `+`: ```python # my_list = ['a', 'b', 'c', 'd', 'e'] # a = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9'] a + my_list #> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'a', 'b', 'c', 'd', 'e'] ``` -- - We *can't* add individual elements like this: ```python a + 1 #> TypeError: can only concatenate list (not "int") to list ``` - But we can add a list that we define on the fly, including a single-element one: ```python a + [1] #> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 1] ``` --- ## Operating on lists with `+` and `*` - Using the multiply operator, the list contents will be *repeated* n times: ```python a * 2 #> ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9'] samples = ['red'] * 5 samples # ['red', 'red', 'red', 'red', 'red'] ``` -- <br> .content-box-info[ Like the `+` operator, the `*` operator works equivalently on lists and strings: ```python my_repeat = "AGAAG" my_repeat * 3 #> 'AGAAGAGAAGAGAAG' ``` ] --- ## Side note: No automatic element-wise actions .content-box-info[ As the addition and multiplication of lists above already hinted at, Python operations do not by default act element-wise (unlike in R, where many operations are *vectorized*). To operate on each element of a list, we can: - Use a loop (next week), a *list comprehension* (next week), or the `map()` function. - Use *arrays* from the NumPy module to get R-like behavior. ] --- name: resume-here ## Setup for Tuesday Mar 9 (same as before) - **In the OnDemand form for VS Code, set the Working Directory to `/fs/ess/PAS1855/users/<your-username>`.** - Open your `week08.py` from last Thu (or start a new file). - A Python environment should load: see the bottom bar. If it's not the Python in your Conda environment (`Python 3.9.2 64-bit (ipy-env: conda)`), click the bottom bar and select it. - If you have no code in the file, type a dummy line like `5+5`. - Right-click in the editor/script and select "Run Selection/Line in Python Interactive Window". *(Or <kbd>Shift</kbd>+<kbd>Enter</kbd>.)* .content-box-info[ Is this doesn't work, we can troubleshoot after class again – for now, open a terminal in VS Code and start a regular Python prompt with: ```sh $ python3 ``` ] --- ## Recap - we talked about: - Using Python as a calculator - Variables and their types - Working with strings - Methods and functions - Our first data structure: lists - Indexing <figure> <p align="center"> <img src=img/donald_knuth.png width="90%"> https://xkcd.com/163/ </p> </figure> --- ## Modifying lists - We can *update* an element using index notation: ```python my_list = ['a', 'b', 'c', 'd', 'e'] my_list[0] = "x" my_list #> ['x', 'b', 'c', 'd', 'e'] ``` -- - To append an element to the end of the list, use the method `append()`: ```python my_list.append('f') # append: single item my_list # ['x', 'b', 'c', 'd', 'e', 'f'] ``` - To remove all elements from the list, use the method `clear()`: ```python my_list.clear() my_list #> [] ``` --- ## Misc. list methods ```python seq = list("TKAAVVNFT") seq #> ['T', 'K', 'A', 'A', 'V', 'V', 'N', 'F', 'T'] ``` - `count()` – Count occurrences of a certain item in the list: ```python seq.count("V") #> 2 ``` - `index()` – Return the index corresponding to first occurrence of an item: ```python seq.index("V") #> 4 ``` --- ## Misc. list methods (cont.) - `sort()` – Sort the list: ```python # seq = ['T', 'K', 'A', 'A', 'V', 'V', 'N', 'F', 'T'] seq.sort() seq #> ['A', 'A', 'F', 'K', 'N', 'T', 'V', 'V'] ``` -- - `reverse()` – Reverse the order of the elements: ```python seq.reverse() seq #> ['V', 'V', 'T', 'N', 'K', 'F', 'A', 'A'] ``` --- ## <i class="fa fa-user-edit"></i> Intermezzo 3.2 - Part I .panelset[ .panel[.panel-name[Questions] 1. Define a list `a = [1, 1, 2, 3, 5, 8]`. 2. Extract `[5, 8]` in two different ways (with positive and negative indices). 3. Add an element (item) containing `13` at the end of the list. ] .panel[.panel-name[Solutions] 1. Define a list: ```python a = [1, 1, 2, 3, 5, 8] ``` 2. Extract `[5, 8]` in two ways: ```python a[4:] # Or a[4:6] a[-2:] ``` 3. Add an item: ```python a.append(13) ``` ] ] --- class: center middle inverse name: dicts # Dictionaries ----- <br> <br> <br> <br> --- ## Dictionaries Dictionaries are like lists in which the elements (**values**) are indexed by **keys**. - Create an empty dictionary using curly braces or the `dict()` function: ```python empty_dict_1 = {} empty_dict_2 = dict() ``` - And one with items using `key: value` notation inside curly braces: ```python GenomeSize = {"Homo sapiens": 3200.0, "Escherichia coli": 4.6, "Arabidopsis thaliana": 157.0} GenomeSize #> {'Arabidopsis thaliana ' : 157.0, #> 'Escherichia coli' : 4.6, #> 'Homo sapiens' : 3200.0} ``` - The order of the keys does not matter, and is in fact not retained as it was entered! --- ## Accessing dictionary values - In dictionaries, individual *values* cannot be accessed by referencing by index (position). Instead, we use the name of the corresponding *key*: ```python GenomeSize["Arabidopsis thaliana"] #> 157.0 ``` Note that here, we are using square brackets and not curly braces, using the syntax `dictionary_name["key_name"]` to get the corresponding value. --- ## Modifying dictionaries - New values can be added simply by assigning a new key: ```python GenomeSize["Saccharomyces cerevisiae"] = 12.1 GenomeSize #> {' Arabidopsis thaliana': 157.0, #> 'Escherichia coli': 4.6, #> 'Homo sapiens': 3200.0, #> 'Saccharomyces cerevisiae': 12.1} ``` -- - If the key-value pair already existed, nothing happens: ```python GenomeSize["Escherichia coli"] = 4.6 ``` -- - But the old value will be overwritten if the new one is different: ```python GenomeSize["Homo sapiens"] = 3201.1 GenomeSize #> {'Arabidopsis thaliana': 157.0, #> 'Escherichia coli': 4.6, #> 'Homo sapiens': 3201.1, #> 'Saccharomyces cerevisiae': 12.1} ``` --- ## Functions and methods for dictionaries - The methods **`keys()`** and **`values()`** create list-like objects from the keys and values: ```python GenomeSize.keys() #> dict_keys(['Arabidopsis thaliana', 'Saccharomyces cerevisiae']) GenomeSize.values() #> dict_values([157.0, 12.1]) ``` - And similarly **`items()`** returns each full item, i.e. each key-value pair: ```python GenomeSize.items() #> dict_items([('Arabidopsis thaliana', 157.0), ('Saccharomyces cerevisiae', 12.1)]) ``` --- ## Functions and methods for dictionaries (cont.) - We can merge two dictionaries using `update()`: ```python D1 = {"a": 1, "b": 2, "c": 3} D2 = {"a": 2, "d": 4, "e": 5} D1.update(D2) D1 #> {'d': 4, 'e': 5, 'b': 2, 'a': 2, 'c': 3} ``` With `update()`, the order matters: for keys present in both dictionaries but with different values, the value of the *second* dictionary will overwrite that of the first. --- ## Functions and methods for dictionaries (cont.) - We saw that we can extract a value with square brackets, but we can also use the `get()` method. This will make a difference when the key does not exist – and we can even specify a default value with `get()`: ```python GenomeSize["Mus musculus"] #> KeyError: 'Mus musculus' GenomeSize.get("Mus musculus") #> GenomeSize.get("Mus musculus", -10) #> -10 ``` -- .content-box-info[ In the second example, nothing was printed, but `get()` actually returned `None`, Python's keyword to define a null value. We can see this by using the `print()` function: ```python print(GenomeSize.get("Allium cepa")) #> None ``` ] --- ## More complex dictionaries - Values of dictionaries can also be lists: ```python codon_dict = {"Phe": ['UUU', 'UUC'], "Val": ['GUU', 'GUC', 'GUA', 'GUG'], "Met": ['AUG']} ``` - And values of dictionaries can also be dictionaries themselves: ```python RNA_dict = {"bases": ['A', 'U', 'C', 'G']} RNA_dict["codons"] = codon_dict RNA_dict RNA_dict #> {'bases': ['A', 'U', 'C', 'G'], #> 'codons': {'Phe': ['UUU', 'UUC'], #> 'Val': ['GUU', 'GUC', 'GUA', 'GUG'], #> 'Met': ['AUG']}} ``` --- ## <i class="fa fa-user-edit"></i> Intermezzo 3.2 - Part II .panelset[ .panel[.panel-name[Questions] 1. Define a dictionary `codons = {'Met': 'AUG', 'Phe': 'UUU'}`. 2. Add the element `Trp` with value `"UGG"`. 3. Update the value of `Phe` to be `['UUU', 'UUC']`. ] .panel[.panel-name[Solutions] 1. Define a dictionary: ```python codons = {'Met': 'AUG', 'Phe': 'UUU'} ``` 2. Add an item: ```python codons["Trp"] = 'UGG' ``` 3. Update an item: ```python codons['Phe'] = ['UUU', 'UUC'] ``` ] ] --- class: center middle inverse name: tuples # Tuples ----- <br> <br> <br> <br> --- ## Tuples Tuples are similar to lists, in that they contain a sequence of values of any type, **but they are immutable:** the values cannot be changed once the tuple has been defined. This behavior provides write-protection, and makes tuples faster than lists. - Empty tuples are created using parentheses `()` or `tuple()`: ```python empty_tuple_1 = () empty_tuple_2 = tuple() ``` - Tuples that contain items are created using parentheses, and a trailing comma is needed when creating a single-item tuple: ```python my_tuple = (1, "two", 3) single_element_tuple = (4, ) # Trailing comma needed ``` --- ## Tuples (cont.) - Tuples can be indexed like lists: ```python my_tuple[0] #> 1 ``` - But when we try to modify a value, we get an error: ```python my_tuple[0] = 33 #> TypeError: ' tuple ' object does not support item assignment ``` --- ## Methods for tuples Because they can't be modified, tuples only have two methods: - `count()` to count the number of occurrences of an element: ```python tt = (1, 1, 1, 1, 2, 2, 4) tt.count(1) #> 4 ``` - `index()` to return the index of the first occurrence of an element: ```python tt.index(2) #> 4 ``` --- class: center middle inverse name: sets # Sets ----- <br> <br> <br> <br> --- ## Sets **Sets are lists with no duplicate entries.** - Sets can be created with curly braces `{}` — like dictionaries (!), but the syntax of the elements will make clear which you mean: ```python set1 = {3, 4, 5, 6} ``` - However, we can't use just curly braces to create an *empty* set becuase that is reserved for creating an empty dictionary – we can therefore only use `set()` to do so: ```python empty_set = set() ``` --- ## Sets (cont.) - In practice, it is common to create a set from a list – this is a way to quickly **get all unique elements within a list**: ```python my_list = [5, 6, 7, 7, 7, 8, 9, 9] set2 = set(my_list) set2 #> {5, 6, 7, 8, 9} ``` <br> - Using the `add()` method, we can add a value to a set, with a twist – **nothing will happen if the value is already present in the set**: ```py set2.add(5) set2 #> {5, 6, 7, 8, 9} ``` --- ## Comparing two sets We can use logical operators to look for shared and non-shared elements between different lists: - `setA & setB` – intersect: get elements present in *both sets* - `setA | setB` – union: get elements present in *at least one set* - `setA ^ setB` – symmetric difference: get elements present in *one set* - `setA - setB` – asymmetric difference: get elements only in `setA` - `setB - setA` – asymmetric difference: get elements only in `setB` For example: ```python #set1 = {3, 4, 5, 6} #set2 = {5, 6, 7, 8, 9} b & c #> {5, 6} b | c #> {3, 4, 5, 6, 7, 8, 9} b ^ c #> {3, 4, 7, 8, 9} ``` --- ## Comparing two sets <figure> <p align="center"> <img src=img/sets.svg width="75%"> </p> </figure> --- ## Summary of data structures | structure | bracket type | create empty | create non-empty |-----------|----------------|----------------|-------- | **list** | `[]` | `[]` | `[1, 2]` | **dictionary**| `{}` | `{}` | `{1: "a", 2: "b"}` | **tuple** | `()` | `()` | `(1, 2)` | **set** | `{}` | `set()` | `{1, 2}` <br> -- ```python type([1, 2]) #> list type({1: "a", 2: "b"}) #> dict type((1, 2)) #> tuple type({1, 2}) #> set ``` --- class: center middle inverse # Questions? ----- <br> <br> <br> <br> --- class: inverse middle center # Bonus Material ---- <br> .left[ - ### [Removing items from lists and dictionaries](#rm) - ### [More on comparing sets](#more-sets) ] <br> <br> <br> <br> --- background-color: #f2f5eb name: rm ## Removing items from lists - To delete one or more items *by index*, use the *function* `del()`: ```python del(my_list[2:4]) # Delete 3rd and 4th element a #> ['x', 'b', 'e', 'f'] ``` - `pop()` – Remove the last item in the list *and return it*: ```python seq_last = seq.pop() seq #> ['T', 'K', 'A' , 'A' , 'V', 'V' ,'N', 'F'] seq_last #> 'T' ``` --- background-color: #f2f5eb ## Removing items from dictionaries - As with lists, items can be deleted with the `del()` function: ```python del(GenomeSize['Homo sapiens']) ``` <br> - `pop()` removes the specified key and returns the corresponding value: ```python GenomeSize.pop("Escherichia coli") #> 4.6 GenomeSize #> {'Arabidopsis thaliana': 157.0, #> 'Saccharomyces cerevisiae': 12.1} ``` --- background-color: #f2f5eb name: more-sets ## *Methods* to compare two sets We can do the same operations that we did above with operators like `+` and `-` with the methods `intersection()`, `union()`, `symmetric_difference()`, and `difference()`: ```python s1 = {1, 2, 3, 4} s2 = {4, 5, 6} s1.intersection(s2) # Equivalent to: s1 & s2 #> {4} s1.union(s2) # Equivalent to: s1 | s2 #> {1, 2, 3, 4, 5, 6} s1.symmetric_difference(s2) # Equivalent to: s1 ^ s2 #> {1, 2, 3, 5, 6} s1.difference(s2) # In set 1 but not in set 2 #> {1, 2, 3} s2.difference(s1) # In set 2 but not in set 1 #> {5, 6} ``` --- background-color: #f2f5eb ## *Methods* to compare two sets We can also test whether sets are *subsets* or *supersets* of each other. - Set 1 is a subset of set 2 if if s2 contains all its elements, and optionally others too: ```python # s1 = {1, 2, 3, 4} # s2 = {4, 5, 6} s1.issubset(s2) #> False ``` - Set 1 is a superset of set 2 if set 1 contains all the elements in set 2, and optionally others too: ```python s1.issuperset(s2) #> False ```