class:inverse middle center # *Week 11 - Python: Scientific computing* ---- # II: Pandas <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/03/25 --- ## Pandas Pandas is short for "Python Data Analysis Library". As NumPy has arrays, the signature data structure of Pandas it its `DataFrame`. **With Pandas' `DataFrame`s, we can analyze tabular data much more smoothly in Python.** -- If you are familiar with R's *tidyverse*, Pandas allows for similar types of operations on tabular data: - Basic manipulations such as removing and adding columns, filtering data. - Merging (joining) data sets based on shared columns. - Summarize data including with "split-apply-combine": compute statistics for multiple groups at once. - Pivoting from "wide" to "long" data. - Integration with a plotting library, Matplotlib. --- ## Getting set up Panda is built on NumPy, so one often imports NumPy along with Pandas: ```python import pandas as pd import numpy as np ``` Next, let's move into the CSB chapter's sandbox directory: ```python import os os.chdir('/fs/ess/PAS1855/users/<user>/CSB/scientific/sandbox') # os.chdir('../CSB/scientific/sandbox') ``` --- ## Reading in tabular data For tabular data to be analyzed by NumPy and Pandas, we don't typically read files line-by-line, but we load them all at once – the standard Pandas function for this is **`read_csv()`**: ```python data = pd.read_csv("../data/Dale2015_data.csv") ``` <br> .content-box-info[ The argument `sep` can be used to specify the delimiter, e.g. for tab-delimited files (`sep = '\t'`), though Pandas can typically auto-detect this too. There are corresponding functions to read in **Excel** and **JSON** files, among others. ] --- ## Exploring DataFrames - Show the first 6 rows with `head()`: ```python data.head() #> Scientific_name English_name TipLabel Female_plumage_score Male_plumage_score #> 0 Abroscopus albogularis Rufous-faced Warbler Abroscopus_albogularis 45.833333 51.944444 #> 1 Abroscopus schisticeps Black-faced Warbler Abroscopus_schisticeps 48.888889 47.777778 #> 2 Abroscopus superciliaris Yellow-bellied Warbler Abroscopus_superciliaris 43.194444 43.611111 #> 3 Acanthagenys rufogularis Spiny-cheeked Honeyeater Acanthagenys_rufogularis 39.305556 39.166667 #> 4 Acanthidops bairdii Peg-billed Finch Acanthidops_bairdii 40.555556 54.166667 ``` - Just typing the name of the dataframe will also return a very useful view in our interactive Python window: ```python data ``` --- ## Exploring DataFrames: attributes - **Note that NumPy functions, methods, and attributes will also work for Pandas `DataFrames`!** For instance, we can check the length along each dimension with `shape`: ```python data.shape #> (5831, 5) ``` --- ## Exploring DataFrames: data types - `dtypes` returns data types by column (note, strings are called `object`): ```python data.dtypes #> data_raw.dtypes #> Scientific_name object #> English_name object #> TipLabel object #> Female_plumage_score float64 #> Male_plumage_score float64 #> dtype: object ``` .content-box-info[ Pandas `DataFrame`s can be thought of as **a set of *aligned* 1-dimensional NumPy arrays of equal length**, with each `DataFrame` column representing an array. Each column contains elements of the same data type while different columns typically have different data types. (You can also think of `DataFrame`s is as a **special kind of dictionary**: each column is an item with column names as keys.) ] --- ## Exploring DataFrames: indices - `columns` returns an `Index` object with the column names: ```python data.columns #> Index(['Scientific_name', 'English_name', 'TipLabel', #> 'Female_plumage_score', 'Male_plumage_score'], #> dtype='object') ``` <br> - A `DataFrame` also has an explicit **rowwise index** (noticed the row numbering when we printed `data`?): ```python data.index #> RangeIndex(start=0, stop=5831, step=1) ``` This index can also be manipulated and could for example consist of *strings* as well. --- ## Exploring DataFrames: describe() The method `describe()` gives an overview of the numerical columns: ```python data.describe() #> Female_plumage_score Male_plumage_score #> count 5831.000000 5831.000000 #> mean 47.527630 51.009189 #> std 6.997343 8.200663 #> min 34.166667 34.027778 #> 25% 42.361111 44.166667 #> 50% 45.833333 49.722222 #> 75% 51.250000 57.083333 #> max 72.222222 74.166667 ``` --- ## Exploring DataFrames: info() And `info()` gives another type of useful overview, for all columns: ```python data.info() #> <class 'pandas.core.frame.DataFrame'> #> RangeIndex: 5831 entries, 0 to 5830 #> Data columns (total 5 columns): #> # Column Non-Null Count Dtype #> --- ------ -------------- ----- #> 0 Scientific_name 5831 non-null object #> 1 English_name 5831 non-null object #> 2 TipLabel 5831 non-null object #> 3 Female_plumage_score 5831 non-null float64 #> 4 Male_plumage_score 5831 non-null float64 #> dtypes: float64(2), object(3) #> memory usage: 227.9+ KB ``` --- ## Exploring DataFrames: selecting columns We can specify columns not with the syntax `df["colname"]` (think dictionary!), as well as with the shorthand `df.colname`: ```python data["Scientific_name"] data.Scientific_name #> 0 Abroscopus albogularis #> 1 Abroscopus schisticeps #> 2 Abroscopus superciliaris #> 3 Acanthagenys rufogularis #> 4 Acanthidops bairdii #> ... #> 5826 Zosterops uropygialis #> 5827 Zosterops vaughani #> 5828 Zosterops vellalavella #> 5829 Zosterops wallacei #> 5830 Zosterops xanthochroa #> Name: latin, Length: 5831, dtype: object ``` When we extract a single column, the resulting 1-dimensional Pandas array is called a `Series`. --- ## Manipulating DataFrames: renaming columns - We can rename columns using `rename()` with a **dictionary of old and new column names.** Let's do so to make the column names and therefore our lines of code a bit shorter: ```python data = data.rename(columns={"Scientific_name": "latin", "English_name": "english", "Female_plumage_score": "score_f", "Male_plumage_score": "score_m"}) ``` --- ## Manipulating DataFrames: adding columns - Create a new column that sums the plumage score for males and females: ```python data["score_sum"] = data["score_f"] + data["score_m"] data.head() ``` -- <br> - We could also easily add a column with a constant using a vectorized operation, like we have seen with NumPy: ```python data["Study"] = 1 data.head() ``` --- ## Manipulating DataFrames: removing columns - We can use the familiar, generic `del()` function to drop one column, or the more verbose `drop()` method to remove one or more columns: ```python del(data["score_sum"]) data.drop(["TipLabel", "Study"], axis = 1, inplace = True) ``` - `axis = 0` acts along rows, and `axis = 1` along columns. - `inplace` means that the `DataFrame` is modified in place. --- ## Thursday setup ```python import pandas as pd import numpy as np import os username = os.environ.get('USER') os.chdir('/fs/ess/PAS1855/users/' + username + '/CSB/scientific/sandbox') ``` ```python !head "../data/Dale2015_data.csv" #> Scientific_name,English_name,TipLabel,Female_plumage_score,Male_plumage_score #> Abroscopus albogularis,Rufous-faced Warbler,Abroscopus_albogularis,45.83333333,51.94444444 #> Abroscopus schisticeps,Black-faced Warbler,Abroscopus_schisticeps,48.88888889,47.77777778 #> Abroscopus superciliaris,Yellow-bellied Warbler,Abroscopus_superciliaris,43.19444444,43.61111111 ``` ```python data = pd.read_csv("../data/Dale2015_data.csv") data = data.rename(columns={"Scientific_name": "latin", "English_name": "english", "Female_plumage_score": "score_f", "Male_plumage_score": "score_m"}) ``` --- ## Indexing and slicing DataFrames - After selecting a column, we can slice it just like we would expect to for an array/list: ```python data["latin"][:3] # First three items #> 0 Abroscopus albogularis #> 1 Abroscopus schisticeps #> 2 Abroscopus superciliaris ``` <br> - But we can't slice the entire `DataFrame` as we may have expected after learning NumPy: ```python data[:3, :3] #> TypeError: '(slice(None, 3, None), slice(None, 3, None))' is an invalid key ``` --- ## Indexing and slicing DataFrames (cont.) - Instead, we need to use the **`iloc()` method**: ```python data.iloc[2, 1] #> 'Yellow-bellied Warbler' # First 2 rows, all columns (out-of-range slices work): data.iloc[:2, :10] # First 2 rows, columns 2 and 4: data.iloc[:2, [1, 3]] # Last 5 rows, all columns: data.iloc[-5:, :] ``` --- ## Indexing and slicing DataFrames (cont.) We can also use the **`loc` method**, which selects rows and columns using **explicit labels**. The labels for the columns are –of course– the column names, whereas the labels for the rows are in the **row number index**. Notice that ranges given with `loc` are *inclusive*: so `data.loc[:3, ]` returns the first *four* rows! ```python data.loc[:3, ["latin", "english"]] #> 0 Abroscopus albogularis Rufous-faced Warbler #> 1 Abroscopus schisticeps Black-faced Warbler #> 2 Abroscopus superciliaris Yellow-bellied Warbler #> 3 Acanthagenys rufogularis Spiny-cheeked Honeyeater ``` --- ## Masking (filtering) DataFrames Like we've seen for NumPy arrays, we can also perform **masking**: filter the data based on Booleans retrieved after a test. - For instance, to select the row(s) for a certain species: ```python data["latin"] == "Zosterops mouroniensis" #> 0 False #> 1 False #> 2 False #> ... data[data["latin"] == "Zosterops mouroniensis"] # Scientific_name English_name TipLabel Female_plumage_score Male_plumage_score Sum_scores Study # 5801 Zosterops mouroniensis Mount Karthala White-eye Zosterops_mouroniensis 47.916667 47.5 95.416667 1 ``` -- - Or get only the most highly dimorphic species: ```python data[(data["score_m"] > 70) & (data["score_f"] < 40)] #> latin english TipLabel score_f score_m #> 875 Chiroxiphia pareola Blue-backed Manakin Chiroxiphia_pareola 39.305556 71.666667 #> 1355 Cyornis hainanus Hainan Blue-flycatcher Cyornis_hainanus 37.222222 71.250000 #> 1811 Euplectes hartlaubi Marsh Widowbird Euplectes_hartlaubi 39.583333 70.555556 ``` --- <figure> <p align="center"> <img src=img/Blue-backedManakin.jpeg width="62%"> <figcaption>Blue-backed Manakin - males (<a href="http://www.arctracer.com/photos/trinidad/birds/Blue-backedManakin.html">Image source</a>)</figcaption> </p> </figure> <figure> <p align="center"> <img src=img/manakin_female.jpg width="62%"> <figcaption>Blue-backed Manakin - female (<a href="http://www.arthurgrosset.com/sabirds/photos/chipar28910.jpg">Image source</a>)</figcaption> </p> </figure> --- <figure> <p align="center"> <img src=img/marsh_widowbird.jpg width="60%"> <figcaption>Marsh Widowbird - male (<a href="http://www.arthurgrosset.com/sabirds/photos/chipar28910.jpg">Image source</a>)</figcaption> </p> </figure> <figure> <p align="center"> <img src=img/marsh_widowbird_female.jpg width="60%"> <figcaption>Marsh Widowbird - female (<a href="http://www.arthurgrosset.com/sabirds/photos/chipar28910.jpg">Image source</a>)</figcaption> </p> </figure> --- ## Summary statistics Like we have seen with NumPy, we can also easily get summary statistics: ```python data["score_m"].mean() #> 51.009189390042877 data["score_m"].median() #> 49.72222222 data["score_m"].std() #> 8.2006629346736908 ``` <br> .content-box-info[ There are many more possibilities to summarize (and manuipulate) `DataFrame`s – for instance, statistics can be computer separately for multiple groups at once using `groupby()` ] --- ## Some simple plots with Matplotlib – histogram: ```python data["score_m"].hist() ``` <p align="center"> <img src=img/hist.svg width="90%"> </p> --- ## Simple plots with Matplotlib – scatterplot ```python data.plot.scatter(x = "score_m", y = "score_f") ``` <p align="center"> <img src=img/scatter.svg width="90%"> </p> --- ## Simple plots with Matplotlib – boxplot: ```python data[["score_m", "score_f"]].plot.box() ``` <p align="center"> <img src=img/boxplot.svg width="90%"> </p> --- ## <i class="fa fa-user-edit"></i> Your turn 1. For how many species do both males and females have a plumage score higher than 70? And what are their English names? --- ## <i class="fa fa-user-edit"></i> Your turn: Solutions ```python data[(data["score_m"] > 70) & (data["score_f"] > 70)].shape[0] #> 5 data[(data["score_m"] > 70) & (data["score_f"] > 70)]["english"] #> 363 Unicoloured Jay #> 1520 Indigo Flowerpiercer #> 1717 Red Warbler #> 1736 Red-headed Parrotfinch #> 2739 Red-headed Malimbe #> Name: english, dtype: object ``` <figure> <p align="center"> <img src=img/red_warbler.jpg width="45%"> <figcaption>Red Warbler of unknown sex (<a href="https://ebird.org/species/redwar1">Image source</a>)</figcaption> </p> </figure> --- ## <i class="fa fa-user-edit"></i> Pandas exercise for homework <br> The third exercise for this week uses Pandas to investigate whether scientific manuscript rejection rates are higher when the manuscript gets more reviewers. --- class: center middle inverse # Questions? ----- <br> <br> <br> <br> --- class: center middle inverse # Bonus material ----- <br> <br> <br> <br> --- background-color: #f2f5eb ## "Views" versus copies in Pandas When we slice and filter `DataFrames`, we generally get a "**view**" of the data, as opposed to a *copy* of the data. Therefore, we get a warning when we try to modify such as view: ```python high_male_score = data[data["score_m"] > 65] high_male_score["score_qual"] = "High" #> [...] SettingWithCopyWarning: #> A value is trying to be set on a copy of a slice from a DataFrame. ``` -- Our operation still worked – a column has been added to `high_male_score` but not to data: ```python high_male_score.columns #>Index(['latin', 'english', 'TipLabel', 'score_f', 'score_m', 'score_qual'], #> dtype='object') data.columns #> Index(['latin', 'english', 'TipLabel', 'score_f', 'score_m'], #> dtype='object') ``` --- background-color: #f2f5eb ## "Views" versus copies in Pandas (cont.) But in such case, we should instead make a copy of the data: ```python high_male_score = data[data["score_m"] > 65].copy() high_male_score["score_qual"] = "High" ``` --- background-color: #f2f5eb ## Masking (filtering) DataFrames - We can even select rows based on only part of the cell content, using the **`str.contains()`** method: ```python data[data.english.str.contains("Bower")] #> [All rows containing "Bower" in the "english" column] data[data.english.str.contains("Bower")]["latin"][:3] #> 188 Amblyornis flavifrons #> 189 Amblyornis inornata #> 190 Amblyornis macgregoriae #> Name: Scientific_name, dtype: object ``` .content-box-q[ Try this: ```python data[data.english.str.contains("Bower")].loc[:2, "latin"] ``` Why is it not returning any rows? ]