pracs-sp21: Week 11 content overview and readings

Content overview for this week

This week, you’ll first learn about using NumPy and Pandas, two large and popular Python packages for data analysis. Next, we’ll talk about BioPython, an ecosystem of Python packages for bioinformatics that is used primarily to work with DNA sequence data.

Some of the things you will learn this week:

The key NumPy and Pandas data structures: the (n-dimensional) array and the DataFrame.
Vectorized operations and how they can be done with arrays and DataFrames.
How to index, slice, and manipulate arrays and DataFrames to analyze data.
How to use various BioPython modules to download sequence data from the NCBI, parse and subset FASTA files, and run BLAST.
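
As a small preview of the NumPy and Pandas items above, here is a minimal sketch of vectorized operations, indexing, and slicing. The data and the names (`lengths`, `seq_id`, `length_kb`) are made up for illustration and do not come from the course materials.

```python
import numpy as np
import pandas as pd

# Hypothetical data: four sequence lengths in base pairs.
lengths = np.array([120, 340, 560, 80])

# Vectorized arithmetic: the division is applied to every element, no loop needed.
lengths_kb = lengths / 1000

# Boolean indexing keeps the elements that satisfy a condition.
long_ones = lengths[lengths > 100]

# Slicing works as it does for lists: here, the first two elements.
first_two = lengths[:2]

# A DataFrame holds labelled columns; operations on columns are vectorized too.
df = pd.DataFrame({"seq_id": ["a", "b", "c", "d"], "length": lengths})
df["length_kb"] = df["length"] / 1000

# Select rows by condition and columns by name in one step with .loc.
subset = df.loc[df["length"] > 100, ["seq_id", "length_kb"]]
print(subset)
```

The same pattern (compute on whole columns, then filter with a boolean condition) carries over to larger tables read from files.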

Readings

Required readings

CSB Chapter 6: “Scientific Computing”

Sections 6.2.3-6.2.5 (“Linear algebra”, “Integration”, and “Optimization”, respectively) are optional reading and won’t be discussed in class.

Further resources

These resources are mostly from CSB 6.7 (“References and Reading”), with some additions and replacements:

NumPy and Pandas

For more NumPy, I would recommend the official NumPy tutorial over the one mentioned in the book.
For more Pandas, I would similarly look first at the tutorials on the Pandas website.
The “Python for Data Analysis” book recommended by CSB is available online through OSU’s library. The book covers NumPy, Pandas, and plotting with Python, among other things.
A similar, excellent book is the “Python Data Science Handbook” by Jake VanderPlas, which is freely available online.

BioPython

Perhaps the best place to start is this workshop tutorial by Peter Cock.
For more, the BioPython Tutorial and Cookbook has a comprehensive overview of BioPython functionality (HTML / PDF).
