This week, you’ll first learn about NumPy and Pandas, two large and popular Python packages for data analysis. Next, we’ll talk about BioPython, an ecosystem of bioinformatics packages used primarily to work with DNA sequence data.
Some of the things you will learn this week:
The key NumPy and Pandas data structures: the (n-dimensional) array and the DataFrame.
Vectorized operations and how they can be done with arrays and DataFrames.
How to index, slice, and manipulate arrays and DataFrames to analyze data.
How to use various BioPython modules to download sequence data from the NCBI, parse and subset FASTA files, and run BLAST.
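To give a rough idea of what vectorized operations, indexing, and slicing look like in practice, here is a minimal sketch. The numbers and the column names (`length`, `width`, `area`) are made up for illustration and do not come from the book or the course data.

```python
import numpy as np
import pandas as pd

# A small 2-D array: 3 rows (samples) x 2 columns (measurements).
# The values are invented for illustration only.
arr = np.array([[1.0, 2.0],
                [3.0, 4.0],
                [5.0, 6.0]])

# Vectorized operation: the multiplication is applied to every element
# at once, without writing an explicit loop.
doubled = arr * 2

# Slicing: the first two rows of the second column.
second_col = arr[:2, 1]

# Boolean indexing: keep only the elements greater than 3.
large = arr[arr > 3]

# The same data as a DataFrame with (hypothetical) column names.
df = pd.DataFrame(arr, columns=["length", "width"])

# Vectorized arithmetic also works on whole DataFrame columns.
df["area"] = df["length"] * df["width"]

# Select rows by a condition and columns by name with .loc.
wide = df.loc[df["width"] > 2, ["length", "area"]]

print(doubled)
print(second_col)
print(large)
print(wide)
```

The array and the DataFrame support the same style of element-wise arithmetic. The main difference shown here is that the DataFrame lets you refer to columns by name rather than by position.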
CSB Chapter 6: “Scientific Computing”
Sections 6.2.3-6.2.5 (“Linear algebra”, “Integration”, and “Optimization”, respectively) are optional reading and won’t be discussed in class.
These resources are mostly from CSB 6.7 (“References and Reading”), with some additions and replacements:
For more NumPy, I would recommend the official NumPy tutorial over the one mentioned in the book.
For more Pandas, I would similarly look first at the tutorials on the Pandas website.
The “Python for Data Analysis” book recommended by CSB is available online through OSU’s library. The book covers NumPy, Pandas, and plotting with Python, among other things.
A similar, excellent book is the “Python Data Science Handbook” by Jake VanderPlas, which is freely available online.
For BioPython, perhaps the best place to start is this workshop tutorial by Peter Cock.
For more, the BioPython Tutorial and Cookbook has a comprehensive overview of BioPython functionality (HTML / PDF).
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".