This week, we will continue learning about working in the Unix shell, using Buffalo chapter 7, with a focus on data tools. New tools will include the less pager and the advanced commands sed and awk (and bioawk).
We’ll also revisit and expand our knowledge of several tools we have already seen in week 1, such as grep and sort, and the use of Unix “pipelines”. The examples will focus on working with sequence data in several formats.
Some of the things you will learn this week:
How clever use of Unix pipelines can save a lot of time when you need to extract, convert, and summarize data.
Using less to view files and quickly find and view text matches.
Using grep to find, count, and extract occurrences of strings and patterns.
Using sort together with cut and uniq / uniq -c to summarize columns of tabular data.
Using awk for many tasks on tabular data, including filtering rows, adding columns computed with arithmetic, and calculating means.
Using sed to substitute text and reformat data.
Optionally, understanding subshells and process substitution.
Optionally, using join to merge files by shared columns.
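To give a flavor of these goals, here is a minimal sketch of a few such commands. The file example.tsv and its four columns (chromosome, start, end, feature type) are made up for illustration and do not come from Buffalo's examples. The chapter covers each tool in much more depth.

```shell
# Hypothetical example data: a tiny tab-delimited file with the columns
# chromosome, start, end, and feature type (invented for this sketch).
printf 'chr1\t100\t200\tgene\nchr1\t300\t450\texon\nchr2\t50\t150\tgene\nchr1\t500\t900\tgene\n' > example.tsv

# grep -c counts the lines that contain a match for the pattern
gene_count=$(grep -c "gene" example.tsv)
echo "Lines matching 'gene': $gene_count"

# cut | sort | uniq -c: count how often each value occurs in column 1
chrom_counts=$(cut -f 1 example.tsv | sort | uniq -c)
echo "$chrom_counts"

# awk: keep rows whose length (end - start) exceeds 100,
# and append that length as a new fifth column
long_features=$(awk -F '\t' 'BEGIN { OFS = "\t" } $3 - $2 > 100 { print $0, $3 - $2 }' example.tsv)
echo "$long_features"

# awk: mean feature length across all rows (NR is the number of rows read)
mean_len=$(awk -F '\t' '{ total += $3 - $2 } END { print total / NR }' example.tsv)
echo "Mean length: $mean_len"

# sed: strip the "chr" prefix at the start of each line,
# then list the distinct chromosome names that remain
renamed=$(sed 's/^chr//' example.tsv | cut -f 1 | sort -u)
echo "$renamed"
```

Each step reads the same small file, so you can run the commands one at a time and compare the output with the input to see what each tool contributes to a pipeline.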
This week, you will read chapter 7 of Buffalo. It’s a pretty dense chapter with lots of examples. The examples all involve sequence data, but if you don’t work with sequence data, it should be fairly easy to see how these tools could be applied to any kind of text and tabular data.
Buffalo Chapter 7: “Unix Data Tools”
The following sections can be skimmed or skipped, depending on your bandwidth and interest in these topics:
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".