Working with files in the Unix shell II:
Viewing, summarizing and manipulating files
Week 3 – lecture C
1 Introduction
1.1 Context & overview
The Unix shell is great for many tasks other than the file browser-like ones you learned about in the previous lecture. One of those is performing basic viewing, querying, and editing operations on text files.
Recall from the Project File Organization lecture that you’ll work almost exclusively with so-called “plain-text” files in this course, such as:
- Sequence file formats such as FASTA, FASTQ, and GTF
- Spreadsheet-like/tabular files stored as TSV/CSV
- Scripts and code notebooks
- Documentation files such as Markdown
The commands we’ll discuss here can be used with plain-text files but not with “binary” formats like Excel or Word files.
1.2 Learning goals
In this lecture, you will learn to use the Unix shell to:
- View the contents of text files in various ways
- Search within, extract information from, and manipulate text files
1.3 Getting ready
Start a VS Code session like before:
Click here to see the instructions
- Log in to OSC’s OnDemand portal at https://ondemand.osc.edu
- In the blue top bar, select Interactive Apps and near the bottom, click Code Server
- Fill out the form as follows:
  - Cluster: pitzer
  - Account: PAS2880
  - Number of hours: 2
  - Working Directory: /fs/ess/PAS2880/users/<username> (replace <username> with your user name)
  - App Code Server version: 4.8.3
- Click Launch
- Click the Connect to VS Code button once it appears
- In VS Code, open a terminal by clicking Terminal => New Terminal
- Check that you are in /fs/ess/PAS2880/users/$USER by typing pwd in the terminal.
  (Recall that $USER is a variable that represents your username. If you’re not in that dir, it may be listed under Recents in the Get Started document – if so, click on that entry. Otherwise, click File > Open Folder and type/select /fs/ess/PAS2880/users/$USER.)
Open a Markdown file for notes:
- Click File > New File
- Save the file inside /fs/ess/PAS2880/users/$USER/week03, e.g. as lectureC.md
Change your working dir:
cd garrigos-data
2 Viewing the contents of text files
Several commands can view all or part of one or more text files, and we’ll discuss the most common ones below.
2.1 cat
The cat command prints the entire contents of one or more files to screen:
cat README.md
# README for `garrigos-data`
- Author: Jelmer Poelstra
- Affiliation: CFAES Bioinformatics Core, The Ohio State University
- Contact: <poelstra.1@osu.edu>
- URL: <https://github.com/jelmerp/garrigos-data>
- Date: 2024-01-20 (last updated: 2025-09-09)
This directory contains files associated with the paper
“Two avian _Plasmodium_ species trigger different transcriptional responses on their vector _Culex pipiens_”
([Garrigós et al. 2025, Molecular Ecology](https://doi.org/10.1111/mec.17240)).
The files are intended for practice purposes in the context of coursework and
other tutorials on omics / RNA-Seq data analysis.
Below follows a description of the files included in this directory.
## FASTQ files (in sub-directory `data/fastq`)
The FASTQ files are Illumina RNA-seq reads from _Culex pipiens_ samples.
These were downloaded from the European Nucleotide Archive ENA database using
accession number `PRJEB41609` and the tool
[`fastq-dl`](https://github.com/rpetit3/fastq-dl) v3.0.1 on 2024-01-20.
To simplify the dataset for practice purposes, the following modifications were made:
- FASTQ files from a number of samples were removed:
  - 2 samples also excluded in the study itself (see the paper for details)
  - All samples from the 21-day time point.
- FASTQ files were randomly "subset" to keep only 500,000 reads per file using the tool
[`seqtk`](https://github.com/lh3/seqtk) v1.3-r106.
## Metadata (in sub-directory `data/meta`)
Metadata from the study was downloaded from <https://doi.org/10.20350/digitalCSIC/15708>
and simplified to keep only:
- The sample ID and treatment columns
- The samples for which the FASTQ files were retained (see above).
## Reference annotation file (in sub-directory `data/ref`)
A reference genome GTF file for the RefSeq annotation of
_Culex pipiens_ genome `TS_CPP_V2` (`GCF_016801865.2`) was downloaded from NCBI
using the [NCBI Datasets tool](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/)
v15.31.1 on 2024-01-20.
For the analysis of the data, the reference genome FASTA file is also needed.
This is not included here,
but can be downloaded from NCBI using the following command:
```bash
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/801/865/GCF_016801865.2_TS_CPP_V2/GCF_016801865.2_TS_CPP_V2_genomic.fna.gz
```
cat meta/metadata.tsv
sample_id time treatment
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802878 10dpi control
ERR10802884 10dpi control
ERR10802877 10dpi control
ERR10802881 10dpi control
ERR10802876 10dpi relictum
ERR10802880 10dpi relictum
ERR10802885 10dpi relictum
ERR10802886 10dpi relictum
ERR10802864 24hpi cathemerium
ERR10802867 24hpi cathemerium
ERR10802870 24hpi cathemerium
ERR10802866 24hpi control
ERR10802869 24hpi control
ERR10802863 24hpi control
ERR10802871 24hpi relictum
ERR10802874 24hpi relictum
ERR10802865 24hpi relictum
ERR10802868 24hpi relictum
2.2 head and tail
Some files, especially when working with omics data, can be huge – so printing the whole file with cat is not always ideal. The twin commands head and tail can be useful, as they print only the first (head) or last (tail) lines of a file.
head & tail’s defaults are to print 10 lines:
head meta/metadata.tsv
sample_id time treatment
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802878 10dpi control
ERR10802884 10dpi control
ERR10802877 10dpi control
ERR10802881 10dpi control
ERR10802876 10dpi relictum
Use the -n option to specify the number of lines to print:
head -n 3 meta/metadata.tsv
sample_id time treatment
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
A neat trick with tail is to start printing at a specific line with -n +<starting-line>. This is often used to skip the header line:
# '-n +2' tells tail to start at line 2:
tail -n +2 meta/metadata.tsv
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802878 10dpi control
# [...output truncated...]
Next, let’s try to take a peek inside a FASTQ file. To print the first 8 lines, corresponding to 2 reads, use -n 8 with head:
head -n 8 fastq/ERR10802863_R1.fastq.gz
�
Խے�8�E��_1f�"�QD�J��D�fs{����Yk����d��*��
|��x���l�j�N������?������ٔ�bUs�Ng�Ǭ���i;_��������������|<�v����3��������|���ۧ��3ĐHyƕ�bIΟD�%����Sr#~��7��ν��1y�Ai,4
w\]"b�#Q����8��+[e�3d�4H���̒�l�9LVMX��U*�M����_?���\["��7�s\<_���:�$���N��v�}^����sw�|�n;<�<�oP����
i��k��q�ְ(G�ϫ��L�^��=��<���K��j�_/�[ۭV�ns:��U��G�z�ݎ�j����&��~�F��٤ZN�'��r2z}�f\#��:�9$�����H�݂�"�@M����H�C�
�0�pp���1�O��I�H�P됄�.Ȣe��Q�>���
�'�;@D8���#��St�7k�g��|�A䉻���_���d�_c������a\�|�_�mn�]�9N������l�٢ZN�c�9u�����n��n�`��
"gͺ�
���H�?2@�FC�S$n���Ԓh� nԙj��望��f �?N@�CzUlT�&�h�Pt!�r|��9~)���e�A�77�h{��~�� ��
# [...output truncated...]
Ouch! 😳 What went wrong here? (Click for the solution)
We were presented with the contents of the compressed file, which isn’t human-readable.
To get around the problem you just encountered with head, you might be inclined to decompress these files (which you could do with the gunzip command – week 5). However, at least when it comes to FASTQ files, it is better to keep them compressed:
- Uncompressed files take up much more disk storage space than compressed ones
- Almost any bioinformatics program accepts compressed FASTQ files
- You can view these files in compressed form, as shown below with less. Additionally, several commands including cat have a counterpart for compressed files (zcat in the case of cat; see the sketch just below this list).
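Here’s a quick sketch of what that could look like with zcat on one of these FASTQ files — unlike plain cat, it prints the decompressed, human-readable contents (and, like cat, it prints the entire file, so expect a lot of output):

```bash
# zcat prints the decompressed contents of a gzip-compressed file to the screen:
zcat fastq/ERR10802863_R1.fastq.gz
# [output not shown]
```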
Other sequence files, like assembly FASTA files and annotation GTF/GFF files, are often also compressed when you download them. These types of files, though, are more commonly decompressed before usage — e.g. because they don’t take up nearly as much space as a set of FASTQ files.
2.3 less: A file pager
The less command is rather different from the previous commands, which simply printed file contents to the screen and gave us our shell prompt back. Instead, less will open a file for you to browse through, and you need to explicitly quit the program to get your prompt back.
Also, less will automatically display gzip-compressed files in human-readable form: 🥳
less fastq/ERR10802863_R1.fastq.gz
@ERR10802863.8435456 8435456 length=74
CAACGAATACATCATGTTTGCGAAACTACTCCTCCTCGCCTTGGTGGGGATCAGTACTGCGTACCAGTATGAGT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@ERR10802863.27637245 27637245 length=74
GCCACACTTTTGAAGAACAGCGTCATTGTTCTTAATTTTGTCGGCAACGCCTGCACGAGCCTTCCACGTAAGTT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@ERR10802863.10009244 10009244 length=73
CTCGGCGTTAACTTCATCACGCAGATCATTCCGTTCCAGCAGCTGAAGCAAGACTACCGTCAGTACGAGATGA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
-S option to less suppresses “line-wrapping” (Click to expand)
When lines are too long to fit on your computer screen, less will by default “wrap” them onto a next line on the screen. That’s convenient for reading, but can be confusing for tabular formats and formats like FASTQ, because it means that one line on the screen no longer corresponds to one line in the file.
With the -S option to less, lines will not be wrapped but will “run out of the screen” on the right-hand side (press the rightward arrow → to see that part):
less -S fastq/ERR10802863_R1.fastq.gz
# [output not shown]
When you’re inside the less pager, you can move around in the file in several ways:
- By scrolling with your mouse
- With up ↑ and down ↓ arrows: move line-by-line
- With u (up) and d (down): move half a page at a time
- If you have them, with PgUp and PgDn keys: move page-by-page
- By pressing G to go to the end of the file, and g to go (back) to the top
To exit/quit less and get your shell prompt back, simply type q:
# Type 'q' to exit the less pager:
q
Exercise: less
With less, explore the FASTQ file a bit. Do you notice any unusual-looking reads?
Click for the solution
A number of reads are much shorter than the others, and only consist of Ns, i.e. uncalled bases. For example:
@ERR10802863.11918285 11918285 length=35
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###################################
2.4 wc -l to count lines
(This command doesn’t display file contents and would fit better in the section on Unix data tools below, but needs to be introduced for the next bit.)
The wc command by default counts the number of lines, words, and characters in its input — but is most commonly used to only count lines, with the -l option:
wc -l meta/metadata.tsv
23 meta/metadata.tsv
This simple operation can be surprisingly useful, as the number of lines in many file types reflects the number of entries.
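As a side note, wc also accepts multiple file arguments, printing a count for each file followed by a grand total — a small sketch (output not shown):

```bash
# Count lines in two files at once; wc prints one line per file plus a 'total' line:
wc -l meta/metadata.tsv README.md
# [output not shown]
```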
3 Redirection and the pipe
First create and move into a sandbox dir within garrigos-data:
mkdir sandbox
cd sandbox
3.1 Redirection
The regular output of a command that is printed to the screen (like a list of files by ls, or a number of lines by wc -l) is technically called “standard out”, or “stdout” for short. Sometimes, you may want to do something else with this output, like storing it in a file. Luckily, this is easy to do with something called “redirection”.
First, a reminder of what echo does without redirection — it simply prints the text we provide to the screen:
echo "My first line"My first line
Now, redirect echo’s standard out to a new file test.txt using the > operator:
echo "My first line" > test.txtNo output was printed to the screen, because it instead went into the file!:
cat test.txt
My first line
The > operator will redirect output to a file with the following rules:
- If the file doesn’t exist, it will be created.
- If the file does exist, any contents will be overwritten.
Redirect another line into that same file:
echo "My second line" > test.txt
cat test.txt
My second line
That may not have been what we intended! As explained above, the earlier file contents were overwritten. With >>, however, output is appended (added) to a file:
echo "My third line" >> test.txt
cat test.txt
My second line
My third line
3.2 The pipe
Say that you want to count the number of entries (files and subdirs) in a directory. You could do that as follows:
# First redirect 'ls' output to a file:
ls ../fastq > filelist.txt
# Then count the nr. of lines, which is the number of files in ../fastq:
wc -l filelist.txt
44 filelist.txt
That worked, but you needed two separate lines of code, and are left with a file filelist.txt that you probably want to remove, since it has served its sole purpose.
A more convenient way to perform this kind of operation is with a “pipe”, as follows:
ls ../fastq | wc -l
44
With the pipe, the output of the command on the left-hand side (a file listing, in this case) is redirected into the command on the right-hand side (wc in this case). Like many others, the wc command will gladly accept input that way (i.e., via standard in) instead of via a file name argument.
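You can also see a small difference in how wc reports its result: given a file name argument, it prints the count followed by the file name, whereas when reading from standard in it prints only the count. A quick sketch, reusing the filelist.txt created above (output not shown):

```bash
# wc with a file name argument prints the count followed by the file name:
wc -l filelist.txt
# wc reading from standard in (here, via a pipe from cat) prints only the count:
cat filelist.txt | wc -l
# [output not shown]
```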
Pipes are useful because they avoid having to write and read intermediate files — this saves typing, makes the operation quicker, and reduces file clutter. In the example above, you don’t need to make a filelist.txt file to count the number of files. We’ll see pipes in action a bunch more in the next section.
Exercise: the pipe
What do you think the command below does? What will the resulting count represent?
cat ../fastq/ERR10802863_R1.fastq.gz ../fastq/ERR10802863_R2.fastq.gz | wc -l
Click for the solution
Like many Unix commands, cat accepts multiple arguments, i.e. it can operate on multiple files. In this case, cat would simply print the contents of the two files back-to-back (concatenate them!). Therefore, wc -l will count the total number of lines across the two files.
cat ../fastq/ERR10802863_R1.fastq.gz ../fastq/ERR10802863_R2.fastq.gz | wc -l
193612
Thinking one step further: can we get the total number of reads for sample ERR10802863 across its two FASTQ files by dividing the result by 4? (Recall: one FASTQ read covers 4 lines.) No, we can’t, since the files are compressed: compressed and uncompressed line counts for a file are not the same.
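For reference, one way to get an accurate read count is to decompress on the fly with zcat (mentioned earlier) before counting lines — a sketch, with output omitted:

```bash
# Count the lines of a FASTQ file in decompressed form, then divide by 4 to get reads:
zcat ../fastq/ERR10802863_R1.fastq.gz | wc -l
# [output not shown]
```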
4 Unix data tools
We’ll now turn to some commands that may be described as “Unix data tools”. These commands are especially good for relatively simple data processing and summarizing steps, and are excellent in dealing with very large files. They are therefore quite useful when working with sequence data files.
We will cover the following commands:
- grep to search for text in files
- cut to select one or more columns from tabular data
- sort to sort lines, or tabular data by column
- uniq to remove duplicates
We’ll start by taking a look at one of the example data files, and discussing tabular plain-text files.
4.1 Tabular plain-text files and file extensions
The examples below will use the file meta/metadata.tsv, so we’ll first make a copy of that file and take another look at its first lines:
cp -v ../meta/metadata.tsv .
'../meta/metadata.tsv' -> './metadata.tsv'
head metadata.tsv
sample_id time treatment
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802878 10dpi control
ERR10802884 10dpi control
ERR10802877 10dpi control
ERR10802881 10dpi control
ERR10802876 10dpi relictum
“Tabular” files like this one contain data that is arranged in a rows-and-columns format, i.e. as in a table or an Excel spreadsheet. Because plain-text files do not have an intrinsic way to define columns, certain characters are used as column “delimiters” in plain-text tabular files. Most commonly, these are:
- A Tab, and such files are often stored with a .tsv extension for Tab-Separated Values (“TSV file”).
- A comma, and such files are often stored with a .csv extension for Comma-Separated Values (“CSV file”).
A side note on plain-text file extensions like .txt, .csv, and .tsv, but also those for sequence files like .fastq as well as script files like .R or .sh:
Different types of plain text files, like those in the examples above, are fundamentally the same, i.e. “just plain-text files”. Therefore, changing the file extension does not change anything about the file. Instead, different file extensions are used primarily to make it clear to humans (as opposed to the computer) what the file contains.
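As a small demonstration of this point — the copy with a .txt extension below is purely for illustration, and its contents are identical to the original:

```bash
# Copying a TSV file to a '.txt' extension changes nothing about its contents:
cp metadata.tsv metadata.txt
head -n 2 metadata.txt
# [output not shown -- still the same tab-delimited content as metadata.tsv]
```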
4.2 grep to print lines that match a pattern
The grep command is extremely useful and will find specific text (or text patterns) in a file. By default, it will print each line that contains a “match” in full.
Its basic syntax is grep "<pattern>" <file-path>. For example, to print all lines from metadata.tsv that contain “cathemerium”:
grep "cathemerium" metadata.tsvERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802864 24hpi cathemerium
ERR10802867 24hpi cathemerium
ERR10802870 24hpi cathemerium
While not always necessary, it’s good practice to consistently use quotes ("...") around the search pattern, like above.
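One reason the quotes matter: if a pattern contains a space (or another character that is special to the shell), the shell will split or mangle it before grep ever sees it. A hypothetical sketch with a made-up file name:

```bash
# [Don't run this, hypothetical example]
# Without quotes, the shell splits the pattern at the space:
# grep searches for 'sample' and treats 'list' as a (likely nonexistent) file name
grep sample list my-notes.txt

# With quotes, the whole phrase is passed to grep as a single search pattern:
grep "sample list" my-notes.txt
```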
Instead of printing matching lines, you can also count them with the -c option. For example, how many control samples are in the dataset?
grep -c "control" metadata.tsv7
The option -v inverts grep’s behavior and prints all lines that do not match the pattern. For example, you can combine -v and -c to count lines that do not contain the text “control”:
grep -vc "control" metadata.tsv16
Exercise: grep
What output do you expect from the command below? Next, check if you were correct by executing the command.
grep -vc "contro" metadata.tsvSo, how does grep’s behavior differ from using the * shell wildcard?
Click for the solutions
The command gives the same output as the previous example, i.e. it successfully matches the lines with control:
grep -vc "contro" metadata.tsv16
While the initial examples also implied this, the example above should make it abundantly clear that grep does not need to match entire lines or even entire words.
This behavior is very different from globbing with the * wildcard, where the pattern has to match the entire file name!
grep options
grep has many other useful options, such as:
- -i to ignore case (uppercase vs. lowercase) when searching
- -n to print the line number for each matching line
- -A <n> and -B <n> to print n lines after and before each match
- -w to make a pattern match whole “words” only
- -r to search files in directories recursively (note that even without -r, grep can operate on multiple files)
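For instance, here’s a small sketch trying a couple of these options on the metadata file (output not shown):

```bash
# -n prefixes each matching line with its line number,
# and -w only matches 'control' where it occurs as a whole word:
grep -n -w "control" metadata.tsv

# -i ignores case, so this also matches the lowercase 'relictum' in the file:
grep -i "RELICTUM" metadata.tsv
# [output not shown]
```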
4.3 Operating on compressed files
grep also has a counterpart for compressed files: zgrep, which otherwise works identically.
# Search for long non-coding RNAs (lncRNA) in the GTF file:
zgrep "lncRNA" ../ref/GCF_016801865.2.gtf.gz | head -n 2NC_068937.1 Gnomon gene 76295 82374 . - . gene_id "LOC128092783"; transcript_id ""; db_xref "GeneID:128092783"; description "uncharacterized LOC128092783"; gbkey "Gene"; gene "LOC128092783"; gene_biotype "lncRNA";
NC_068937.1 Gnomon gene 82671 86331 . + . gene_id "LOC120427727"; transcript_id ""; db_xref "GeneID:120427727"; description "uncharacterized LOC120427727"; gbkey "Gene"; gene "LOC120427727"; gene_biotype "lncRNA";
Another idiom is to pipe the output of zcat to grep, or to any other command! For example:
# Search for carbonic anhydrase-related proteins in the GTF file:
zcat ../ref/GCF_016801865.2.gtf.gz | grep "carbonic anhydrase-related protein" | head -n 2
NC_068937.1 Gnomon gene 398255 486969 . - . gene_id "LOC120422535"; transcript_id ""; db_xref "GeneID:120422535"; description "carbonic anhydrase-related protein 10"; gbkey "Gene"; gene "LOC120422535"; gene_biotype "protein_coding";
NC_068937.1 Gnomon transcript 398255 486969 . - . gene_id "LOC120422535"; transcript_id "XM_039585997.2"; db_xref "GeneID:120422535"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gbkey "mRNA"; gene "LOC120422535"; model_evidence "Supporting evidence includes similarity to: 10 Proteins"; product "carbonic anhydrase-related protein 10, transcript variant X1"; transcript_biotype "mRNA";
In this week’s exercises and next week’s assignment, you’ll further explore this GTF file.
4.4 Selecting columns using cut
The cut command selects, or we could say “cuts out”, one or more columns from a tabular file. You always have to use its -f (“field”) option to specify the desired column(s):
# Select the second column of the file:
cut -f 2 metadata.tsv
time
10dpi
10dpi
10dpi
10dpi
10dpi
10dpi
[...output truncated...]
In many cases, with cut and other commands alike, it can be useful to pipe the output to head to quickly see if your command works without printing a large number (sometimes thousands) of lines:
cut -f 2 metadata.tsv | head -n 3
time
10dpi
10dpi
To select multiple columns, use a comma-delimited list, or a range with -:
cut -f 1,3 metadata.tsv | head -n 3
sample_id treatment
ERR10802882 cathemerium
ERR10802875 cathemerium
cut -f 1-2 metadata.tsv | head -n 3
sample_id time
ERR10802882 10dpi
ERR10802875 10dpi
Note that it is not possible to change the order of columns with cut!
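To illustrate that point, the two commands in the sketch below produce identically ordered output — the column order in the file always wins, no matter how you list the columns after -f (output not shown):

```bash
# Both commands print the sample_id column first and the treatment column second:
cut -f 1,3 metadata.tsv | head -n 2
cut -f 3,1 metadata.tsv | head -n 2
# [output not shown]
```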
cut
The default column delimiter that cut expects is a Tab. Therefore, we didn’t have to specify the delimiter in the above examples. But when a file has a different column delimiter, use the -d option – e.g. -d , for a CSV file:
# [Don't run this, hypothetical example]
# Select the second column in a comma-delimited (CSV) file:
cut -d , -f 2 my-data.csv
4.5 Combining cut, sort, and uniq to create a list
Say you want an alphabetically sorted list of the different treatments that appear in metadata.tsv. (That’s a bit trivial because this file is small enough to see that information at a glance, but you could also use the following code to operate on a huge genome annotation file, as you’ll do later.)
To do this, you’ll need to learn about two new commands:
- sort to sort/order/arrange rows, by default in alphanumeric order
- uniq to remove duplicates (i.e., keep only distinct/unique entries) from a sorted file/list
We’ll build up a small “pipeline” to do this step-by-step, piping the output into head at every step. First, get rid of the header line with the previously mentioned tail trick:
tail -n +2 metadata.tsv | head
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802878 10dpi control
ERR10802884 10dpi control
ERR10802877 10dpi control
ERR10802881 10dpi control
ERR10802876 10dpi relictum
ERR10802880 10dpi relictum
Second, select the column of interest with cut:
tail -n +2 metadata.tsv | cut -f 3 | head
cathemerium
cathemerium
cathemerium
cathemerium
control
control
control
control
relictum
relictum
Third, use sort to alphabetically sort the result:
tail -n +2 metadata.tsv | cut -f 3 | sort | head
cathemerium
cathemerium
cathemerium
cathemerium
cathemerium
cathemerium
cathemerium
control
control
control
Finally, use uniq to keep only unique (distinct) values – and get rid of head since we now have our final command:
tail -n +2 metadata.tsv | cut -f 3 | sort | uniq
cathemerium
control
relictum
Great!
Generating a count table
With a very small modification to this pipeline, you can generate a “count table” instead of a simple list! You just have to add uniq’s -c option (for count):
tail -n +2 metadata.tsv | cut -f 3 | sort | uniq -c
7 cathemerium
7 control
8 relictum
And this count table can in turn be sorted by most frequent occurrence — to do this, use sort’s -n option for numeric sorting together with -r for reverse sorting so the largest numbers go first:
tail -n +2 metadata.tsv | cut -f 3 | sort | uniq -c | sort -nr
8 relictum
7 control
7 cathemerium
Above, we used sort to simply sort a list. More generally, sort will by default perform sorting based on the entire line. To sort based on one or more columns, the way you’ve probably done in Excel, use the -k option — for example:
# Sort based on the third column:
tail -n +2 metadata.tsv | sort -k3
-k takes a start and a stop column to sort by, so if you want to sort strictly based on one column at a time, use this syntax:
# Explicitly indicate to sort based on the third column only:
tail -n +2 metadata.tsv | sort -k3,3
To sort first by one column and then by another, use -k multiple times:
# Sort first by the first column, then break ties using the second column:
tail -n +2 metadata.tsv | sort -k3,3 -k2,2
Exercise
Using commands similar to the last examples above, check whether:
- The 10dpi and 24hpi time points have the same number of samples
Click for the solution
No, there are 12 10dpi samples and 10 24hpi samples:
tail -n +2 metadata.tsv | cut -f 2 | sort | uniq -c
12 10dpi
10 24hpi
- Any duplicate sample IDs are present in metadata.tsv
Click for the solution
No duplicate sample IDs are present.
The most elegant way to see this is by using sort -nr | head -n1 at the end: if the single sample shown has a count of 1, this means there are no duplicates.
tail -n +2 metadata.tsv | cut -f 1 | sort | uniq -c | sort -nr | head -n 1
1 ERR10802886
Omitting the head -n1 or even the sort -nr would also work in this case.
5 The Unix philosophy and data streaming
Today, and in particular with these Unix data tool examples, we saw the Unix philosophy in action:
This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
— Doug McIlroy
Some advantages of a modular approach are that it’s easier to spot errors, to swap out components, and to learn.
Text/data streaming
What about those text “streams” mentioned in the quote above? Rather than loading entire files into memory, Unix tools process them one line at a time.
This is very useful when working with large files in particular! For example, head will print results instantly even for a 10 TB file that could never be loaded into memory. By contrast, try to take a peek at such a file with a GUI text editor, and your computer will likely grind to a halt or crash.
Here is one other example, combining the cat command with globbing and redirection. This command would concatenate (combine) all your R1 FASTQ files into a single file. This is both elegantly short code, and will run quickly and without using much computer memory!
# [Don't run this - for illustration purposes only]
cat fastq/*R1.fastq.gz > all_R1.fastq.gz
- FASTQ files can (in principle) be concatenated freely since every read is a “stand-alone” unit of 4 lines.
- Gzipped (.gz) files can also be concatenated freely!
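If you wanted to convince yourself of that, a sketch like the one below (reusing the all_R1.fastq.gz from the example above) would show that the concatenated file is still a valid gzip file that zcat can read:

```bash
# [Don't run this - for illustration only]
# The concatenation of gzipped files is itself a valid gzipped file:
zcat all_R1.fastq.gz | head -n 4
```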
Don’t redirect back to the input file!
One potential drawback of the “text streams” mode of operation is that you can’t redirect the output of Unix commands “back” to the input file: the shell truncates (empties) the output file before the command starts reading it, so doing this will corrupt the input file.
# [Don't run this - for illustration only]
# You should NEVER run something like this:
sort metadata.tsv > metadata.tsv
Therefore, if you really want to edit the original file instead of creating a separate edited copy, you will need two steps:
- Redirect to a new, edited copy of the input file
- Rename the copy to overwrite the original file
# [Don't run this - for illustration only]
# Step 1 - redirect to a new file:
sort metadata.tsv > metadata_sorted.tsv
# Step 2 - rename the new file to overwrite the original (if really needed!)
mv metadata_sorted.tsv metadata.tsv
However, our general recommendation is that whenever you edit files, you keep both the original and the edited version – so you won’t typically need the above roundabout method.