Working with files in the Unix shell II:
Viewing, summarizing and manipulating files

Week 3 – lecture C

Author: Jelmer Poelstra
Affiliation: CFAES Bioinformatics Core, The Ohio State University
Published: September 5, 2025



1 Introduction

1.1 Context & overview

The Unix shell is great for many tasks other than the file browser-like ones you learned about in the previous lecture. One of those is performing basic viewing, querying, and editing operations on text files.

Recall from the Project File Organization lecture that you’ll work almost exclusively with so-called “plain-text” files in this course, such as:

  • Sequence file formats such as FASTA, FASTQ, and GTF
  • Spreadsheet-like/tabular files stored as TSV/CSV
  • Scripts and code notebooks
  • Documentation files such as Markdown

The commands we’ll discuss here can be used with plain-text files but not with “binary” formats like Excel or Word files.

1.2 Learning goals

In this lecture, you will learn to use the Unix shell to:

  • View the contents of text files in various ways
  • Search within, extract information from, and manipulate text files

1.3 Getting ready

  • Start a VS Code session like before:

    Click here to see the instructions

    1. Log in to OSC’s OnDemand portal at https://ondemand.osc.edu
    2. In the blue top bar, select Interactive Apps and near the bottom, click Code Server
    3. Fill out the form as follows:
      • Cluster: pitzer
      • Account: PAS2880
      • Number of hours: 2
      • Working Directory: /fs/ess/PAS2880/users/<username> (replace <username> with your user name)
      • App Code Server version: 4.8.3
    4. Click Launch
    5. Click the Connect to VS Code button once it appears
    6. In VS Code, open a terminal by clicking ☰ => Terminal => New Terminal¹
    7. Check that you are in /fs/ess/PAS2880/users/$USER by typing pwd in the terminal.
      (Recall that $USER is a variable that represents your username. If you’re not in that dir, it may be listed under Recents in the Get Started document – if so, click on that entry. Otherwise, click File > Open Folder and type/select /fs/ess/PAS2880/users/$USER.)
  • Open a Markdown file for notes:

    1. Click File > New File
    2. Save the file inside /fs/ess/PAS2880/users/$USER/week03, e.g. as lectureC.md
  • Change your working dir:

    cd garrigos-data

2 Viewing the contents of text files

Several commands can view all or part of one or more text files, and we’ll discuss the most common ones below.

2.1 cat

The cat command prints the entire contents of one or more files to screen:

cat README.md
# README for `garrigos-data`

- Author: Jelmer Poelstra
- Affiliation: CFAES Bioinformatics Core, The Ohio State University
- Contact: <poelstra.1@osu.edu>
- URL: <https://github.com/jelmerp/garrigos-data>
- Date: 2024-01-20 (last updated: 2025-09-09)

This directory contains files associated with the paper
“Two avian _Plasmodium_ species trigger different transcriptional responses on their vector _Culex pipiens_”
([Garrigós et al. 2025, Molecular Ecology](https://doi.org/10.1111/mec.17240)).
The files are intended for practice purposes in the context of coursework and
other tutorials on omics / RNA-Seq data analysis.

Below follows a description of the files included in this directory.

## FASTQ files (in sub-directory `data/fastq`)

The FASTQ files are Illumina RNA-seq reads from _Culex pipiens_ samples.
These were downloaded from the European Nucleotide Archive ENA database using
accession number `PRJEB41609` and the tool 
[`fastq-dl`](https://github.com/rpetit3/fastq-dl) v3.0.1 on 2024-01-20.

To simplify the dataset for practice purposes, the following modifications were made:

- FASTQ files from a number of samples were removed:
  - 2 samples also excluded in the study itself (see the paper for details)
  - All samples from the 21-day time point.
- FASTQ files were randomly "subset" to keep only 500,000 reads per file using the tool
  [`seqtk`](https://github.com/lh3/seqtk) v1.3-r106.

## Metadata (in sub-directory `data/meta`)

Metadata from the study was downloaded from <https://doi.org/10.20350/digitalCSIC/15708>
and simplified to keep only:

- The sample ID and treatment columns
- The samples for which the FASTQ files were retained (see above).

## Reference annotation file (in sub-directory `data/ref`)

A reference genome GTF file for the RefSeq annotation of
_Culex pipiens_ genome `TS_CPP_V2` (`GCF_016801865.2`) was downloaded from NCBI
using the [NCBI Datasets tool](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/)
v15.31.1 on 2024-01-20.

For the analysis of the data, the reference genome FASTA file is also needed.
This is not included here,
but can be downloaded from NCBI using the following command:

```bash
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/801/865/GCF_016801865.2_TS_CPP_V2/GCF_016801865.2_TS_CPP_V2_genomic.fna.gz
```
cat meta/metadata.tsv
sample_id       time    treatment
ERR10802882     10dpi   cathemerium
ERR10802875     10dpi   cathemerium
ERR10802879     10dpi   cathemerium
ERR10802883     10dpi   cathemerium
ERR10802878     10dpi   control
ERR10802884     10dpi   control
ERR10802877     10dpi   control
ERR10802881     10dpi   control
ERR10802876     10dpi   relictum
ERR10802880     10dpi   relictum
ERR10802885     10dpi   relictum
ERR10802886     10dpi   relictum
ERR10802864     24hpi   cathemerium
ERR10802867     24hpi   cathemerium
ERR10802870     24hpi   cathemerium
ERR10802866     24hpi   control
ERR10802869     24hpi   control
ERR10802863     24hpi   control
ERR10802871     24hpi   relictum
ERR10802874     24hpi   relictum
ERR10802865     24hpi   relictum
ERR10802868     24hpi   relictum

2.2 head and tail

Some files, especially when working with omics data, can be huge – so printing the whole file with cat is not always ideal. The twin commands head and tail can be useful, as they print only the first (head) or last (tail) lines of a file.

By default, both head and tail print 10 lines:

head meta/metadata.tsv
sample_id       time    treatment
ERR10802882     10dpi   cathemerium
ERR10802875     10dpi   cathemerium
ERR10802879     10dpi   cathemerium
ERR10802883     10dpi   cathemerium
ERR10802878     10dpi   control
ERR10802884     10dpi   control
ERR10802877     10dpi   control
ERR10802881     10dpi   control
ERR10802876     10dpi   relictum

Use the -n option to specify the number of lines to print:

head -n 3 meta/metadata.tsv
sample_id       time    treatment
ERR10802882     10dpi   cathemerium
ERR10802875     10dpi   cathemerium

A neat trick with tail is to start at a specific line, -n +<starting-line>. This is often used to skip the header line:

# '-n +2' tells tail to start at line 2:
tail -n +2 meta/metadata.tsv
ERR10802882     10dpi   cathemerium
ERR10802875     10dpi   cathemerium
ERR10802879     10dpi   cathemerium
ERR10802883     10dpi   cathemerium
ERR10802878     10dpi   control
# [...output truncated...]

Next, let’s try to take a peek inside a FASTQ file. To print the first 8 lines, corresponding to 2 reads, use -n 8 with head:

head -n 8 fastq/ERR10802863_R1.fastq.gz
�
Խے�8�E��_1f�"�QD�J��D�fs{����Yk����d��*��
|��x���l޴�j�N������?������ٔ�bUs�Ng�Ǭ���i;_��������������|<�v����3��������|���ۧ��3ĐHyƕ�bIΟD�%����Sr#~��7��ν��1y�Ai,4
w\]"b�#Q����8��+[e�3d�4H���̒�l�9LVMX��U*�M����_?���\["��7�s\<_���:�$���N��v�}^����sw�|�n;<�<�oP����
i��k��q�ְ(G�ϫ��L�^��=��<���K��j�_/�[ۭV�ns:��U��G�z�ݎ�j����&��~�F��٤ZN�'��r2z}�f\#��:�9$�����H�݂�"�@M����H�C�
�0�pp���1�O��I�H�P됄�.Ȣe��Q�>���
�'�;@D8���#��St�7k�g��|�A䉻���_���d�_c������a\�|�_�mn�]�9N������l�٢ZN�c�9u�����n��n�`��
"gͺ�
    ���H�?2@�FC�S$n���Ԓh�       nԙj��望��f      �?N@�CzUlT�&�h�Pt!�r|��9~)���e�A�77�h{��~��     ��
# [...output truncated...]
Ouch! 😳 What went wrong here?

Click for the solution

We were presented with the raw contents of the compressed (binary) file, which isn’t human-readable.

To get around the problem you just encountered with head, you might be inclined to decompress these files (which you could do with the gunzip command – week 5). However, at least when it comes to FASTQ files, it is better to keep them compressed:

  • Uncompressed files take up much more disk storage space than compressed ones
  • Almost any bioinformatics program accepts compressed FASTQ files
  • You can view these files in compressed form, as shown below with less. Additionally, several commands including cat have a counterpart for compressed files (zcat in the case of cat).
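
For example, here’s a quick sketch of zcat combined with head (the “pipe” symbol | used here is formally introduced in section 3):

# Print the first 8 lines (2 reads) of the compressed FASTQ file in readable form:
zcat fastq/ERR10802863_R1.fastq.gz | head -n 8
# [output not shown]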

Other sequence files, like assembly FASTA files and annotation GTF/GFF files, are often also compressed when you download them. These types of files, though, are more commonly decompressed before usage — e.g. because they don’t take up nearly as much space as a set of FASTQ files.

2.3 less: A file pager

The less command is rather different from the previous commands, which simply printed file contents to the screen and gave us our shell prompt back. Instead, less will open a file for you to browse through, and you need to explicitly quit the program to get your prompt back.

Also, less will automatically display gzip-compressed files in human-readable form: 🥳

less fastq/ERR10802863_R1.fastq.gz
@ERR10802863.8435456 8435456 length=74
CAACGAATACATCATGTTTGCGAAACTACTCCTCCTCGCCTTGGTGGGGATCAGTACTGCGTACCAGTATGAGT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@ERR10802863.27637245 27637245 length=74
GCCACACTTTTGAAGAACAGCGTCATTGTTCTTAATTTTGTCGGCAACGCCTGCACGAGCCTTCCACGTAAGTT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@ERR10802863.10009244 10009244 length=73
CTCGGCGTTAACTTCATCACGCAGATCATTCCGTTCCAGCAGCTGAAGCAAGACTACCGTCAGTACGAGATGA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

When lines are too long to fit on your computer screen, less will by default “wrap” them onto a next line on the screen. That’s convenient for reading, but can be confusing for tabular formats and formats like FASTQ, because it means that one line on the screen no longer corresponds to one line in the file.

With the -S option to less, lines will not be wrapped but will “run out of the screen” on the right-hand side (press the rightward arrow to see that part):

less -S fastq/ERR10802863_R1.fastq.gz
# [output not shown]

When you’re inside the less pager, you can move around in the file in several ways:

  • By scrolling with your mouse
  • With up and down arrows: move line-by-line
  • With u (up) and d (down): move half a page at a time
  • If you have them, with PgUp and PgDn keys: move page-by-page
  • By pressing G to go to the end of the file, and g to go (back) to the top

To exit/quit less and get your shell prompt back, simply type q:

# Type 'q' to exit the less pager:
q

Exercise: less

With less, explore the FASTQ file a bit. Do you notice any unusual-looking reads?

Click for the solution

A number of reads are much shorter than the others, and only consist of Ns, i.e. uncalled bases. For example:

@ERR10802863.11918285 11918285 length=35
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###################################

2.4 wc -l to count lines

(This command doesn’t display file contents and would fit better in the section on Unix data tools below, but needs to be introduced for the next bit.)

The wc command by default counts the number of lines, words, and characters in its input — but is most commonly used to only count lines, with the -l option:

wc -l meta/metadata.tsv
23 meta/metadata.tsv

This simple operation can be surprisingly useful, as the number of lines in many file types reflects the number of entries (here: one header line plus 22 samples).

3 Redirection and the pipe

First create and move into a sandbox dir within garrigos-data:

mkdir sandbox
cd sandbox

3.1 Redirection

The regular output of a command that is printed to the screen (like a list of files by ls, or a number of lines by wc -l) is technically called “standard out”, or “stdout” for short. Sometimes, you may want to do something else with this output, like storing it in a file. Luckily, this is easy to do with something called “redirection”.

First, a reminder of what echo does without redirection — it simply prints the text we provide to the screen:

echo "My first line"
My first line

Now, redirect echo’s standard out to a new file test.txt using the > operator:

echo "My first line" > test.txt

No output was printed to the screen, because it went into the file instead:

cat test.txt
My first line

The > operator will redirect output to a file with the following rules:

  • If the file doesn’t exist, it will be created.
  • If the file does exist, any contents will be overwritten.

Redirect another line into that same file:

echo "My second line" > test.txt
cat test.txt
My second line

That may not have been what we intended! As explained above, the earlier file contents were overwritten. With >>, however, output is appended (added) to a file:

echo "My third line" >> test.txt
cat test.txt
My second line
My third line
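
Appending is handy when building up a file in several steps. For example, a hypothetical sketch of a simple run log (date is a standard command that prints the current date and time):

# Record a timestamp and a progress message in a log file:
date >> log.txt
echo "Finished step 1" >> log.txt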

3.2 The pipe

Say that you want to count the number of entries (files and subdirs) in a directory. You could do that as follows:

# First redirect 'ls' output to a file:
ls ../fastq > filelist.txt

# Then count the nr. of lines, which is the number of files in ../fastq:
wc -l filelist.txt
44 filelist.txt

That worked, but you needed two separate lines of code, and are left with a file filelist.txt that you probably want to remove, since it has served its sole purpose.

A more convenient way to perform this kind of operation is with a “pipe”, as follows:

ls ../fastq | wc -l
44

With the pipe, the output of the command on the left-hand side (a file listing, in this case) is redirected into the command on the right-hand side (wc in this case). Like many others, the wc command will gladly accept input that way (i.e., via standard in) instead of via a file name argument.

Pipes are useful because they avoid having to write/read intermediate files — this saves typing, makes the operation quicker, and reduces file clutter. In the example above, you don’t need to make a filelist.txt file to count the number of files. We’ll see pipes in action a bunch more in the next section.
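
Here is one more sketch combining globbing with the pipe (this assumes paired R1/R2 FASTQ files, as in this dataset):

# Count only the R1 files in the dir:
ls ../fastq/*R1* | wc -l
# [output not shown]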

Exercise: the pipe

What do you think the command below does? What will the resulting count represent?

cat ../fastq/ERR10802863_R1.fastq.gz ../fastq/ERR10802863_R2.fastq.gz | wc -l
Click for the solution

Like many Unix commands, cat accepts multiple arguments, i.e. it can operate on multiple files. In this case, cat would simply print the contents of the two files back-to-back (concatenate them!). Therefore, wc -l will count the total number of lines across the two files.

cat ../fastq/ERR10802863_R1.fastq.gz ../fastq/ERR10802863_R2.fastq.gz | wc -l
193612

Thinking one step further: can we get the total number of reads for sample ERR10802863 across its two FASTQ files, by dividing the result by 4? (Recall: one FASTQ read covers 4 lines.) No, we can’t, since the files are compressed: compressed and uncompressed line counts for a file are not the same.
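
If you do want the read count, one option (a sketch using zcat, mentioned earlier) is to count the lines of the decompressed stream instead:

# Count lines after on-the-fly decompression, then divide the result by 4:
zcat ../fastq/ERR10802863_R1.fastq.gz ../fastq/ERR10802863_R2.fastq.gz | wc -l
# [output not shown]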

4 Unix data tools

We’ll now turn to some commands that may be described as “Unix data tools”. These commands are especially good for relatively simple data processing and summarizing steps, and they excel at dealing with very large files. They are therefore quite useful when working with sequence data files.

We will cover the following commands:

  • grep to search for text in files
  • cut to select one or more columns from tabular data
  • sort to sort lines, or tabular data by column
  • uniq to remove duplicates

We’ll start by taking a look at one of the example data files, and discussing tabular plain-text files.

4.1 Tabular plain-text files and file extensions

The examples below will use the file meta/metadata.tsv, so we’ll first make a copy of that file and take another look at its first lines:

cp -v ../meta/metadata.tsv .
'../meta/metadata.tsv' -> './metadata.tsv'
head metadata.tsv
sample_id       time    treatment
ERR10802882     10dpi   cathemerium
ERR10802875     10dpi   cathemerium
ERR10802879     10dpi   cathemerium
ERR10802883     10dpi   cathemerium
ERR10802878     10dpi   control
ERR10802884     10dpi   control
ERR10802877     10dpi   control
ERR10802881     10dpi   control
ERR10802876     10dpi   relictum

“Tabular” files like this one contain data that is arranged in a rows-and-columns format, i.e. as in a table or an Excel spreadsheet. Because plain-text files do not have an intrinsic way to define columns, certain characters are used as column “delimiters” in plain-text tabular files. Most commonly, these are:

  • A Tab, and such files are often stored with a .tsv extension for Tab-Separated Values (“TSV file”).
  • A comma, and such files are often stored with a .csv extension for Comma-Separated Values (“CSV file”).
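
Because Tabs are invisible on screen, here’s a quick sketch using GNU cat’s -A option, which displays Tabs as ^I (and line endings as $):

# Make the Tab delimiters visible as '^I':
cat -A metadata.tsv | head -n 2
# [output not shown]
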
Plain-text file extensions are flexible and mostly for human readability

A side note on plain-text file extensions like .txt, .csv, and .tsv, but also those for sequence files like .fastq as well as script files like .R or .sh:

Different types of plain text files, like those in the examples above, are fundamentally the same, i.e. “just plain-text files”. Therefore, changing the file extension does not change anything about the file. Instead, different file extensions are used primarily to make it clear to humans (as opposed to the computer) what the file contains.
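
For example, a hypothetical sketch: copying the TSV file under a different extension changes nothing about how commands treat it:

# The extension is just a label - the contents are identical:
cp metadata.tsv metadata.txt
head -n 2 metadata.txt
# [output not shown; identical to 'head -n 2 metadata.tsv']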

4.2 grep to print lines that match a pattern

The grep command is extremely useful and will find specific text (or text patterns) in a file. By default, it will print each line that contains a “match” in full.

Its basic syntax is grep "<pattern>" <file-path>. For example, to print all lines from metadata.tsv that contain “cathemerium”:

grep "cathemerium" metadata.tsv
ERR10802882     10dpi   cathemerium
ERR10802875     10dpi   cathemerium
ERR10802879     10dpi   cathemerium
ERR10802883     10dpi   cathemerium
ERR10802864     24hpi   cathemerium
ERR10802867     24hpi   cathemerium
ERR10802870     24hpi   cathemerium

While not always necessary, it’s good practice to consistently use quotes ("...") around the search pattern like above².

Instead of printing matching lines, you can also count them with the -c option. For example, how many control samples are in the dataset?

grep -c "control" metadata.tsv
7

The option -v inverts grep’s behavior and prints all lines that do not match the pattern. For example, you can combine -v and -c to count lines that do not contain the text “control”:

grep -vc "control" metadata.tsv
16

Exercise: grep

What output do you expect from the command below? Next, check if you were correct by executing the command.

grep -vc "contro" metadata.tsv

So, how does grep’s behavior differ from using the * shell wildcard?

Click for the solutions

The command gives the same output as the previous example, i.e. the pattern contro successfully matches the lines containing control:

grep -vc "contro" metadata.tsv
16

While the initial examples also implied this, the above example should make it abundantly clear that grep does not need to match entire lines or words.

This behavior is very different from globbing with the * wildcard, where the pattern has to match the entire file name!

Additional useful grep options
  • grep has many other useful options (see the sketch after this list), such as:
    • -i to ignore case (uppercase vs. lowercase) when searching
    • -n to print the line number for each matching line
    • -A <n> and -B <n> to print n lines after and before each match
    • -w to make a pattern match whole “words” only
    • -r to search files in directories recursively (note that even without -r, grep can operate on multiple files)
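
For instance, a quick sketch combining a few of these options:

# Case-insensitively match the whole word "control", printing line numbers:
grep -inw "control" metadata.tsv
# [output not shown]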

4.3 Operating on compressed files

grep also has a counterpart for compressed files: zgrep, which otherwise works identically.

# Search for long non-coding RNAs (lncRNA) in the GTF file:
zgrep "lncRNA" ../ref/GCF_016801865.2.gtf.gz | head -n 2
NC_068937.1     Gnomon  gene    76295   82374   .       -       .       gene_id "LOC128092783"; transcript_id ""; db_xref "GeneID:128092783"; description "uncharacterized LOC128092783"; gbkey "Gene"; gene "LOC128092783"; gene_biotype "lncRNA"; 
NC_068937.1     Gnomon  gene    82671   86331   .       +       .       gene_id "LOC120427727"; transcript_id ""; db_xref "GeneID:120427727"; description "uncharacterized LOC120427727"; gbkey "Gene"; gene "LOC120427727"; gene_biotype "lncRNA"; 

Another idiom is to pipe the output of zcat to grep, or to any other command! For example:

# Search for carbonic anhydrase-related proteins in the GTF file:
zcat ../ref/GCF_016801865.2.gtf.gz | grep "carbonic anhydrase-related protein" | head -n 2
NC_068937.1     Gnomon  gene    398255  486969  .       -       .       gene_id "LOC120422535"; transcript_id ""; db_xref "GeneID:120422535"; description "carbonic anhydrase-related protein 10"; gbkey "Gene"; gene "LOC120422535"; gene_biotype "protein_coding"; 
NC_068937.1     Gnomon  transcript      398255  486969  .       -       .       gene_id "LOC120422535"; transcript_id "XM_039585997.2"; db_xref "GeneID:120422535"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gbkey "mRNA"; gene "LOC120422535"; model_evidence "Supporting evidence includes similarity to: 10 Proteins"; product "carbonic anhydrase-related protein 10, transcript variant X1"; transcript_biotype "mRNA"; 
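
As with grep, you can also count matches in a compressed file. For example, a sketch counting lines that mention lncRNA (note: this counts matching lines, not genes, since each gene spans multiple GTF lines):

zgrep -c "lncRNA" ../ref/GCF_016801865.2.gtf.gz
# [output not shown]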

In this week’s exercises and next week’s assignment, you’ll further explore this GTF file.

4.4 Selecting columns using cut

The cut command selects, or we could say “cuts out”, one or more columns from a tabular file. You always have to use its -f (“field”) option to specify the desired column(s):

# Select the second column of the file:
cut -f 2 metadata.tsv
time
10dpi
10dpi
10dpi
10dpi
10dpi
10dpi
[...output truncated...]

In many cases, with cut and other commands alike, it can be useful to pipe the output to head to quickly see if your command works without printing a large number (sometimes thousands) of lines:

cut -f 2 metadata.tsv | head -n 3
time
10dpi
10dpi

To select multiple columns, use a comma-delimited list, or a range with -:

cut -f 1,3 metadata.tsv | head -n 3
sample_id       treatment
ERR10802882     cathemerium
ERR10802875     cathemerium
cut -f 1-2 metadata.tsv | head -n 3
sample_id       time
ERR10802882     10dpi
ERR10802875     10dpi

Note that it is not possible to change the order of columns with cut!
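
For instance, a quick sketch: asking for columns 3 and 1 still returns them in the order they appear in the file:

cut -f 3,1 metadata.tsv | head -n 2
# [output not shown; identical to 'cut -f 1,3 metadata.tsv | head -n 2']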

Specifying the delimiter for cut

The default column delimiter that cut expects is a Tab. Therefore, we didn’t have to specify the delimiter in the above examples. But when a file has a different column delimiter, use the -d option – e.g. -d , for a CSV file:

# [Don't run this, hypothetical example]
# Select the second column in a comma-delimited (CSV) file:
cut -d , -f 2 my-data.csv

4.5 Combining cut, sort, and uniq to create a list

Say you want an alphabetically sorted list of the different treatments that appear in metadata.tsv. (That’s a bit trivial because this file is small enough to see that information at a glance, but you could use the following code also to operate on a huge genome annotation file, as you’ll do later.)

To do this, you’ll need to learn about two new commands:

  • sort to sort/order/arrange rows, by default in alphanumeric order.
  • uniq to remove duplicates (i.e., keep all distinct/unique entries) from a sorted file/list. uniq only collapses adjacent duplicate lines, which is why its input should be sorted first.

We’ll build up a small “pipeline” to do this step by step, piping the output into head at every step. First, get rid of the header line with the previously mentioned tail trick:

tail -n +2 metadata.tsv | head
ERR10802882     10dpi   cathemerium
ERR10802875     10dpi   cathemerium
ERR10802879     10dpi   cathemerium
ERR10802883     10dpi   cathemerium
ERR10802878     10dpi   control
ERR10802884     10dpi   control
ERR10802877     10dpi   control
ERR10802881     10dpi   control
ERR10802876     10dpi   relictum
ERR10802880     10dpi   relictum

Second, select the column of interest with cut:

tail -n +2 metadata.tsv | cut -f 3 | head
cathemerium
cathemerium
cathemerium
cathemerium
control
control
control
control
relictum
relictum

Third, use sort to alphabetically sort the result:

tail -n +2 metadata.tsv | cut -f 3 | sort | head
cathemerium
cathemerium
cathemerium
cathemerium
cathemerium
cathemerium
cathemerium
control
control
control

Finally, use uniq to keep only unique (distinct) values – and get rid of head since we now have our final command:

tail -n +2 metadata.tsv | cut -f 3 | sort | uniq
cathemerium
control
relictum

Great!

Generating a count table

With a very small modification to this pipeline, you can generate a “count table” instead of a simple list! You just have to add uniq’s -c option (for count):

tail -n +2 metadata.tsv | cut -f 3 | sort | uniq -c
      7 cathemerium
      7 control
      8 relictum
Sorting the count table

And this count table can in turn be sorted by most frequent occurrence — to do this, use sort’s -n option for numeric sorting together with -r for reverse sorting so the largest numbers go first:

tail -n +2 metadata.tsv | cut -f 3 | sort | uniq -c | sort -nr
      8 relictum
      7 control
      7 cathemerium

Above, we used sort to simply sort a list. More generally, sort will by default perform sorting based on the entire line. To sort based on one or more columns, the way you’ve probably done in Excel, use the -k option — for example:

# Sort based on the third column:
tail -n +2 metadata.tsv | sort -k3

-k takes a start and a stop column to sort by, so if you want to strictly sort only based on one column at a time, use this syntax:

# Explicitly indicate to sort based on the third column only:
tail -n +2 metadata.tsv | sort -k3,3

To sort first by one column and then by another, use -k multiple times:

# Sort first by the third column, then break ties using the second column:
tail -n +2 metadata.tsv | sort -k3,3 -k2,2

Exercise

Using commands similar to the last examples above, check whether:

  • The 10dpi and 24hpi time points have the same number of samples
Click for the solution

No, there are 12 10dpi samples and 10 24hpi samples:

tail -n +2 metadata.tsv | cut -f 2 | sort | uniq -c
  12 10dpi
  10 24hpi
  • Any duplicate sample IDs are present in metadata.tsv
Click for the solution

No duplicate sample IDs are present.

The most elegant way to see this is by using sort -nr | head -n1 at the end: if the single sample shown has a count of 1, this means there are no duplicates.

tail -n +2 metadata.tsv | cut -f 1 | sort | uniq -c | sort -nr | head -n 1
1 ERR10802886

Omitting the head -n1 or even the sort -nr would also work in this case.

5 The Unix philosophy and data streaming

Today, and in particular with these Unix data tool examples, we saw the Unix philosophy in action:

This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
— Doug McIlroy

Some advantages of a modular approach are that it’s easier to spot errors, to swap out components, and to learn.

Text/data streaming

What about those text “streams” mentioned in the quote above? Rather than loading entire files into memory, Unix tools process them one line at a time.

This is very useful when working with large files in particular! For example, head will print results instantly even for a 10 TB file that could never be loaded into memory. By contrast, try to take a peek at such a file with a GUI text editor, and your computer will likely freeze or crash.

Here is one other example, combining the cat command with globbing and redirection. This command would concatenate (combine) all your R1 FASTQ files into a single file. The code is elegantly short, and it will run quickly without using much computer memory!

# [Don't run this - for illustration purposes only]
cat fastq/*R1.fastq.gz > all_R1.fastq.gz
The above code will produce a valid compressed FASTQ file, because:
  • FASTQ files can (in principle) be concatenated freely since every read is a “stand-alone” unit of 4 lines.
  • Gzipped (.gz) files can also be concatenated freely!
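
A quick way to convince yourself (a sketch, assuming you had created all_R1.fastq.gz as above): the line count of the combined file should equal the summed line counts of the parts:

# [Don't run this - for illustration only]
zcat fastq/*R1.fastq.gz | wc -l
zcat all_R1.fastq.gz | wc -l    # Should print the same number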

Don’t redirect back to the input file!

One potential drawback of the “text streams” mode of operation is that you can’t redirect the output of Unix commands “back” to the input file: the shell truncates (empties) the output file before the command even starts reading it, so this will corrupt the input file.

# [Don't run this - for illustration only]
# You should NEVER run something like this:
sort metadata.tsv > metadata.tsv
Changing the original file

Therefore, if you really want to edit the original file instead of creating a separate edited copy, you will need two steps:

  1. Redirect to a new, edited copy of the input file
  2. Rename the copy to overwrite the original file
# [Don't run this - for illustration only]
# Step 1 - redirect to a new file:
sort metadata.tsv > metadata_sorted.tsv

# Step 2 - rename the new file to overwrite the original (if really needed!)
mv metadata_sorted.tsv metadata.tsv

However, our general recommendation is that whenever you edit files, you keep both the original and the edited version – so you won’t typically need the above roundabout method.


Footnotes

  1. Or use the keyboard shortcut Ctrl+`.

  2. Use double quotes by default, like in the examples; single quotes can generally also be used, though they exhibit some different behavior.