Working with files in the Unix shell II:
Viewing, summarizing and manipulating files
Week 3 – lecture C
1 Introduction
1.1 Context & overview
The Unix shell is great for many tasks other than the file browser-like ones you learned about in the previous lecture. One of those is performing basic viewing, querying, and editing operations on text files.
Recall from the Project File Organization lecture that you’ll work almost exclusively with so-called “plain-text” files in this course, such as:
- Sequence and annotation file formats such as FASTA, FASTQ, and GTF
- Spreadsheet-like/tabular files stored as TSV/CSV
- Scripts and code notebooks
- Documentation files such as Markdown
The commands we’ll discuss here can be used with plain-text files but not with “binary” formats like Excel or Word files.
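If you're ever unsure whether a file is plain text or binary, one quick way to check — a side note; this command isn't covered further in this course — is the `file` command, which reports a file's type:

```bash
# [Don't run this - hypothetical example]
file README.md   # would report something like: "README.md: ASCII text"
file data.xlsx   # would report something like: "data.xlsx: Microsoft Excel 2007+"
```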
1.2 Learning goals
In this lecture, you will learn to use the Unix shell to:
- View the contents of text files in various ways
- Search within, extract information from, and manipulate text files
1.3 Getting ready
Start a VS Code session like before:
Click here to see the instructions
- Log in to OSC's OnDemand portal at https://ondemand.osc.edu
- In the blue top bar, select `Interactive Apps`, and near the bottom, click `Code Server`
- Fill out the form as follows:
  - Cluster: `pitzer`
  - Account: `PAS2880`
  - Number of hours: `2`
  - Working Directory: `/fs/ess/PAS2880/users/<username>` (replace `<username>` with your user name)
  - Code Server version: `4.8.3`
- Click `Launch`
- Click the `Connect to VS Code` button once it appears
- In VS Code, open a terminal by clicking `Terminal` => `New Terminal`
- Check that you are in `/fs/ess/PAS2880/users/$USER` by typing `pwd` in the terminal. (Recall that `$USER` is a variable that represents your username. If you're not in that dir, it may be listed under `Recents` in the `Get Started` document – if so, click on that entry. Otherwise, click `File` > `Open Folder` and type/select `/fs/ess/PAS2880/users/$USER`.)
Open a Markdown file for notes:
- Click `File` > `New File`
- Save the file inside `/fs/ess/PAS2880/users/$USER/week03`, e.g. as `lectureC.md`

Change your working dir:

```bash
cd garrigos-data
```
2 Viewing the contents of text files
Several commands can view all or part of one or more text files, and we’ll discuss the most common ones below.
2.1 `cat`
The `cat` command prints the entire contents of one or more files to screen:
cat README.md
# README for `garrigos-data`
- Author: Jelmer Poelstra
- Affiliation: CFAES Bioinformatics Core, The Ohio State University
- Contact: <poelstra.1@osu.edu>
- URL: <https://github.com/jelmerp/garrigos-data>
- Date: 2024-01-20 (last updated: 2025-09-09)
This directory contains files associated with the paper
“Two avian _Plasmodium_ species trigger different transcriptional responses on their vector _Culex pipiens_”
([Garrigós et al. 2025, Molecular Ecology](https://doi.org/10.1111/mec.17240)).
The files are intended for practice purposes in the context of coursework and
other tutorials on omics / RNA-Seq data analysis.
Below follows a description of the files included in this directory.
## FASTQ files (in sub-directory `data/fastq`)
The FASTQ files are Illumina RNA-seq reads from _Culex pipiens_ samples.
These were downloaded from the European Nucleotide Archive ENA database using
accession number `PRJEB41609` and the tool
[`fastq-dl`](https://github.com/rpetit3/fastq-dl) v3.0.1 on 2024-01-20.
To simplify the dataset for practice purposes, the following modifications were made:
- FASTQ files from a number of samples were removed:
- 2 samples also excluded in the study itself (see the paper for details)
- All samples from the 21-day time point.
- FASTQ files were randomly "subset" to keep only 500,000 reads per file using the tool
[`seqtk`](https://github.com/lh3/seqtk) v1.3-r106.
## Metadata (in sub-directory `data/meta`)
Metadata from the study was downloaded from <https://doi.org/10.20350/digitalCSIC/15708>
and simplified to keep only:
- The sample ID and treatment columns
- The samples for which the FASTQ files were retained (see above).
## Reference annotation file (in sub-directory `data/ref`)
A reference genome GTF file for the RefSeq annotation of
_Culex pipiens_ genome `TS_CPP_V2` (`GCF_016801865.2`) was downloaded from NCBI
using the [NCBI Datasets tool](https://www.ncbi.nlm.nih.gov/datasets/docs/v2/download-and-install/)
v15.31.1 on 2024-01-20.
For the analysis of the data, the reference genome FASTA file is also needed.
This is not included here,
but can be downloaded from NCBI using the following command:
```bash
wget https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/016/801/865/GCF_016801865.2_TS_CPP_V2/GCF_016801865.2_TS_CPP_V2_genomic.fna.gz
```
cat meta/metadata.tsv
sample_id time treatment
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802878 10dpi control
ERR10802884 10dpi control
ERR10802877 10dpi control
ERR10802881 10dpi control
ERR10802876 10dpi relictum
ERR10802880 10dpi relictum
ERR10802885 10dpi relictum
ERR10802886 10dpi relictum
ERR10802864 24hpi cathemerium
ERR10802867 24hpi cathemerium
ERR10802870 24hpi cathemerium
ERR10802866 24hpi control
ERR10802869 24hpi control
ERR10802863 24hpi control
ERR10802871 24hpi relictum
ERR10802874 24hpi relictum
ERR10802865 24hpi relictum
ERR10802868 24hpi relictum
2.2 `head` and `tail`
Some files, especially when working with omics data, can be huge – so printing the whole file with `cat` is not always ideal. The twin commands `head` and `tail` can be useful, as they print only the first (`head`) or last (`tail`) lines of a file.
By default, `head` and `tail` print 10 lines:
head meta/metadata.tsv
sample_id time treatment
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802878 10dpi control
ERR10802884 10dpi control
ERR10802877 10dpi control
ERR10802881 10dpi control
ERR10802876 10dpi relictum
Use the `-n` option to specify the number of lines to print:
head -n 3 meta/metadata.tsv
sample_id time treatment
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
A neat trick with `tail` is to start at a specific line with `-n +<starting-line>`. This is often used to skip the header line:
# '-n +2' tells tail to start at line 2:
tail -n +2 meta/metadata.tsv
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802878 10dpi control
# [...output truncated...]
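A related trick: by combining `head` and `tail` with the "pipe" (`|`, which we'll cover later in this lecture), you can extract one specific line from a file — a small sketch:

```bash
# Print only line 5 of the file: take lines 1-5, then keep just the last one
head -n 5 meta/metadata.tsv | tail -n 1
```

ERR10802883 10dpi cathemerium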
Next, let's take a peek inside a FASTQ file. To print the first 8 lines, corresponding to 2 reads, use `-n 8` with `head`:
head -n 8 fastq/ERR10802863_R1.fastq.gz
�
Խے�8�E��_1f�"�QD�J��D�fs{����Yk����d��*��
|��x���l�j�N������?������ٔ�bUs�Ng�Ǭ���i;_��������������|<�v����3��������|���ۧ��3ĐHyƕ�bIΟD�%����Sr#~��7��ν��1y�Ai,4
w\]"b�#Q����8��+[e�3d�4H���̒�l�9LVMX��U*�M����_?���\["��7�s\<_���:�$���N��v�}^����sw�|�n;<�<�oP����
i��k��q�ְ(G�ϫ��L�^��=��<���K��j�_/�[ۭV�ns:��U��G�z�ݎ�j����&��~�F��٤ZN�'��r2z}�f\#��:�9$�����H�݂�"�@M����H�C�
�0�pp���1�O��I�H�P됄�.Ȣe��Q�>���
�'�;@D8���#��St�7k�g��|�A䉻���_���d�_c������a\�|�_�mn�]�9N������l�٢ZN�c�9u�����n��n�`��
"gͺ�
���H�?2@�FC�S$n���Ԓh� nԙj��望��f �?N@�CzUlT�&�h�Pt!�r|��9~)���e�A�77�h{��~�� ��
# [...output truncated...]
Ouch! 😳 What went wrong here? (Click for the solution)
We were presented with the contents of the compressed file, which isn't human-readable.

To get around the problem you just encountered with `head`, you might be inclined to decompress these files (which you could do with the `gunzip` command – week 5). However, at least when it comes to FASTQ files, it is better to keep them compressed:

- Uncompressed files take up much more disk storage space than compressed ones
- Almost any bioinformatics program accepts compressed FASTQ files
- You can view these files in compressed form, as shown below with `less`. Additionally, several commands including `cat` have a counterpart for compressed files (`zcat` in the case of `cat`) — see the sketch right after this list.
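For example, a minimal sketch with `zcat` (it decompresses to standard out, leaving the file on disk untouched; the `|` "pipe" that feeds its output into `head` is covered in section 3):

```bash
# Decompress to standard out, then show only the first 4 lines (= 1 read):
zcat fastq/ERR10802863_R1.fastq.gz | head -n 4
```

@ERR10802863.8435456 8435456 length=74
CAACGAATACATCATGTTTGCGAAACTACTCCTCCTCGCCTTGGTGGGGATCAGTACTGCGTACCAGTATGAGT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE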
Other sequence files, like assembly FASTA files and annotation GTF/GFF files, are often also compressed when you download them. These types of files, though, are more commonly decompressed before usage — e.g. because they don’t take up nearly as much space as a set of FASTQ files.
2.3 `less`: A file pager
The `less` command is rather different from the previous commands, which simply printed file contents to the screen and gave us our shell prompt back. Instead, `less` will open a file for you to browse through, and you need to explicitly quit the program to get your prompt back.
Also, `less` will automatically display gzip-compressed files in human-readable form: 🥳
less fastq/ERR10802863_R1.fastq.gz
@ERR10802863.8435456 8435456 length=74
CAACGAATACATCATGTTTGCGAAACTACTCCTCCTCGCCTTGGTGGGGATCAGTACTGCGTACCAGTATGAGT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@ERR10802863.27637245 27637245 length=74
GCCACACTTTTGAAGAACAGCGTCATTGTTCTTAATTTTGTCGGCAACGCCTGCACGAGCCTTCCACGTAAGTT
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEAEE<EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
@ERR10802863.10009244 10009244 length=73
CTCGGCGTTAACTTCATCACGCAGATCATTCCGTTCCAGCAGCTGAAGCAAGACTACCGTCAGTACGAGATGA
+
AAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
The `-S` option to `less` suppresses "line-wrapping" (Click to expand)
When lines are too long to fit on your computer screen, `less` will by default "wrap" them onto a next line on the screen. That's convenient for reading, but can be confusing for tabular formats and formats like FASTQ, because it means that one line on the screen no longer corresponds to one line in the file.
With the `-S` option to `less`, lines will not be wrapped but will "run off the screen" on the right-hand side (press the rightward arrow → to see that part):
less -S fastq/ERR10802863_R1.fastq.gz
# [output not shown]
When you're inside the `less` pager, you can move around in the file in several ways:

- By scrolling with your mouse
- With the up ↑ and down ↓ arrows: move line-by-line
- With u (up) and d (down): move half a page at a time
- If you have them, with the PgUp and PgDn keys: move page-by-page
- By pressing G to go to the end of the file, and g to go (back) to the top

To exit/quit `less` and get your shell prompt back, simply type q:

```bash
# Type 'q' to exit the less pager:
q
```
Exercise: `less`
With `less`, explore the FASTQ file a bit. Do you notice any unusual-looking reads?
Click for the solution
A number of reads are much shorter than the others, and consist only of `N`s, i.e. uncalled bases. For example:
@ERR10802863.11918285 11918285 length=35
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
+
###################################
2.4 `wc -l` to count lines
(This command doesn't display file contents and would fit better in the section on Unix data tools below, but it needs to be introduced for the next bit.)
The `wc` command by default counts the number of lines, words, and characters in its input — but it is most commonly used to count only lines, with the `-l` option:
wc -l meta/metadata.tsv
23 meta/metadata.tsv
This simple operation can be surprisingly useful, as the number of lines in many file types reflects the number of entries — here, 23 lines means 22 samples plus a header line.
3 Redirection and the pipe
First, create and move into a `sandbox` dir within `garrigos-data`:
mkdir sandbox
cd sandbox
3.1 Redirection
The regular output of a command that is printed to the screen (like a list of files by `ls`, or a number of lines by `wc -l`) is technically called "standard out", or "stdout" for short. Sometimes, you may want to do something else with this output, like storing it in a file. Luckily, this is easy to do with something called "redirection".
First, a reminder of what `echo` does without redirection — it simply prints the text we provide to the screen:
echo "My first line"
My first line
Now, redirect `echo`'s standard out to a new file `test.txt` using the `>` operator:
echo "My first line" > test.txt
No output was printed to the screen, because it went into the file instead:
cat test.txt
My first line
The `>` operator will redirect output to a file with the following rules:
- If the file doesn’t exist, it will be created.
- If the file does exist, any contents will be overwritten.
Redirect another line into that same file:
echo "My second line" > test.txt
cat test.txt
My second line
That may not have been what we intended! As explained above, the earlier file contents were overwritten. With `>>`, however, output is appended (added) to a file:
echo "My third line" >> test.txt
cat test.txt
My second line
My third line
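Note that `>>` also creates the file if it doesn't exist yet, so it's safe to append to a file that isn't there. A quick sketch with a made-up file name:

```bash
# 'brandnew.txt' doesn't exist yet, so '>>' creates it and adds the line:
echo "A line" >> brandnew.txt
cat brandnew.txt
```

A line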
3.2 The pipe
Say that you want to count the number of entries (files and subdirs) in a directory. You could do that as follows:
# First redirect 'ls' output to a file:
ls ../fastq > filelist.txt
# Then count the nr. of lines, which is the number of files in ../fastq:
wc -l filelist.txt
44 filelist.txt
That worked, but it took two separate commands, and you're left with a file `filelist.txt` that you'll probably want to remove, since it has served its sole purpose.
A more convenient way to perform this kind of operation is with a “pipe”, as follows:
ls ../fastq | wc -l
44
With the pipe, the output of the command on the left-hand side (a file listing, in this case) is redirected into the command on the right-hand side (`wc`, in this case). Like many other commands, `wc` will gladly accept input that way (i.e., via standard in) instead of via a file name argument.
Pipes are useful because they avoid having to write and read intermediate files — this saves typing, makes the operation quicker, and reduces file clutter. In the example above, you don't need to make a `filelist.txt` file just to count the number of files. We'll see pipes in action a bunch more in the next section.
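For instance, to quickly preview just the first few entries of that same directory listing (output shown assuming `ls`'s default alphabetical order):

```bash
# List the dir contents, but only print the first 3 lines:
ls ../fastq | head -n 3
```

ERR10802863_R1.fastq.gz
ERR10802863_R2.fastq.gz
ERR10802864_R1.fastq.gz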
Exercise: the pipe
What do you think the command below does? What will the resulting count represent?
cat ../fastq/ERR10802863_R1.fastq.gz ../fastq/ERR10802863_R2.fastq.gz | wc -l
Click for the solution
Like many Unix commands, `cat` accepts multiple arguments, i.e. it can operate on multiple files. In this case, `cat` would simply print the contents of the two files back-to-back (concatenate them!). Therefore, `wc -l` will count the total number of lines across the two files.
cat ../fastq/ERR10802863_R1.fastq.gz ../fastq/ERR10802863_R2.fastq.gz | wc -l
193612
Thinking one step further: can we get the total number of reads for sample `ERR10802863` across its two FASTQ files by dividing the result by 4? (Recall: one FASTQ read covers 4 lines.) No, we can't, since the files are compressed: the line count of a compressed file is not the same as that of its uncompressed contents.
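To count reads correctly, you can first decompress the data stream with `zcat` and only then count lines — a sketch (output not shown):

```bash
# Decompress to standard out, then count the uncompressed lines:
zcat ../fastq/ERR10802863_R1.fastq.gz ../fastq/ERR10802863_R2.fastq.gz | wc -l
```

Dividing the resulting count by 4 then does give the total number of reads across the two files.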
4 Unix data tools
We'll now turn to some commands that may be described as "Unix data tools". These commands are especially good for relatively simple data processing and summarizing steps, and they excel at dealing with very large files. They are therefore quite useful when working with sequence data files.
We will cover the following commands:
- `grep` to search for text in files
- `cut` to select one or more columns from tabular data
- `sort` to sort lines, or tabular data by column
- `uniq` to remove duplicates
We'll start by taking a look at one of the example data files, and by discussing tabular plain-text files.
4.1 Tabular plain-text files and file extensions
The examples below will use the file `meta/metadata.tsv`, so we'll first make a copy of that file and take another look at its first lines:
cp -v ../meta/metadata.tsv .
'../meta/metadata.tsv' -> './metadata.tsv'
head metadata.tsv
sample_id time treatment
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802878 10dpi control
ERR10802884 10dpi control
ERR10802877 10dpi control
ERR10802881 10dpi control
ERR10802876 10dpi relictum
“Tabular” files like this one contain data that is arranged in a rows-and-columns format, i.e. as in a table or an Excel spreadsheet. Because plain-text files do not have an intrinsic way to define columns, certain characters are used as column “delimiters” in plain-text tabular files. Most commonly, these are:
- A Tab; such files are often stored with a `.tsv` extension, for Tab-Separated Values ("TSV file").
- A comma; such files are often stored with a `.csv` extension, for Comma-Separated Values ("CSV file").
A side note on plain-text file extensions like `.txt`, `.csv`, and `.tsv`, but also those for sequence files like `.fastq` as well as script files like `.R` or `.sh`:
Different types of plain text files, like those in the examples above, are fundamentally the same, i.e. “just plain-text files”. Therefore, changing the file extension does not change anything about the file. Instead, different file extensions are used primarily to make it clear to humans (as opposed to the computer) what the file contains.
4.2 `grep` to print lines that match a pattern
The `grep` command is extremely useful: it finds specific text (or text patterns) in a file. By default, it will print each line that contains a "match" in full.
Its basic syntax is `grep "<pattern>" <file-path>`. For example, to print all lines from `metadata.tsv` that contain "cathemerium":
grep "cathemerium" metadata.tsv
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802864 24hpi cathemerium
ERR10802867 24hpi cathemerium
ERR10802870 24hpi cathemerium
While not always necessary, it's good practice to consistently use quotes (`"..."`) around the search pattern, like above.
Instead of printing matching lines, you can also count them with the `-c` option. For example, how many control samples are in the dataset?
grep -c "control" metadata.tsv
7
The option `-v` inverts `grep`'s behavior and prints all lines that do not match the pattern. For example, you can combine `-v` and `-c` to count lines that do not contain the text "control":
grep -vc "control" metadata.tsv
16
Exercise: `grep`
What output do you expect from the command below? Next, check if you were correct by executing the command.
grep -vc "contro" metadata.tsv
So, how does `grep`'s behavior differ from using the `*` shell wildcard?
Click for the solutions
The command gives the same output as the previous example, i.e. it successfully matches the lines with `control`:
grep -vc "contro" metadata.tsv
16
While the initial examples also implied this, the above example should make it abundantly clear that `grep` does not need to match entire lines or words.
This behavior is very different from globbing with the `*` wildcard, where the pattern has to match the entire file name!
`grep` options
`grep` has many other useful options, such as:

- `-i` to ignore case (uppercase vs. lowercase) when searching
- `-n` to print the line number for each matching line
- `-A <n>` and `-B <n>` to print `<n>` lines after and before each match
- `-w` to make a pattern match whole "words" only (demonstrated just below)
- `-r` to search files in directories recursively (note that even without `-r`, `grep` can operate on multiple files)
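As a quick demonstration of `-w`: with the whole-word requirement, the partial pattern from the exercise above no longer matches anything, so the match count drops to zero:

```bash
# 'contro' never occurs as a whole word, so '-c' reports 0 matches:
grep -cw "contro" metadata.tsv
```

0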
4.3 Operating on compressed files
`grep` also has a counterpart for compressed files: `zgrep`, which otherwise works identically.
# Search for long non-coding RNAs (lncRNA) in the GTF file:
zgrep "lncRNA" ../ref/GCF_016801865.2.gtf.gz | head -n 2
NC_068937.1 Gnomon gene 76295 82374 . - . gene_id "LOC128092783"; transcript_id ""; db_xref "GeneID:128092783"; description "uncharacterized LOC128092783"; gbkey "Gene"; gene "LOC128092783"; gene_biotype "lncRNA";
NC_068937.1 Gnomon gene 82671 86331 . + . gene_id "LOC120427727"; transcript_id ""; db_xref "GeneID:120427727"; description "uncharacterized LOC120427727"; gbkey "Gene"; gene "LOC120427727"; gene_biotype "lncRNA";
Another idiom is to pipe the output of `zcat` to `grep` — or to any other command! For example:
# Search for carbonic anhydrase-related proteins in the GTF file:
zcat ../ref/GCF_016801865.2.gtf.gz | grep "carbonic anhydrase-related protein" | head -n 2
NC_068937.1 Gnomon gene 398255 486969 . - . gene_id "LOC120422535"; transcript_id ""; db_xref "GeneID:120422535"; description "carbonic anhydrase-related protein 10"; gbkey "Gene"; gene "LOC120422535"; gene_biotype "protein_coding";
NC_068937.1 Gnomon transcript 398255 486969 . - . gene_id "LOC120422535"; transcript_id "XM_039585997.2"; db_xref "GeneID:120422535"; experiment "COORDINATES: polyA evidence [ECO:0006239]"; gbkey "mRNA"; gene "LOC120422535"; model_evidence "Supporting evidence includes similarity to: 10 Proteins"; product "carbonic anhydrase-related protein 10, transcript variant X1"; transcript_biotype "mRNA";
In this week’s exercises and next week’s assignment, you’ll further explore this GTF file.
4.4 Selecting columns using `cut`
The `cut` command selects, or we could say "cuts out", one or more columns from a tabular file. You always have to use its `-f` ("field") option to specify the desired column(s):
# Select the second column of the file:
cut -f 2 metadata.tsv
time
10dpi
10dpi
10dpi
10dpi
10dpi
10dpi
[...output truncated...]
In many cases, with `cut` and other commands alike, it can be useful to pipe the output to `head` to quickly see if your command works, without printing a large number (sometimes thousands) of lines:
cut -f 2 metadata.tsv | head -n 3
time
10dpi
10dpi
To select multiple columns, use a comma-separated list, or a range with `-`:
cut -f 1,3 metadata.tsv | head -n 3
sample_id treatment
ERR10802882 cathemerium
ERR10802875 cathemerium
cut -f 1-2 metadata.tsv | head -n 3
sample_id time
ERR10802882 10dpi
ERR10802875 10dpi
Note that it is not possible to change the order of columns with `cut`!
`cut`'s column delimiter
The default column delimiter that `cut` expects is a Tab. Therefore, we didn't have to specify the delimiter in the above examples. But when a file has a different column delimiter, use the `-d` option – e.g. `-d ,` for a CSV file (note that `-f` is still needed to pick the column):

```bash
# [Don't run this, hypothetical example]
# Select the second column in a comma-delimited (CSV) file:
cut -d , -f 2 my-data.csv
```
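If you want to see `-d` in action right away, here is a tiny runnable sketch that first creates a made-up CSV file (`demo.csv`) with `printf`:

```bash
# Create a minimal 3-line CSV file, then select its second column:
printf "id,treatment\ns1,control\ns2,relictum\n" > demo.csv
cut -d , -f 2 demo.csv
```

treatment
control
relictum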
4.5 Combining `cut`, `sort`, and `uniq` to create a list
Say you want an alphabetically sorted list of the different treatments that appear in `metadata.tsv`. (That's a bit trivial because this file is small enough to see that information at a glance, but you could also use the following code to operate on a huge genome annotation file, as you'll do later.)
To do this, you'll need to learn about two new commands:

- `sort` to sort/order/arrange rows, by default in alphanumeric order
- `uniq` to remove duplicate entries (i.e., keep all distinct/unique values) from a sorted file/list
We'll build up a small "pipeline" to do this step-by-step, piping the output into `head` at every step. First, get rid of the header line with the `tail` trick mentioned earlier:
tail -n +2 metadata.tsv | head
ERR10802882 10dpi cathemerium
ERR10802875 10dpi cathemerium
ERR10802879 10dpi cathemerium
ERR10802883 10dpi cathemerium
ERR10802878 10dpi control
ERR10802884 10dpi control
ERR10802877 10dpi control
ERR10802881 10dpi control
ERR10802876 10dpi relictum
ERR10802880 10dpi relictum
Second, select the column of interest with `cut`:
tail -n +2 metadata.tsv | cut -f 3 | head
cathemerium
cathemerium
cathemerium
cathemerium
control
control
control
control
relictum
relictum
Third, use `sort` to alphabetically sort the result:
tail -n +2 metadata.tsv | cut -f 3 | sort | head
cathemerium
cathemerium
cathemerium
cathemerium
cathemerium
cathemerium
cathemerium
control
control
control
Finally, use `uniq` to keep only unique (distinct) values – and get rid of `head` since we now have our final command:
tail -n +2 metadata.tsv | cut -f 3 | sort | uniq
cathemerium
control
relictum
Great!
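As a side note, you can combine this pipeline with the redirection from section 3, e.g. to save the list to a (hypothetical) file `treatments.txt` instead of printing it:

```bash
# Redirect the pipeline's final output to a file:
tail -n +2 metadata.tsv | cut -f 3 | sort | uniq > treatments.txt
cat treatments.txt
```

cathemerium
control
relictum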
Generating a count table
With a very small modification to this pipeline, you can generate a "count table" instead of a simple list! You just have to add `uniq`'s `-c` option (for count):
tail -n +2 metadata.tsv | cut -f 3 | sort | uniq -c
7 cathemerium
7 control
8 relictum
And this count table can in turn be sorted by most frequent occurrence — to do this, use `sort`'s `-n` option for numeric sorting together with `-r` for reverse sorting, so the largest numbers go first:
tail -n +2 metadata.tsv | cut -f 3 | sort | uniq -c | sort -nr
8 relictum
7 control
7 cathemerium
Above, we used `sort` to simply sort a list. More generally, `sort` will by default perform sorting based on the entire line. To sort based on one or more columns, the way you've probably done in Excel, use the `-k` option — for example:
# Sort based on the third column:
tail -n +2 metadata.tsv | sort -k3
`-k` takes a start and a stop column to sort by, so if you want to sort strictly based on one column at a time, use this syntax:
# Explicitly indicate to sort based on the third column only:
tail -n +2 metadata.tsv | sort -k3,3
To sort first by one column and then by another, use `-k` multiple times:
# Sort first by the first column, then break ties using the second column:
tail -n +2 metadata.tsv | sort -k3,3 -k2,2
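One more note on `sort`: by default it treats whitespace (including Tabs) as the column separator, but like `cut`'s `-d`, it has a `-t` option to set a different delimiter:

```bash
# [Don't run this, hypothetical example]
# Sort a CSV file by its second column, using ',' as the delimiter:
sort -t , -k 2,2 my-data.csv
```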
Exercise
Using commands similar to the last examples above, check whether:
- The `10dpi` and `24hpi` time points have the same number of samples
Click for the solution
No, there are 12 `10dpi` samples and 10 `24hpi` samples:
tail -n +2 metadata.tsv | cut -f 2 | sort | uniq -c
12 10dpi
10 24hpi
- Any duplicate sample IDs are present in `metadata.tsv`
Click for the solution
No duplicate sample IDs are present.
The most elegant way to see this is by adding `sort -nr | head -n 1` at the end: if the single sample shown has a count of 1, this means there are no duplicates.
tail -n +2 metadata.tsv | cut -f 1 | sort | uniq -c | sort -nr | head -n 1
1 ERR10802886
Omitting the `head -n 1` or even the `sort -nr` would also work in this case.
5 The Unix philosophy and data streaming
Today, and in particular with these Unix data tool examples, we saw the Unix philosophy in action:
This is the Unix philosophy: Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface.
— Doug McIlroy
Some advantages of a modular approach are that it’s easier to spot errors, to swap out components, and to learn.
Text/data streaming
What about those text “streams” mentioned in the quote above? Rather than loading entire files into memory, Unix tools process them one line at a time.
This is particularly useful when working with large files! For example, `head` will print results instantly even for a 10 TB file that could never be loaded into memory. By contrast, try to take a peek at such a file with a GUI text editor, and your computer will likely freeze or crash.
Here is one other example, combining the `cat` command with globbing and redirection. This command would concatenate (combine) all your R1 FASTQ files into a single file. This is both elegantly short code, and it will run quickly without using much computer memory!
# [Don't run this - for illustration purposes only]
cat fastq/*R1.fastq.gz > all_R1.fastq.gz
- FASTQ files can (in principle) be concatenated freely since every read is a “stand-alone” unit of 4 lines.
- Gzipped (`.gz`) files can also be concatenated freely — see the sketch below!
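To convince yourself of that last point, here is a minimal sketch (with hypothetical files `a.txt` and `b.txt`) showing that gzipped files concatenated with `cat` still decompress correctly:

```bash
# [Don't run this - hypothetical example]
echo "first file" > a.txt
echo "second file" > b.txt
gzip a.txt b.txt               # compress each file: creates a.txt.gz and b.txt.gz
cat a.txt.gz b.txt.gz > ab.gz  # concatenate the two compressed files
zcat ab.gz                     # prints "first file", then "second file"
```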
Don’t redirect back to the input file!
One potential drawback of the "text streams" mode of operation is that you can't redirect the output of Unix commands "back" to the input file: doing so corrupts the input file, because the shell truncates (empties) the output file before the command even starts reading it.
# [Don't run this - for illustration only]
# You should NEVER run something like this:
sort metadata.tsv > metadata.tsv
Therefore, if you really want to edit the original file instead of creating a separate edited copy, you will need two steps:
- Redirect to a new, edited copy of the input file
- Rename the copy to overwrite the original file
# [Don't run this - for illustration only]
# Step 1 - redirect to a new file:
sort metadata.tsv > metadata_sorted.tsv
# Step 2 - rename the new file to overwrite the original (if really needed!)
mv metadata_sorted.tsv metadata.tsv
However, our general recommendation is that whenever you edit files, you keep both the original and the edited version – so you won’t typically need the above roundabout method.