Week 3 exercises: Files in the Unix shell
Introduction
In these exercises, you’ll practice with Unix commands, especially those that operate on files as seen in this week’s classes.
In general, you should use Unix commands for all exercises. E.g., if you are asked to check a file size, don’t do so in the OnDemand file browser!
Setting up
- Start a VS Code session through OSC OnDemand in the folder `/fs/ess/PAS2880/users/$USER`.
- Create and use a Markdown file to keep notes on what you’re doing.
- You’ll use the Garrigós et al. (2025) data files, which you should have stored as `garrigos-data` in the above dir.
Instructions if you need to (re)download these files:
# Make sure you're in your personal dir:
cd /fs/ess/PAS2880/users/$USER
git clone https://github.com/jelmerp/garrigos-data
Prelude: Lecture pages exercises
Make sure to do any of the exercises on this week’s lecture pages that we did not do in class, up until section 3 (Redirection and the pipe) of the third lecture. (We will cover sections 4 and 5 next Tuesday, and you won’t need to know about those commands to do the rest of the exercises below.)
Also, if you struggled with any of the exercises we did do in class, go over those again.
Exercise 1: Miscellaneous
Your starting working dir should be `/fs/ess/PAS2880/users/$USER`.
1. Create a dir `week03/ex1` within `/fs/ess/PAS2880/users/$USER`, and navigate there. Use this as the working dir for all following questions in this exercise, so avoid navigating anywhere else.
2. What is the size of the `metadata.tsv` file of the Garrigós et al. (2025) data set?
3. Copy the `ref` dir of the Garrigós et al. (2025) data set to your current working dir, giving the new copy the name `ref_copy`.
4. Run the following command (we did not see this one in class!):
   rmdir ref_copy
   What is this trying to do and why does it fail? (Hint: check the callout box in the `rm` section of the lecture page.)
5. Move the GTF file in `ref_copy` to your current working dir, keeping the same file name.
6. Do you think you will now be able to successfully run `rmdir ref_copy`? Try it.
Exercise 2: Moving, copying and removing dummy FASTQ files
1. Create a dir `week03/ex2` within `/fs/ess/PAS2880/users/$USER`, and navigate there. Use this as the working dir for all following questions in this exercise, so avoid navigating anywhere else.
2. Create a new dir `seqs`, and use the below command to create some dummy FASTQ files in there. Then, check the result with `ls` and use a command to get a count of the number of files you created.
   # [This command uses a trick called 'brace expansion' to create many files at once.]
   # [You don't need to understand how this works.]
   touch seqs/sample{01..25}_R{1,2}.fastq
3. Copy the dummy FASTQ files for samples `01` through `09` into your working dir with a single command.
4. Remove the files you just copied and when doing so, make the command report what it’s doing.
5. Move all R1 files into a dir `seqs_R1`, and then rename `seqs`, which should now only contain R2 files, to `seqs_R2`.
Exercise 3: Nuances of `cp` and `mv` behavior
- Create a dir `week03/ex3` within `/fs/ess/PAS2880/users/$USER`, and navigate there.
- Create a new file called `newfile.txt` in your working dir.
For both `cp` and `mv`, when operating on files (and this works equivalently for directories):
- If the destination path is an existing dir, the file will go into that dir and keep its original name.
- If the destination path is not an existing dir, the (last bit of the) destination specifies the new file name.
- A trailing slash in the destination path makes explicit that you are referring to a dir and not a file.
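The first two rules above can be tried out safely in a throwaway dir. The sketch below is only an illustration: the names `notes.txt`, `archive`, `notes_backup.txt` and `old_notes.txt` are made up for this example and are not part of the exercise data.

```shell
# Work in a temporary scratch dir so nothing in your course dir is touched
scratch=$(mktemp -d)
cd "$scratch"

touch notes.txt          # hypothetical example file
mkdir archive            # 'archive' is now an existing dir

# Rule 1: the destination is an existing dir,
# so the file goes into that dir and keeps its name
cp notes.txt archive
ls archive

# Rule 2: the destination is not an existing dir,
# so the last bit of the destination becomes the new file name
cp notes.txt notes_backup.txt
mv notes_backup.txt old_notes.txt
ls
```

After running this, `archive` should contain `notes.txt`, and the scratch dir itself should contain `archive`, `notes.txt` and `old_notes.txt`.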
With that in mind, try to answer the below questions about the following command:
cp newfile.txt more_data/
- What do you think the command would do or attempt to do?
- Do you think the command will succeed? Run the command and see whether you were right.
- What would the command have done if you had omitted the trailing forward slash from `more_data/`?
Exercise 4: Exploring the GTF annotation file
GTF file info
This exercise will work with the GTF file in the example dataset, so we’ll start with some more info about GTF files.
As mentioned, the file typically starts with header lines that contain metadata and start with `#`. Then, the main part of a GTF file is essentially a table, with:
- One row for each annotated “genomic feature” (gene, exon, intron, etc.)
- Columns with information like the genomic coordinates of the features
Here is a snippet of the table part of our Culex pipiens GTF file (also shown in the lecture), with manually added column names:
# [Column names are not present in the file! They are shown for informational purposes.]
seqname source feature start end score strand frame attributes
NC_068937.1 Gnomon gene 2046 110808 . + . gene_id "LOC120427725"; transcript_id ""; db_xref "GeneID:120427725"; description "homeotic protein deformed"; gbkey "Gene"; gene "LOC120427725"; gene_biotype "protein_coding";
NC_068937.1 Gnomon transcript 2046 110808 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "mRNA"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA";
NC_068937.1 Gnomon exon 2046 2531 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "1";
NC_068937.1 Gnomon exon 52113 52136 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "2";
Some details on what’s in each column:
Column # | Column ID | Explanation |
---|---|---|
1 | seqname | Name of the chromosome, scaffold, or contig |
2 | source | Name of the program that generated this feature, or the data source (e.g. database) |
3 | feature | Name of the genomic feature type, e.g. “gene”, “exon”, “intron”, “CDS” |
4 | start | Start position of the feature |
5 | end | End position of the feature |
6 | score | A confidence score for the feature, often absent (in which case it is `.`) |
7 | strand | Whether the feature is on the `+` (forward) or `-` (reverse) strand¹ |
8 | frame | The “frame” for coding features, either `0`, `1` or `2` (or `.` for no frame). E.g. `0` means the first base of the feature is the first base of a codon |
9 | attributes | A semicolon-separated list of key-value pairs with additional information. Can be very wide! |
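To make the column layout concrete, here is a minimal sketch on a tiny, made-up GTF-like file. The file `toy.gtf` and its records are invented for illustration and are not taken from the real annotation. The sketch also uses `grep` and `cut`, which may not have been covered in class yet, so treat it as a preview and not as something you need for the exercises below.

```shell
# Work in a temporary scratch dir
scratch=$(mktemp -d)
cd "$scratch"

# Build a tiny tab-delimited, GTF-like file (toy records, not real data)
printf '#gtf-version 2.2\n' > toy.gtf
printf 'chr1\ttoy\tgene\t100\t900\t.\t+\t.\tgene_id "g1";\n' >> toy.gtf
printf 'chr1\ttoy\texon\t100\t400\t.\t+\t.\tgene_id "g1"; exon_number "1";\n' >> toy.gtf
printf 'chr1\ttoy\texon\t600\t900\t.\t+\t.\tgene_id "g1"; exon_number "2";\n' >> toy.gtf

# Skip header lines (starting with '#'), then keep columns 3-5:
# feature, start and end
grep -v '^#' toy.gtf | cut -f 3-5

# Count rows whose feature column is 'exon' (the word 'exon' between two tabs)
n_exons=$(grep -c "$(printf '\texon\t')" toy.gtf)
echo "$n_exons"
```

Because the columns are separated by tabs, `cut -f` can select them by number, which matches the "Column #" values in the table above.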
Your turn!
1. Create a dir `week03/ex4` within `/fs/ess/PAS2880/users/$USER`, and navigate there.
2. Copy the genome annotation GTF file `garrigos-data/ref/GCF_016801865.2.gtf.gz` to a file called `annot.gtf.gz` inside your `ex4` working dir.
3. Explore `annot.gtf.gz` with `less -S`. Just look around a bit and try to navigate with the keybindings shown in the lecture, then consider the following questions:
   - How many header lines does the file have?
   - Do you understand the file’s structure?
   - Is the `-S` option useful here?
4. `zcat` is a command that is essentially identical to `cat`, but can print compressed files in human-readable form. Pipe the output of `zcat` to another command to print only the first 15 lines of `annot.gtf.gz`.
5. Count the number of lines in the GTF file both in compressed and uncompressed form. Are the counts the same?
   Hint: Use the `zcat`-then-pipe trick from the previous question to get the “uncompressed count”. The “compressed count”, on the other hand, is simply the number of lines in `annot.gtf.gz`.
Solutions
Exercise 1
1. Create a new dir and navigate there
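This assumes you start in `/fs/ess/PAS2880/users/$USER`. If the `week03` dir does not exist yet, `mkdir` needs the `-p` option to create it along the way:
mkdir -p week03/ex1
cd week03/ex1
Or, with absolute paths:
mkdir -p /fs/ess/PAS2880/users/$USER/week03/ex1
cd /fs/ess/PAS2880/users/$USER/week03/ex1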
2. What is the size of the `metadata.tsv` file?
It is 633 bytes:
ls -lh ../../garrigos-data/meta/metadata.tsv
-rw-rw----+ 1 jelmer PAS0471 633 Sep 7 09:05 metadata.tsv
3. Copy the `ref` dir to `ref_copy`
Because you are copying a dir, you will need the `-r` (recursive) option:
cp -r ../../garrigos-data/ref ref_copy
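To see why `-r` is needed, you could try the following sketch in a throwaway dir. The names `src_dir`, `copy_plain` and `copy_recursive` are hypothetical and only serve this illustration.

```shell
# Work in a temporary scratch dir
scratch=$(mktemp -d)
cd "$scratch"

mkdir src_dir
touch src_dir/a.txt src_dir/b.txt

# Without -r, cp refuses to copy a dir and reports an error
cp src_dir copy_plain || echo "cp without -r failed, as expected"

# With -r, the dir and everything inside it is copied
cp -r src_dir copy_recursive
ls copy_recursive
```

The exact wording of the error message differs between systems, but in both cases no `copy_plain` is created.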
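4. Run the `rmdir` command
rmdir ref_copy
rmdir: failed to remove 'ref_copy': Directory not empty
The `rmdir` command tries to remove the dir `ref_copy`, but it only removes empty dirs. Since `ref_copy` still contains files, it fails.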
5. Move the GTF file in `ref_copy` to your current working dir
To keep the same filename, remember to simply use the `.` shortcut for the current working dir as the destination:
mv ref_copy/GCF_016801865.2.gtf.gz .
6. Do you think you will now be able to successfully run `rmdir ref_copy`?
Yes, because the `ref_copy` dir is now empty, this will work:
# [No output means it succeeds!]
rmdir ref_copy
Let’s check: indeed, the `ref_copy` dir is no longer there:
ls
GCF_016801865.2.gtf.gz
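The behavior of `rmdir` in questions 4 to 6 can be reproduced on a small scale. This is a minimal sketch in a throwaway dir, with the made-up names `demo_dir` and `file.txt`:

```shell
# Work in a temporary scratch dir
scratch=$(mktemp -d)
cd "$scratch"

mkdir demo_dir
touch demo_dir/file.txt

# rmdir only removes empty dirs, so this fails while file.txt is still inside
rmdir demo_dir || echo "rmdir failed: dir not empty"

# Empty the dir first (here by moving the file out); now rmdir succeeds silently
mv demo_dir/file.txt .
rmdir demo_dir
ls
```

That `rmdir` refuses to remove non-empty dirs makes it a safer choice than `rm -r` when you expect a dir to be empty.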
Exercise 2
1. Create a new dir and navigate there
Assuming you’re doing this right after exercise 1, you should be in `/fs/ess/PAS2880/users/$USER/week03/ex1`, in which case the following commands work:
mkdir ../ex2
cd ../ex2
Or, with absolute paths:
mkdir /fs/ess/PAS2880/users/$USER/week03/ex2
cd /fs/ess/PAS2880/users/$USER/week03/ex2
2. Create a dir with dummy sequence files
Create the dir:
mkdir seqs
Create the dummy files:
touch seqs/sample{01..25}_R{1,2}.fastq
This uses a construct called brace expansion, which we won’t cover in class.
Check the result:
ls seqs
sample01_R1.fastq  sample03_R1.fastq  sample05_R1.fastq  sample07_R1.fastq  sample09_R1.fastq  sample11_R1.fastq  sample13_R1.fastq  sample15_R1.fastq  sample17_R1.fastq  sample19_R1.fastq  sample21_R1.fastq  sample23_R1.fastq  sample25_R1.fastq
sample01_R2.fastq  sample03_R2.fastq  sample05_R2.fastq  sample07_R2.fastq  sample09_R2.fastq  sample11_R2.fastq  sample13_R2.fastq  sample15_R2.fastq  sample17_R2.fastq  sample19_R2.fastq  sample21_R2.fastq  sample23_R2.fastq  sample25_R2.fastq
sample02_R1.fastq  sample04_R1.fastq  sample06_R1.fastq  sample08_R1.fastq  sample10_R1.fastq  sample12_R1.fastq  sample14_R1.fastq  sample16_R1.fastq  sample18_R1.fastq  sample20_R1.fastq  sample22_R1.fastq  sample24_R1.fastq
sample02_R2.fastq  sample04_R2.fastq  sample06_R2.fastq  sample08_R2.fastq  sample10_R2.fastq  sample12_R2.fastq  sample14_R2.fastq  sample16_R2.fastq  sample18_R2.fastq  sample20_R2.fastq  sample22_R2.fastq  sample24_R2.fastq
Count the number of files:
ls seqs | wc -l
50
Note that this gives the correct count even though when printed to screen, `ls` output typically uses multiple columns and therefore shows more than one file per line.
In contrast, the count from `ls -l` is not correct, since its output includes a header line (the line starting with `total`):
ls -l seqs | wc -l
51
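The off-by-one difference between the two counts is easy to check on a small scale. A minimal sketch, using three made-up files in a throwaway dir:

```shell
# Work in a temporary scratch dir with exactly three files
scratch=$(mktemp -d)
cd "$scratch"
touch a.txt b.txt c.txt

# When its output is piped, ls prints one entry per line,
# so this counts the files correctly
n_plain=$(ls | wc -l)

# ls -l adds a 'total' line at the top, so this count is one too high
n_long=$(ls -l | wc -l)

echo "ls: $n_plain, ls -l: $n_long"
```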
3. Copy the files for samples 01 through 09 into your working dir with a single command
Copy the files. Recall that `.` indicates your current working dir as the destination path:
cp seqs/sample0* .
Let’s check the result:
ls
sample01_R1.fastq  sample02_R1.fastq  sample03_R1.fastq  sample04_R1.fastq  sample05_R1.fastq  sample06_R1.fastq  sample07_R1.fastq  sample08_R1.fastq  sample09_R1.fastq  seqs
sample01_R2.fastq  sample02_R2.fastq  sample03_R2.fastq  sample04_R2.fastq  sample05_R2.fastq  sample06_R2.fastq  sample07_R2.fastq  sample08_R2.fastq  sample09_R2.fastq
4. Remove the files you just copied
rm -v sample0*
removed 'sample01_R1.fastq'
removed 'sample01_R2.fastq'
removed 'sample02_R1.fastq'
removed 'sample02_R2.fastq'
removed 'sample03_R1.fastq'
removed 'sample03_R2.fastq'
removed 'sample04_R1.fastq'
removed 'sample04_R2.fastq'
removed 'sample05_R1.fastq'
removed 'sample05_R2.fastq'
removed 'sample06_R1.fastq'
removed 'sample06_R2.fastq'
removed 'sample07_R1.fastq'
removed 'sample07_R2.fastq'
removed 'sample08_R1.fastq'
removed 'sample08_R2.fastq'
removed 'sample09_R1.fastq'
removed 'sample09_R2.fastq'
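Since `rm` is irreversible, one cautious habit (a suggestion, not something the exercise requires) is to preview what a wildcard pattern matches with `ls` before passing the same pattern to `rm`. A minimal sketch with made-up file names in a throwaway dir:

```shell
# Work in a temporary scratch dir
scratch=$(mktemp -d)
cd "$scratch"
touch sample01.fastq sample02.fastq sample10.fastq sample11.fastq

# Preview: which files does the pattern match?
# (sample10 and sample11 should not be listed)
ls sample0*

# Only once the preview looks right, remove the files and report each removal
rm -v sample0*
ls
```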
5. Move all R1 files into a dir `seqs_R1` and rename `seqs`
Create the new dir:
mkdir seqs_R1
Move the R1 files into the new dir:
mv seqs/*R1.fastq seqs_R1
Rename the old dir from `seqs` to `seqs_R2`:
mv seqs seqs_R2
Check the result:
tree -C
.
├── seqs_R1
│   ├── sample01_R1.fastq
│   ├── sample02_R1.fastq
│   ├── sample03_R1.fastq
│   ├── sample04_R1.fastq
│   ├── sample05_R1.fastq
│   ├── sample06_R1.fastq
│   ├── sample07_R1.fastq
│   ├── sample08_R1.fastq
│   ├── sample09_R1.fastq
│   ├── sample10_R1.fastq
│   ├── sample11_R1.fastq
│   ├── sample12_R1.fastq
│   ├── sample13_R1.fastq
│   ├── sample14_R1.fastq
│   ├── sample15_R1.fastq
│   ├── sample16_R1.fastq
│   ├── sample17_R1.fastq
│   ├── sample18_R1.fastq
│   ├── sample19_R1.fastq
│   ├── sample20_R1.fastq
│   ├── sample21_R1.fastq
│   ├── sample22_R1.fastq
│   ├── sample23_R1.fastq
│   ├── sample24_R1.fastq
│   └── sample25_R1.fastq
└── seqs_R2
    ├── sample01_R2.fastq
    ├── sample02_R2.fastq
    ├── sample03_R2.fastq
    ├── sample04_R2.fastq
    ├── sample05_R2.fastq
    ├── sample06_R2.fastq
    ├── sample07_R2.fastq
    ├── sample08_R2.fastq
    ├── sample09_R2.fastq
    ├── sample10_R2.fastq
    ├── sample11_R2.fastq
    ├── sample12_R2.fastq
    ├── sample13_R2.fastq
    ├── sample14_R2.fastq
    ├── sample15_R2.fastq
    ├── sample16_R2.fastq
    ├── sample17_R2.fastq
    ├── sample18_R2.fastq
    ├── sample19_R2.fastq
    ├── sample20_R2.fastq
    ├── sample21_R2.fastq
    ├── sample22_R2.fastq
    ├── sample23_R2.fastq
    ├── sample24_R2.fastq
    └── sample25_R2.fastq

2 directories, 50 files
Exercise 3
1. Create a new dir and navigate there
Assuming you’re doing this right after exercise 2, you should be in `/fs/ess/PAS2880/users/$USER/week03/ex2`, in which case the following commands work:
mkdir ../ex3
cd ../ex3
Or, with absolute paths:
mkdir /fs/ess/PAS2880/users/$USER/week03/ex3
cd /fs/ess/PAS2880/users/$USER/week03/ex3
2. Create a new file called `newfile.txt` in your working dir
touch newfile.txt
3. What do you think the command would do or attempt to do?
Because we added a trailing forward slash to `more_data/`, we are making clear that we are referring to a directory. So the command will attempt to copy the file into a dir `more_data`, with the file keeping the same name.
4. Do you think the command will succeed?
The `more_data/` dir does not exist, and `cp` will not create a dir on the fly, so it will fail:
cp newfile.txt more_data/
cp: cannot create regular file ‘more_data/’: Not a directory
5. What would the command have done if you had omitted the trailing forward slash?
If you had omitted the trailing forward slash, it would have created a copy of the file with file name `more_data` (if that seems confusing: note that in Unix, files don’t need to have a file extension).
P.S.: To copy the file according to the original intention, first create the destination dir and then copy:
mkdir more_data
cp newfile.txt more_data/
Note also that once the `more_data` dir exists, it does not make a difference whether or not you use a trailing slash.
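The three situations discussed in this exercise can be run back to back in a throwaway dir. This is a sketch for illustration: it removes the accidentally created `more_data` file in between, a step that is not part of the exercise itself.

```shell
# Work in a temporary scratch dir
scratch=$(mktemp -d)
cd "$scratch"
touch newfile.txt

# 1. Destination dir does not exist and has a trailing slash: cp fails
cp newfile.txt more_data/ || echo "cp failed: more_data/ does not exist"

# 2. Same destination without the slash: cp creates a regular FILE named more_data
cp newfile.txt more_data
ls -l more_data

# Remove that file so a dir with the same name can be created
rm more_data

# 3. Once the dir exists, the trailing slash no longer matters
mkdir more_data
cp newfile.txt more_data/
cp newfile.txt more_data      # same result: overwrites more_data/newfile.txt
ls more_data
```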
Exercise 4
1. Create a new dir and navigate there
Assuming you’re doing this right after exercise 3, you should be in `/fs/ess/PAS2880/users/$USER/week03/ex3`, in which case the following commands work:
mkdir ../ex4
cd ../ex4
Or, with absolute paths:
mkdir /fs/ess/PAS2880/users/$USER/week03/ex4
cd /fs/ess/PAS2880/users/$USER/week03/ex4
2. Copy the genome annotation file
cp ../../garrigos-data/ref/GCF_016801865.2.gtf.gz annot.gtf.gz
3. Explore `annot.gtf.gz` with `less -S`
- The file has 4 header lines:
less -S annot.gtf.gz
#gtf-version 2.2
#!genome-build TS_CPP_V2
#!genome-build-accession NCBI_Assembly:GCF_016801865.2
#!annotation-source NCBI RefSeq GCF_016801865.2-RS_2022_12
NC_068937.1 Gnomon gene 2046 110808 . + . gene_id "LOC120427725"; transcript_id ""; db_xref "GeneID:120427725"; description "homeotic protein deformed"; gbkey "Gene"; gene "LOC120427725"; gene_biotype "protein_coding";
NC_068937.1 Gnomon transcript 2046 110808 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "mRNA"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA";
- The `-S` option is very useful here, because it’s quite hard to read the table part when lines are wrapped: each line is an entry and with line-wrapping, it’s hard to see that structure.
4. Pipe the output of `zcat` to another command to print the file’s first 15 lines
zcat annot.gtf.gz | head -n 15
#gtf-version 2.2
#!genome-build TS_CPP_V2
#!genome-build-accession NCBI_Assembly:GCF_016801865.2
#!annotation-source NCBI RefSeq GCF_016801865.2-RS_2022_12
NC_068937.1 Gnomon gene 2046 110808 . + . gene_id "LOC120427725"; transcript_id ""; db_xref "GeneID:120427725"; description "homeotic protein deformed"; gbkey "Gene"; gene "LOC120427725"; gene_biotype "protein_coding";
NC_068937.1 Gnomon transcript 2046 110808 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "mRNA"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA";
NC_068937.1 Gnomon exon 2046 2531 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "1";
NC_068937.1 Gnomon exon 52113 52136 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "2";
NC_068937.1 Gnomon exon 70113 70962 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "3";
NC_068937.1 Gnomon exon 105987 106087 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "4";
NC_068937.1 Gnomon exon 106551 106734 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "5";
NC_068937.1 Gnomon exon 109296 109660 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "6";
NC_068937.1 Gnomon exon 109726 110808 . + . gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "7";
NC_068937.1 Gnomon CDS 70143 70962 . + 0 gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "CDS"; gene "LOC120427725"; product "homeotic protein deformed"; protein_id "XP_052563405.1"; exon_number "3";
NC_068937.1 Gnomon CDS 105987 106087 . + 2 gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "CDS"; gene "LOC120427725"; product "homeotic protein deformed"; protein_id "XP_052563405.1"; exon_number "4";
5. Count the number of lines in the GTF file both in compressed and uncompressed form
Uncompressed count:
zcat annot.gtf.gz | wc -l
408402
Compressed count:
wc -l annot.gtf.gz
22934 annot.gtf.gz
So: no, the counts are not the same! The lesson here is that you should not count lines in a compressed file directly, since that count is not informative.
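The same effect can be seen with a small file whose line count is known in advance. A minimal sketch, using a made-up file `numbers.txt` in a throwaway dir:

```shell
# Work in a temporary scratch dir
scratch=$(mktemp -d)
cd "$scratch"

# Make a 1000-line text file and a gzip-compressed copy of it
seq 1 1000 > numbers.txt
gzip -c numbers.txt > numbers.txt.gz

# Meaningful: decompress on the fly, then count the lines
n_lines=$(zcat numbers.txt.gz | wc -l)
echo "$n_lines"

# Not meaningful: this only counts newline bytes that happen to
# occur in the compressed data
wc -l numbers.txt.gz
```

The first count should be 1000, matching the original file. The second number has no relation to the file's content.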
Footnotes
¹ Because double-stranded DNA is represented as a single sequence, the `strand` column effectively indicates in which direction a feature should be read along this sequence.