Week 3 exercises: Files in the Unix shell

Author: Jelmer Poelstra
Published: September 5, 2025
Modified: September 11, 2025



Introduction

In these exercises, you’ll practice with Unix commands, especially those that operate on files as seen in this week’s classes.

In general, you should use Unix commands for all exercises. E.g., if you are asked to check a file size, don’t do so in the OnDemand file browser!

Setting up

  • Start a VS Code session through OSC OnDemand in the folder /fs/ess/PAS2880/users/$USER.
  • Create and use a Markdown file to keep notes on what you’re doing.
  • You’ll use the Garrigós et al. (2025) data files, which you should have stored as garrigos-data in the above dir.
If you need to (re)download these files:
# Make sure you're in your personal dir:
cd /fs/ess/PAS2880/users/$USER
  
git clone https://github.com/jelmerp/garrigos-data

Prelude: Lecture pages exercises

Make sure to do any exercises on this week’s lecture pages that we did not do in class, through section 3 (Redirection and the pipe) of the third lecture. We will cover sections 4 and 5 next Tuesday, and you won’t need the commands from those sections for the exercises below.

Also, if you struggled with any of the exercises we did do in class, go over those again.

Exercise 1: Miscellaneous

Your starting working dir should be /fs/ess/PAS2880/users/$USER.

  1. Create a dir week03/ex1 within /fs/ess/PAS2880/users/$USER, and navigate there. Use this as the working dir for all following questions in this exercise, so avoid navigating anywhere else.

  2. What is the size of the metadata.tsv file of the Garrigós et al. (2025) data set?

  3. Copy the ref dir of the Garrigós et al. (2025) data set to your current working dir, giving the new copy the name ref_copy.

  4. Run the following command (we did not see this one in class!):

    rmdir ref_copy

    What is this trying to do and why does it fail? (Hint: check the callout box in the rm section of the lecture page)

  5. Move the GTF file in ref_copy to your current working dir, keeping the same file name.

  6. Do you think you will now be able to successfully run rmdir ref_copy? Try it.

Exercise 2: Moving, copying and removing dummy FASTQ files

  1. Create a dir week03/ex2 within /fs/ess/PAS2880/users/$USER, and navigate there. Use this as the working dir for all following questions in this exercise, so avoid navigating anywhere else.

  2. Create a new dir seqs, and use the below command to create some dummy FASTQ files in there. Then, check the result with ls and use a command to get a count of the number of files you created.

    # [This command uses a trick called 'brace expansion' to create many files at once.]
    # [You don't need to understand how this works.]
    touch seqs/sample{01..25}_R{1,2}.fastq
  3. Copy the dummy FASTQ files for samples 01 through 09 into your working dir with a single command.

  4. Remove the files you just copied and when doing so, make the command report what it’s doing.

  5. Move all R1 files into a dir seqs_R1, and then rename seqs, which should now only contain R2 files, to seqs_R2.
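
For the curious: one way to preview what brace expansion produces is to pass the same kind of pattern to echo, which only prints the generated words. The sketch below uses a shorter range than the exercise and assumes a reasonably recent Bash (version 4 or later) for the zero-padded {01..03} range. The variable names are just for illustration.

```shell
# echo prints the words that the shell generates from the braces;
# no files are created here
expanded=$(echo sample{01..03}_R{1,2}.fastq)
echo "$expanded"

# Count the generated names: 3 samples x 2 reads each
n_names=$(echo "$expanded" | wc -w)
echo "Number of names: $n_names"
```

Swapping echo for touch would create files with these names instead of printing them.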

Exercise 3: Nuances of cp and mv behavior

  1. Create a dir week03/ex3 within /fs/ess/PAS2880/users/$USER, and navigate there.
  2. Create a new file called newfile.txt in your working dir.

For both cp and mv, when operating on files (and this works equivalently for directories):

  • If the destination path is an existing dir, the file will go into that dir and keep its original name.
  • If the destination path is not an existing dir, the (last bit of the) destination specifies the new file name.
  • A trailing slash in the destination path makes explicit that you are referring to a dir and not a file.

With that in mind, try to answer the below questions about the following command:

cp newfile.txt more_data/
  1. What do you think the command would do or attempt to do?
  2. Do you think the command will succeed? Run the command and see whether you were right.
  3. What would the command have done if you had omitted the trailing forward slash to more_data/?

Exercise 4: Exploring the GTF annotation file

GTF file info

This exercise will work with the GTF file in the example dataset, so we’ll start with some more info about GTF files.

As mentioned, the file typically starts with header lines that contain metadata and start with #. Then, the main part of a GTF file is essentially a table, with:

  • One row for each annotated “genomic feature” (gene, exon, intron, etc.)
  • Columns with information like the genomic coordinates of the features

Here is a snippet of the table part of our Culex pipiens GTF file, with the same manually added column names that were shown in the lecture:

# [Column names are not present in the file! They are shown for informational purposes.]
seqname       source  feature     start   end     score  strand  frame    attributes
NC_068937.1   Gnomon  gene        2046    110808  .       +       .       gene_id "LOC120427725"; transcript_id ""; db_xref "GeneID:120427725"; description "homeotic protein deformed"; gbkey "Gene"; gene "LOC120427725"; gene_biotype "protein_coding"; 
NC_068937.1   Gnomon  transcript  2046    110808  .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "mRNA"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; 
NC_068937.1   Gnomon  exon        2046    2531    .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "1"; 
NC_068937.1   Gnomon  exon        52113   52136   .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "2"; 

Some details on what’s in each column:

Column #  Column ID   Explanation
1         seqname     Name of the chromosome, scaffold, or contig
2         source      Name of the program that generated this feature, or the data source (e.g. database)
3         feature     Name of the genomic feature type, e.g. “gene”, “exon”, “intron”, “CDS”
4         start       Start position of the feature
5         end         End position of the feature
6         score       A confidence score for the feature, often absent (in which case it is .)
7         strand      Whether the feature is on the + (forward) or - (reverse) strand (see footnote 1)
8         frame       The “frame” for coding features, either 0, 1 or 2 (or . for no frame). E.g. 0 means the first base of the feature is the first base of a codon
9         attributes  A semicolon-separated list of key-value pairs with additional information. Can be very wide!

Your turn!

  1. Create a dir week03/ex4 within /fs/ess/PAS2880/users/$USER, and navigate there.

  2. Copy the genome annotation GTF file garrigos-data/ref/GCF_016801865.2.gtf.gz to a file called annot.gtf.gz inside your ex4 working dir.

  3. Explore annot.gtf.gz with less -S. Just look around a bit and try to navigate with the keybindings shown in the lecture, then consider the following questions:

    • How many header lines does the file have?
    • Do you understand the file’s structure?
    • Is the -S option useful here?
  4. zcat is a command that is essentially identical to cat, but can print compressed files in human-readable form. Pipe the output of zcat to another command to print only the first 15 lines of annot.gtf.gz.

  5. Count the number of lines in the GTF file both in compressed and uncompressed form. Are the counts the same?

    Hint:

    Use the zcat-then-pipe trick from the previous question to get the “uncompressed count”. The “compressed count”, on the other hand, is simply the number of lines that wc -l reports when run directly on annot.gtf.gz.

Solutions

Exercise 1

1. Create a new dir and navigate there

This assumes you start in /fs/ess/PAS2880/users/$USER (the -p option makes mkdir also create week03 if that dir does not exist yet):

mkdir -p week03/ex1
cd week03/ex1

Or, with absolute paths:

mkdir -p /fs/ess/PAS2880/users/$USER/week03/ex1
cd /fs/ess/PAS2880/users/$USER/week03/ex1
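
Plain mkdir only creates the last dir in a path, so creating week03/ex1 in one step requires either that week03 already exists or that you use mkdir's -p option. Here is a minimal sketch of the difference, run in a throwaway dir created with mktemp (the scratch variable and the status variables are made up for illustration):

```shell
# Work in a throwaway dir so nothing in your course folder is touched
scratch=$(mktemp -d)

# Without -p, mkdir fails because the parent dir 'week03' does not exist yet
mkdir "$scratch/week03/ex1" 2>/dev/null
plain_status=$?            # a non-zero exit status means the command failed

# With -p, missing parent dirs are created along the way
mkdir -p "$scratch/week03/ex1"
p_status=$?

created=$(ls "$scratch/week03")
echo "mkdir exit status: $plain_status; mkdir -p exit status: $p_status"
echo "week03 now contains: $created"

# Clean up the throwaway dir
rm -r "$scratch"
```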
2. What is the size of the metadata.tsv file?

It is 633 bytes:

ls -lh ../../garrigos-data/meta/metadata.tsv
-rw-rw----+ 1 jelmer PAS0471 633 Sep  7 09:05 ../../garrigos-data/meta/metadata.tsv
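
In ls -l output, the file size is the fifth column. For a file this small, -h makes no visible difference; for larger files, it switches to units like K, M and G. The sketch below shows this on a made-up 12-byte file in a throwaway dir, and cross-checks the size with wc -c, which counts bytes (an extra that these exercises don't require):

```shell
scratch=$(mktemp -d)

# Create a file of exactly 12 bytes: 11 characters plus a newline
printf 'hello world\n' > "$scratch/tiny.txt"

ls -l "$scratch/tiny.txt"      # size in bytes is the 5th column
ls -lh "$scratch/tiny.txt"     # same, but human-readable units for larger files

# Cross-check: wc -c counts the bytes in its input
n_bytes=$(wc -c < "$scratch/tiny.txt")
echo "Bytes: $n_bytes"

rm -r "$scratch"
```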
3. Copy the ref dir to ref_copy

Because you are copying a dir, you will need the -r (recursive) option:

cp -r ../../garrigos-data/ref ref_copy
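
To see why -r is needed, you could compare cp with and without it on a small made-up dir. This is only a sketch in a throwaway location; the file names are invented and the exact error message differs between systems, so it is hidden here and only the exit status is kept:

```shell
scratch=$(mktemp -d)
mkdir "$scratch/ref"
touch "$scratch/ref/genome.fna" "$scratch/ref/annot.gtf"   # made-up file names

# Without -r, cp refuses to copy a dir and returns a non-zero exit status
cp "$scratch/ref" "$scratch/ref_copy" 2>/dev/null
no_r_status=$?

# With -r, the dir and everything in it is copied under the new name
cp -r "$scratch/ref" "$scratch/ref_copy"
copied=$(ls "$scratch/ref_copy")

echo "cp without -r: exit status $no_r_status"
echo "ref_copy contains:"
echo "$copied"

rm -r "$scratch"
```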
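4. Run the rmdir command

The command tries to remove the ref_copy dir, but rmdir only removes empty dirs. Because ref_copy still contains files, it fails:

rmdir ref_copy
rmdir: failed to remove 'ref_copy': Directory not empty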
5. Move the GTF file in ref_copy to your current working dir

To keep the same filename, remember to simply use the . shortcut for the current working dir as the destination:

mv ref_copy/GCF_016801865.2.gtf.gz .
6. Do you think you will now be able to successfully run rmdir ref_copy?

Yes, because the ref_copy dir is now empty, this will work:

# [No output means it succeeds!]
rmdir ref_copy

Let’s check – indeed, the ref_copy dir is no longer there:

ls
GCF_016801865.2.gtf.gz
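
The whole rmdir sequence from questions 4 to 6 can be replayed safely in a throwaway dir. This is a minimal sketch with a made-up file name, where the exit status of each rmdir call is stored to show which one fails:

```shell
scratch=$(mktemp -d)
mkdir "$scratch/ref_copy"
touch "$scratch/ref_copy/annot.gtf"      # made-up file name

# rmdir only removes empty dirs, so this fails
rmdir "$scratch/ref_copy" 2>/dev/null
status_full=$?

# After moving the only file out, the dir is empty and rmdir succeeds
mv "$scratch/ref_copy/annot.gtf" "$scratch"
rmdir "$scratch/ref_copy"
status_empty=$?

echo "non-empty dir: exit status $status_full; empty dir: exit status $status_empty"

rm -r "$scratch"
```

This is also why rmdir is a relatively safe way to remove dirs: it cannot delete files by accident.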

Exercise 2

1. Create a new dir and navigate there

Assuming you’re doing this right after exercise 1, you should be in /fs/ess/PAS2880/users/$USER/week03/ex1, in which case the following commands work:

mkdir ../ex2
cd ../ex2

Or, with absolute paths:

mkdir /fs/ess/PAS2880/users/$USER/week03/ex2
cd /fs/ess/PAS2880/users/$USER/week03/ex2
2. Create a dir with dummy sequence files
  • Create the dir:

    mkdir seqs
  • Create the dummy files:

    touch seqs/sample{01..25}_R{1,2}.fastq

    This uses a construct called brace expansion, which we won’t cover in class.

  • Check the result:

    ls seqs
    sample01_R1.fastq  sample03_R1.fastq  sample05_R1.fastq  sample07_R1.fastq  sample09_R1.fastq  sample11_R1.fastq  sample13_R1.fastq  sample15_R1.fastq  sample17_R1.fastq  sample19_R1.fastq  sample21_R1.fastq  sample23_R1.fastq  sample25_R1.fastq
    sample01_R2.fastq  sample03_R2.fastq  sample05_R2.fastq  sample07_R2.fastq  sample09_R2.fastq  sample11_R2.fastq  sample13_R2.fastq  sample15_R2.fastq  sample17_R2.fastq  sample19_R2.fastq  sample21_R2.fastq  sample23_R2.fastq  sample25_R2.fastq
    sample02_R1.fastq  sample04_R1.fastq  sample06_R1.fastq  sample08_R1.fastq  sample10_R1.fastq  sample12_R1.fastq  sample14_R1.fastq  sample16_R1.fastq  sample18_R1.fastq  sample20_R1.fastq  sample22_R1.fastq  sample24_R1.fastq
    sample02_R2.fastq  sample04_R2.fastq  sample06_R2.fastq  sample08_R2.fastq  sample10_R2.fastq  sample12_R2.fastq  sample14_R2.fastq  sample16_R2.fastq  sample18_R2.fastq  sample20_R2.fastq  sample22_R2.fastq  sample24_R2.fastq
  • Count the number of files:

    ls seqs | wc -l
    50

    Note that this gives the correct count even though, when printed to screen, ls output typically uses multiple columns and therefore shows more than one file per line. When its output is piped to another command, ls prints one entry per line.

    In contrast, the count from ls -l is not correct, since its output includes an extra first line (starting with total):

    ls -l seqs | wc -l
    51
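
The difference between the two counts can be checked on a small scale. The sketch below uses three made-up files in a throwaway dir and also prints the extra first line that ls -l adds when listing a dir:

```shell
scratch=$(mktemp -d)
touch "$scratch/a.txt" "$scratch/b.txt" "$scratch/c.txt"

# When its output is piped, ls prints one entry per line
n_plain=$(ls "$scratch" | wc -l)

# ls -l adds a 'total ...' line at the top when listing a dir
first_line=$(ls -l "$scratch" | head -n 1)
n_long=$(ls -l "$scratch" | wc -l)

echo "ls: $n_plain lines; ls -l: $n_long lines"
echo "First line of ls -l output: $first_line"

rm -r "$scratch"
```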
3. Copy the files for samples 01 through 09 into your working dir with a single command
  • Copy the files – remember to use . to indicate your current working dir as the destination path.

    cp seqs/sample0* .
  • Let’s check the result:

    ls
    sample01_R1.fastq  sample02_R1.fastq  sample03_R1.fastq  sample04_R1.fastq  sample05_R1.fastq  sample06_R1.fastq  sample07_R1.fastq  sample08_R1.fastq  sample09_R1.fastq  seqs
    sample01_R2.fastq  sample02_R2.fastq  sample03_R2.fastq  sample04_R2.fastq  sample05_R2.fastq  sample06_R2.fastq  sample07_R2.fastq  sample08_R2.fastq  sample09_R2.fastq
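
Before copying (or removing!) files with a wildcard, it can help to preview which files the pattern matches by passing it to ls first. Here is a small sketch with made-up samples 08 through 11 in a throwaway dir, assuming Bash 4 or later for the zero-padded brace range:

```shell
scratch=$(mktemp -d)
mkdir "$scratch/seqs"
touch "$scratch"/seqs/sample{08..11}_R{1,2}.fastq

# The shell expands the wildcard into matching file names before ls runs,
# so this shows exactly what cp or rm would receive with the same pattern
ls "$scratch"/seqs/sample0*

n_matched=$(ls "$scratch"/seqs/sample0* | wc -l)
echo "Matched: $n_matched files"

rm -r "$scratch"
```

Only the files for samples 08 and 09 match, because sample10 and sample11 do not start with sample0.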
4. Remove the files you just copied
rm -v sample0*
removed 'sample01_R1.fastq'
removed 'sample01_R2.fastq'
removed 'sample02_R1.fastq'
removed 'sample02_R2.fastq'
removed 'sample03_R1.fastq'
removed 'sample03_R2.fastq'
removed 'sample04_R1.fastq'
removed 'sample04_R2.fastq'
removed 'sample05_R1.fastq'
removed 'sample05_R2.fastq'
removed 'sample06_R1.fastq'
removed 'sample06_R2.fastq'
removed 'sample07_R1.fastq'
removed 'sample07_R2.fastq'
removed 'sample08_R1.fastq'
removed 'sample08_R2.fastq'
removed 'sample09_R1.fastq'
removed 'sample09_R2.fastq'
5. Move all R1 files into a dir seqs_R1 and rename seqs
  • Create the new dir:

    mkdir seqs_R1
  • Move the R1 files into the new dir:

    mv seqs/*R1.fastq seqs_R1
  • Rename the old dir from seqs to seqs_R2:

    mv seqs seqs_R2
  • Check the result:

    tree -C
    .
    ├── seqs_R1
    │   ├── sample01_R1.fastq
    │   ├── sample02_R1.fastq
    │   ├── sample03_R1.fastq
    │   ├── sample04_R1.fastq
    │   ├── sample05_R1.fastq
    │   ├── sample06_R1.fastq
    │   ├── sample07_R1.fastq
    │   ├── sample08_R1.fastq
    │   ├── sample09_R1.fastq
    │   ├── sample10_R1.fastq
    │   ├── sample11_R1.fastq
    │   ├── sample12_R1.fastq
    │   ├── sample13_R1.fastq
    │   ├── sample14_R1.fastq
    │   ├── sample15_R1.fastq
    │   ├── sample16_R1.fastq
    │   ├── sample17_R1.fastq
    │   ├── sample18_R1.fastq
    │   ├── sample19_R1.fastq
    │   ├── sample20_R1.fastq
    │   ├── sample21_R1.fastq
    │   ├── sample22_R1.fastq
    │   ├── sample23_R1.fastq
    │   ├── sample24_R1.fastq
    │   └── sample25_R1.fastq
    └── seqs_R2
        ├── sample01_R2.fastq
        ├── sample02_R2.fastq
        ├── sample03_R2.fastq
        ├── sample04_R2.fastq
        ├── sample05_R2.fastq
        ├── sample06_R2.fastq
        ├── sample07_R2.fastq
        ├── sample08_R2.fastq
        ├── sample09_R2.fastq
        ├── sample10_R2.fastq
        ├── sample11_R2.fastq
        ├── sample12_R2.fastq
        ├── sample13_R2.fastq
        ├── sample14_R2.fastq
        ├── sample15_R2.fastq
        ├── sample16_R2.fastq
        ├── sample17_R2.fastq
        ├── sample18_R2.fastq
        ├── sample19_R2.fastq
        ├── sample20_R2.fastq
        ├── sample21_R2.fastq
        ├── sample22_R2.fastq
        ├── sample23_R2.fastq
        ├── sample24_R2.fastq
        └── sample25_R2.fastq
    
    2 directories, 50 files
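
If tree prints more than you want to read, counting the files per dir is a lighter-weight check. The sketch below replays the steps of this question in a throwaway dir (so it does not touch your ex2 dir) and then counts; the variable names are made up for illustration:

```shell
scratch=$(mktemp -d)
cd "$scratch"
mkdir seqs
touch seqs/sample{01..25}_R{1,2}.fastq

mkdir seqs_R1
mv seqs/*R1.fastq seqs_R1   # several sources: the last argument is the destination dir
mv seqs seqs_R2             # seqs_R2 does not exist yet, so this renames the dir

n_r1=$(ls seqs_R1 | wc -l)
n_r2=$(ls seqs_R2 | wc -l)
echo "seqs_R1: $n_r1 files; seqs_R2: $n_r2 files"

# Go back to where we started and clean up
cd "$OLDPWD"
rm -r "$scratch"
```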

Exercise 3

1. Create a new dir and navigate there

Assuming you’re doing this right after exercise 2, you should be in /fs/ess/PAS2880/users/$USER/week03/ex2, in which case the following commands work:

mkdir ../ex3
cd ../ex3

Or, with absolute paths:

mkdir /fs/ess/PAS2880/users/$USER/week03/ex3
cd /fs/ess/PAS2880/users/$USER/week03/ex3
2. Create a new file called newfile.txt in your working dir
touch newfile.txt
1. What do you think the command would do or attempt to do?

Because we added a trailing forward slash to more_data/, we are making clear that we are referring to a directory. So the command will attempt to copy the file into a dir more_data, with the file keeping the same name.

2. Do you think the command will succeed?

The more_data/ dir does not exist, and cp will not create a dir on the fly – so it will fail:

cp newfile.txt more_data/
cp: cannot create regular file ‘more_data/’: Not a directory
3. What would the command have done if you had omitted the trailing forward slash?

If you had omitted the trailing forward slash, it would have created a copy of the file with file name more_data (if that seems confusing: note that in Unix, files don’t need to have a file extension).

P.S.: To copy the file according to the original intention, first create the destination dir and then copy:

mkdir more_data
cp newfile.txt more_data/

Note also that once the more_data dir exists, it does not make a difference whether or not you use a trailing slash.
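
The three destination rules from this exercise can be tried side by side in a throwaway dir. This is only a sketch: existing_dir is a made-up name, error messages are hidden because their wording differs between systems, and [ -f ... ] is a Bash test for "is a regular file" that we have not covered in class.

```shell
scratch=$(mktemp -d)
cd "$scratch"
touch newfile.txt

# 1. Destination has a trailing slash but the dir does not exist: cp fails
cp newfile.txt more_data/ 2>/dev/null
status_slash=$?

# 2. Same destination without the slash: a regular file named 'more_data' is created
cp newfile.txt more_data
kind_noslash="not a file"
[ -f more_data ] && kind_noslash="file"

# 3. Destination is an existing dir: the copy goes inside and keeps its name
mkdir existing_dir
cp newfile.txt existing_dir
inside=$(ls existing_dir)

echo "Trailing slash, no such dir: exit status $status_slash"
echo "No slash, no such dir: more_data is a $kind_noslash"
echo "Existing dir now contains: $inside"

# Go back to where we started and clean up
cd "$OLDPWD"
rm -r "$scratch"
```

Adding the trailing slash is a cheap safeguard: if the dir you meant does not exist, you get an error instead of an unexpectedly named file.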

Exercise 4

1. Create a new dir and navigate there

Assuming you’re doing this right after exercise 3, you should be in /fs/ess/PAS2880/users/$USER/week03/ex3, in which case the following commands work:

mkdir ../ex4
cd ../ex4

Or, with absolute paths:

mkdir /fs/ess/PAS2880/users/$USER/week03/ex4
cd /fs/ess/PAS2880/users/$USER/week03/ex4
2. Copy the genome annotation file
cp ../../garrigos-data/ref/GCF_016801865.2.gtf.gz annot.gtf.gz
3. Explore annot.gtf.gz with less -S
  • The file has 4 header lines:
less -S annot.gtf.gz
#gtf-version 2.2
#!genome-build TS_CPP_V2
#!genome-build-accession NCBI_Assembly:GCF_016801865.2
#!annotation-source NCBI RefSeq GCF_016801865.2-RS_2022_12
NC_068937.1     Gnomon  gene    2046    110808  .       +       .       gene_id "LOC120427725"; transcript_id ""; db_xref "GeneID:120427725"; description "homeotic protein deformed"; gbkey "Gene"; gene "LOC120427725"; gene_biotype "protein_coding"; 
NC_068937.1     Gnomon  transcript      2046    110808  .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "mRNA"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA";
  • The -S option is very useful here, because it’s quite hard to read the table part when lines are wrapped: each line is an entry and with line-wrapping, it’s hard to see that structure.
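
Counting header lines by eye works for a handful of lines. As an optional aside that goes slightly beyond this week's material, one possible way to count them with a command is grep -c with the pattern '^#', which counts the lines that start with #. The sketch below runs this on a tiny, made-up GTF-like file rather than on the real annotation, and assumes a Linux system where zcat reads .gz files:

```shell
scratch=$(mktemp -d)

# Build a tiny, made-up GTF-like file: 2 header lines and 2 tab-separated entries
printf '#gtf-version 2.2\n#!made-up-header\nchr1\tsrc\tgene\t1\t100\nchr1\tsrc\texon\t1\t50\n' |
    gzip > "$scratch/mini.gtf.gz"

# Decompress to the screen, then count the lines that start with '#'
n_header=$(zcat "$scratch/mini.gtf.gz" | grep -c '^#')
echo "Header lines: $n_header"

rm -r "$scratch"
```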
4. Pipe the output of zcat to another command to print the file’s first 15 lines
zcat annot.gtf.gz | head -n 15
#gtf-version 2.2
#!genome-build TS_CPP_V2
#!genome-build-accession NCBI_Assembly:GCF_016801865.2
#!annotation-source NCBI RefSeq GCF_016801865.2-RS_2022_12
NC_068937.1     Gnomon  gene    2046    110808  .       +       .       gene_id "LOC120427725"; transcript_id ""; db_xref "GeneID:120427725"; description "homeotic protein deformed"; gbkey "Gene"; gene "LOC120427725"; gene_biotype "protein_coding"; 
NC_068937.1     Gnomon  transcript      2046    110808  .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "mRNA"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; 
NC_068937.1     Gnomon  exon    2046    2531    .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "1"; 
NC_068937.1     Gnomon  exon    52113   52136   .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "2"; 
NC_068937.1     Gnomon  exon    70113   70962   .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "3"; 
NC_068937.1     Gnomon  exon    105987  106087  .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "4"; 
NC_068937.1     Gnomon  exon    106551  106734  .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "5"; 
NC_068937.1     Gnomon  exon    109296  109660  .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "6"; 
NC_068937.1     Gnomon  exon    109726  110808  .       +       .       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gene "LOC120427725"; model_evidence "Supporting evidence includes similarity to: 25 Proteins"; product "homeotic protein deformed, transcript variant X3"; transcript_biotype "mRNA"; exon_number "7"; 
NC_068937.1     Gnomon  CDS     70143   70962   .       +       0       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "CDS"; gene "LOC120427725"; product "homeotic protein deformed"; protein_id "XP_052563405.1"; exon_number "3"; 
NC_068937.1     Gnomon  CDS     105987  106087  .       +       2       gene_id "LOC120427725"; transcript_id "XM_052707445.1"; db_xref "GeneID:120427725"; gbkey "CDS"; gene "LOC120427725"; product "homeotic protein deformed"; protein_id "XP_052563405.1"; exon_number "4";
5. Count the number of lines in the GTF file both in compressed and uncompressed form
  • Uncompressed count:

    zcat annot.gtf.gz | wc -l
    408402
  • Compressed count:

    wc -l annot.gtf.gz
    22934 annot.gtf.gz

So: no, the counts are not the same! The lesson here is that line counts of a compressed file are not informative: always decompress (e.g., with zcat) before counting lines.
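
The same effect can be reproduced with a file whose line count is known in advance. Here is a minimal sketch with a made-up 1000-line file in a throwaway dir, again assuming a Linux system where zcat reads .gz files:

```shell
scratch=$(mktemp -d)

# A made-up text file with exactly 1000 lines, plus a gzip-compressed copy
seq 1 1000 > "$scratch/numbers.txt"
gzip -c "$scratch/numbers.txt" > "$scratch/numbers.txt.gz"

n_plain=$(wc -l < "$scratch/numbers.txt")
n_zcat=$(zcat "$scratch/numbers.txt.gz" | wc -l)   # decompress first, then count
n_raw=$(wc -l < "$scratch/numbers.txt.gz")         # counts newline bytes in binary data

echo "plain: $n_plain; via zcat: $n_zcat; compressed file directly: $n_raw"

rm -r "$scratch"
```

The count via zcat matches the original file, whereas the count on the compressed file is just the number of newline bytes that happen to occur in the compressed data.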


References

Garrigós, Marta, Guillem Ylla, Josué Martínez-de la Puente, Jordi Figuerola, and María José Ruiz-López. 2025. “Two Avian Plasmodium Species Trigger Different Transcriptional Responses on Their Vector Culex pipiens.” Molecular Ecology 34 (15): e17240. https://doi.org/10.1111/mec.17240.

Footnotes

  1. Because double-stranded DNA is represented as a single sequence, the strand column effectively indicates in which direction a feature should be read along this sequence.