Graded Assignment 2:
Unix shell data tools and version control
Week 4 graded assignment – due Sunday Sep 21 at 11:59 pm
Overview
This graded assignment is worth 10 points and is due on Sunday Sep 21st. You will practice with Unix shell commands, especially the “data tools” commands, and with Git commands.
Directions and grading
Submission expectations
- Deadline: Sunday September 21st at 11:59 pm.
- Submission: You don’t need to submit anything or notify the instructors: we will simply look for your assignment files in the appropriate location at OSC. (We will also be able to see when the files were last modified.)
- When to start: We recommend that you only get started with this assignment “for real” after Thursday’s class. Many of the questions revolve around commands you’ll have learned by Tuesday, so you could do a partial practice run after that lecture.
Academic integrity
- Use of generative AI tools (e.g. ChatGPT, Microsoft Copilot, Google Gemini) is not permitted.
- Getting help on the assignment is not permitted.
- Collaborating, or completing the assignment with others, is not permitted.
- Copying or reusing previous work is not permitted.
- Open-book research for the assignment is permitted.
- APA citations and/or formatting for this assignment are not required.
Rubric
You can earn a maximum of 0.5 points for each of the 20 numbered questions/items below. There is a bonus 21st question that can earn you 1 point if you dropped points elsewhere.
Detailed steps
General
Work in VS Code at OSC as per usual.
Use Unix shell and Git commands to answer all questions unless explicitly instructed not to do so.
In the Markdown file with your answers (see below), copy each of the numbered instructions/questions into your document, and insert your answers in between. The code parts of your answers should be entered in Markdown code blocks, and that code’s output should also be part of your answer. For example:
1. List the files in your current working dir

I used the following command:

```bash
ls
```

The output was:

```
file1.txt file2.txt annot.gtf
```
Setting up
Create a new dir for this assignment, `/fs/ess/PAS2880/users/$USER/GA2`, and navigate there. This will be your working dir for the entire assignment. Inside it, also create a dir `data` and a dir `results`.
In your working dir, create a `README.md` file and open it in the VS Code editor. Use this file throughout the assignment to add your answers.
Part A: Starting a Git repository
Add a header to your README file to indicate that this file (and dir) is for your assignment.
Initialize a Git repository inside your new directory.
Add & stage and then commit the README file with an appropriate commit message.
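As a rough illustration of the initialize, stage, and commit sequence, here is a minimal sketch in a throwaway repository. The directory, file contents, placeholder identity (`Demo User`, `demo@example.com`), and commit message are all made up for this example and are not the expected answers.

```bash
# Hypothetical throwaway directory, unrelated to your GA2 dir
demo_repo=$(mktemp -d)
cd "$demo_repo"

# Create a small file to track (made-up contents)
echo "# Demo project" > README.md

# Turn the current directory into a Git repository
git init --quiet

# Stage the file, then commit it; the -c settings supply a placeholder
# identity in case none is configured on this machine
git add README.md
git -c user.name="Demo User" -c user.email="demo@example.com" \
    commit --quiet -m "Add README with project header"

# Show the commit history, one line per commit
git log --oneline
```

Running `git status` before and after each step is a good way to see how a file moves from untracked to staged to committed.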
Part B: Exploring the GTF file with Unix data tools
Copy the genome annotation GTF file `GCF_016801865.2.gtf.gz` that is in your copy of the Garrigós et al. (2025) data to a file `data/annot.gtf.gz` (i.e., into the `data` dir you just created).
Check the file size of `annot.gtf.gz`. Then, decompress the GTF file with the `gunzip` command below. Finally, check the size of the resulting `annot.gtf` file. How large is the size difference between the compressed and uncompressed files?
`gunzip data/annot.gtf.gz`
The total number of “genomic features” (genes, exons, etc.) listed in a GTF file is simply the number of lines in its main/table section.
- Count the total number of lines in `annot.gtf`.
- Take a look at the contents of the file and count the number of header lines by eye.
- Based on your knowledge of what the GTF header lines look like, combine `grep -v` and `wc -l` to count the number of genomic features in the file.
Does the difference between the two line counts you just performed correspond to the number of header lines you counted visually? If not, can you figure out which line(s) other than the header lines contain the pattern you used with `grep` above?
Hint: Review the “Additional `grep` tips” box in the section on `grep` to find an option that might help you make sense of that.
How many distinct “sequences” (chromosomes/scaffolds/contigs) are represented in the annotation file? Store a list of these distinct sequences in a file `results/scaffolds.txt`.
Hint: In this and all following questions, you’ll only want to consider the table part of the file, so you should start your commands with `grep -v` as above.
Print a “count table” (like we did towards the end of this section) of feature types in the annotation: this should have a count of the number of times each feature type occurs. (Such a table will, for example, show how many genes have been annotated in this genome – quite useful!)
Check the status of your Git repository. You should see that your `README` has changed but also that Git has detected the files in the `data` and `results` dirs. You don’t want to commit any files in the `data` and `results` dirs, so create a `.gitignore` file that makes Git ignore them. Check the status of your repository again – is your `.gitignore` file working as intended?
Stage and commit the `.gitignore` file.
Stage and commit the `README.md` file again.
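To illustrate the general shape of a “distinct values” list and a “count table”, here is a minimal sketch on a made-up, two-column, tab-delimited toy file. The file, its column meanings, and the variable names are hypothetical; the real GTF file has header lines and a different column layout, so this is not a drop-in answer.

```bash
# Hypothetical toy table: column 1 = sequence name, column 2 = feature type
toy_file=$(mktemp)
printf 'chrA\tgene\nchrA\texon\nchrA\texon\nchrB\tgene\nchrB\texon\n' > "$toy_file"

# Distinct values in column 1, saved to a file (uniq needs sorted input)
toy_seqs=$(mktemp)
cut -f 1 "$toy_file" | sort | uniq > "$toy_seqs"
n_seqs=$(wc -l < "$toy_seqs")
echo "Number of distinct sequences: $n_seqs"

# Count table of column 2: uniq -c prefixes each distinct value with its count
count_table=$(cut -f 2 "$toy_file" | sort | uniq -c)
echo "$count_table"
```

The `sort` step matters because `uniq` only collapses adjacent identical lines.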
Part C: Exploring the FASTQ files with Unix data tools
Copy the FASTQ file `ERR10802863_R1.fastq.gz` of the Garrigós et al. (2025) data to a file of the same name inside the same `data` dir you copied the GTF file into.
Check the status of your Git repository. Does the FASTQ file show up? Why/why not?
How many reads does the FASTQ file contain?
Hints:
- Recall from last week’s exercises that a direct line count of the compressed file will not give you the correct answer.
- Also, don’t forget to convert the number of lines to the number of reads.
How many reads in the FASTQ file contain the sequence `ACGT`? (You can assume this string does not occur outside of the lines with sequences.)
How many reads contain at least 10 consecutive `N`s (uncalled bases)? (You can assume that stretches of N’s do not occur in the sequence quality lines.)
Using the read length information in the FASTQ read header lines, print a count table of read lengths.
Hints:
- First, select only the read header lines. When doing so, you can assume that `@`s do not occur in other lines.
- Next, extract the read length numbers in these header lines, which are in the format `length=74`. One way to do this is with a fun trick: by using `=` as the delimiter to `cut` (`-d =`)!
Compare the number of 35-bp reads you got in the previous question with the number of reads with at least 10 consecutive `N`s from the question before that. What does this seem to tell you? To confirm, can you check what a sample of 35-bp reads looks like?
Hint: To see the sequences in the 35-bp reads, use a `grep` option that allows you to see the line after each match along with the matching line. Also, pipe `grep`’s output to `less` or `head` to avoid having thousands of lines printed to screen.
Stage and commit the `README.md` file again.
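The two ideas from the hints in this part (decompress before counting lines, and use `=` as a `cut` delimiter) can be sketched on a tiny made-up FASTQ file. The read names, sequences, header layout, and variable names below are hypothetical and only loosely mimic the real file.

```bash
# Hypothetical toy FASTQ file with two reads (four lines per read)
toy_dir=$(mktemp -d)
toy_fastq="$toy_dir/toy.fastq.gz"
printf '@read1 length=4\nGATC\n+\nIIII\n@read2 length=6\nTTAACC\n+\nIIIIII\n' |
    gzip > "$toy_fastq"

# Counting lines in the compressed file directly would not give the number
# of lines of text, so decompress to standard output first
n_lines=$(gunzip -c "$toy_fastq" | wc -l)

# Each FASTQ record spans four lines
n_reads=$(( n_lines / 4 ))
echo "Lines: $n_lines, reads: $n_reads"

# The '=' delimiter trick: field 2 is whatever follows the first '='
read_len=$(echo "length=74" | cut -d = -f 2)
echo "$read_len"
```

Because `gunzip -c` writes to standard output, the compressed file on disk is left untouched.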
Bonus
- Name at least two concepts or commands that we have gone over in the course so far, but which you feel you do not (fully) understand. It helps to be specific about what you struggle with.