Graded Assignment 2:
Unix shell data tools and version control
Week 4 graded assignment – due Sunday Sep 21 at 11:59 pm
Overview
This graded assignment is worth 10 points and is due on Sunday Sep 21st. You will practice with Unix shell commands, especially the “data tools” commands, and with Git commands.
Directions and grading
Submission expectations
- Deadline: Sunday September 21st at 11:59 pm.
- Submission: You don’t need to submit anything or notify the instructors: we will simply look for these files in the appropriate location at OSC. (We will also be able to see when the files were last modified.)
- When to start: We recommend that you only get started with this assignment “for real” after Thursday’s class. Many of the questions revolve around commands you’ll have learned by Tuesday, so you could do a partial practice run after that lecture.
Academic integrity
| Use of generative AI tools (e.g. ChatGPT, Microsoft Copilot, Google Gemini) is not permitted |
| Getting help on the assignment is not permitted |
| Collaborating, or completing the assignment with others, is not permitted |
| Copying or reusing previous work is not permitted |
| Open-book research for the assignment is permitted |
| APA citations and/or formatting for this assignment are not required |
Rubric
You can earn a maximum of 0.5 points for each of the 20 numbered questions/items below. There is a bonus 21st question that can earn you 1 point if you dropped points elsewhere.
Detailed steps
General
Work in VS Code at OSC as per usual.
Use Unix shell and Git commands to answer all questions unless explicitly instructed not to do so.
In the Markdown file with your answers (see below), copy each of the numbered instructions/questions into your document, and insert your answers in between. The code parts of your answers should be entered in Markdown code blocks, and that code’s output should also be part of your answer. For example:
1. List the files in your current working dir

I used the following command:

```bash
ls
```

The output was:

```
file1.txt
file2.txt
annot.gtf
```
Setting up
- Create a new dir for this assignment, `/fs/ess/PAS2880/users/$USER/GA2`, and navigate there. This will be your working dir for the entire assignment. Inside it, also create a dir `data` and a dir `results`.
- In your working dir, create a `README.md` file and open it in the VS Code editor. Use this file throughout the assignment to add your answers.
Part A: Starting a Git repository
1. Add a header to your README file to indicate that this file (and dir) is for your assignment.
2. Initialize a Git repository inside your new directory.
3. Stage and then commit the README file with an appropriate commit message.
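As a hedged illustration of the initialize/stage/commit cycle practiced in this part, the sketch below runs that cycle in a throwaway directory rather than in your assignment dir. The scratch location, the file contents, the commit message, and the placeholder user name and email are all assumptions made for this example only.

```bash
# Work in a throwaway directory so nothing real is touched (hypothetical location)
scratch=$(mktemp -d)
cd "$scratch"

# Start a repository and create a one-line file to track
git init --quiet
echo "# Scratch repository" > README.md

# Placeholder identity, set for this repository only, so committing works anywhere
git config user.name "Example User"
git config user.email "user@example.com"

# Stage the file, then record it with a descriptive message
git add README.md
git commit --quiet -m "Add README with a header"

# Show the history, one line per commit, and store the number of commits
git log --oneline
n_commits=$(git log --oneline | wc -l)
```

Running `git status` before and after each step is a good way to see how a file moves from untracked to staged to committed.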
Part B: Exploring the GTF file with Unix data tools
4. Copy the genome annotation GTF file `GCF_016801865.2.gtf.gz` that is in your copy of the Garrigós et al. (2025) data to a file `data/annot.gtf.gz` (i.e., into the `data` dir you just created).
5. Check the file size of `annot.gtf.gz`. Then, decompress the GTF file with the `gunzip` command below. Finally, check the size of the resulting `annot.gtf` file. How large is the size difference between the compressed and uncompressed files?

       gunzip data/annot.gtf.gz

6. The total number of “genomic features” (genes, exons, etc.) listed in a GTF file is simply the number of lines in its main/table section.
   - Count the total number of lines in `annot.gtf`.
   - Take a look at the contents of the file and count the number of header lines by eye.
   - Based on your knowledge of what the GTF header lines look like, combine `grep -v` and `wc -l` to count the number of genomic features in the file.
7. Does the difference between the two line counts you just performed correspond to the number of header lines you counted visually? If not, can you figure out which lines other than the header lines contain the pattern you used with `grep` above?

   Hint: Review the “Additional `grep` tips” box in the section on `grep` to find an option that might help you make sense of that.
8. How many distinct “sequences” (chromosomes/scaffolds/contigs) are represented in the annotation file? Store a list of these distinct sequences in a file `results/scaffolds.txt`.

   Hint: In this and all following questions, you’ll only want to consider the table part of the file, so you should start your commands with `grep -v` as above.
9. Print a “count table” (like we did towards the end of this section) of feature types in the annotation: this should have a count of the number of times each feature type occurs. (Such a table will for example show how many genes have been annotated in this genome – quite useful!)
10. Check the status of your Git repository. You should see that your README has changed but also that Git has detected the files in the `data` and `results` dirs. You don’t want to commit any files in the `data` and `results` dirs, so create a `.gitignore` file that makes Git ignore them. Check the status of your repository again – is your `.gitignore` file working as intended?
11. Stage and commit the `.gitignore` file.
12. Stage and commit the `README.md` file again.
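To make the “count table” idea from this part concrete, here is a minimal sketch on a made-up, header-less, two-column table. It is not the assignment's GTF file: the file contents, column layout, and variable names are assumptions for illustration, and a real GTF file has header lines and more columns that you would need to handle first.

```bash
# Build a tiny, made-up tab-separated table: sequence name, then feature type
toy=$(mktemp)
printf 'chr1\tgene\nchr1\texon\nchr2\tgene\nchr2\texon\nchr2\texon\n' > "$toy"

# Distinct values in column 1 (cut splits on tabs by default;
# uniq only collapses adjacent duplicates, hence the sort first)
cut -f 1 "$toy" | sort | uniq

# Count table for column 2, most frequent value first
cut -f 2 "$toy" | sort | uniq -c | sort -nr

# Store two summary numbers for later use
n_distinct=$(cut -f 1 "$toy" | sort | uniq | wc -l)
n_exon=$(cut -f 2 "$toy" | grep -c "^exon$")
```

The same `cut | sort | uniq -c` pattern works for any column, so the main decision in practice is which column to select and which lines to exclude beforehand.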
Part C: Exploring the FASTQ files with Unix data tools
13. Copy the FASTQ file `ERR10802863_R1.fastq.gz` of the Garrigós et al. (2025) data to a file of the same name inside the same `data` dir you copied the GTF file into.
14. Check the status of your Git repository. Does the FASTQ file show up? Why/why not?
15. How many reads does the FASTQ file contain?

    Hints:
    - Recall from last week’s exercises that a direct line count of the compressed file will not give you the correct answer.
    - Also, don’t forget to convert the number of lines to the number of reads.
16. How many reads in the FASTQ file contain the sequence `ACGT`? (You can assume this string does not occur outside of the lines with sequences.)
17. How many reads contain at least 10 consecutive `N`s (uncalled bases)? (You can assume that stretches of `N`s do not occur in the sequence quality lines.)
18. Using the read length information in the FASTQ read header lines, print a count table of read lengths.

    Hints:
    - First, select only the read header lines. When doing so, you can assume that `@`s do not occur in other lines.
    - Next, extract the read length numbers in these header lines, which are in the format `length=74`. One way to do this is with a fun trick: by using `=` as the delimiter to `cut` (`-d =`)!
19. Compare the number of reads with at least 10 consecutive `N`s (from two questions back) with the number of 35-bp reads in the count table from the previous question. What does this seem to tell you? To confirm, can you check what a sample of 35-bp reads looks like?

    Hint: To see the sequences in the 35-bp reads, use a `grep` option that allows you to see the line after each match along with the matching line. Also, pipe `grep`’s output to `less` or `head` to avoid having thousands of lines printed to the screen.
20. Stage and commit the `README.md` file again.
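The hints in this part can be tried out safely on a tiny, made-up FASTQ file before moving to the real data. The sketch below assumes a two-read file with `length=` fields in the header lines; the read names, sequences, and variable names are invented for illustration. It uses `gzip -dc`, which streams decompressed content to standard output much like `zcat` does on Linux.

```bash
# Build a tiny, made-up gzipped FASTQ file with two reads (4 lines per read)
fq=$(mktemp)
printf '@read1 length=5\nACGTA\n+\nIIIII\n@read2 length=3\nNNN\n+\n###\n' | gzip > "$fq"

# Count lines in the decompressed stream, then divide by 4 to get the number of reads
n_lines=$(gzip -dc < "$fq" | wc -l)
n_reads=$(( n_lines / 4 ))
echo "$n_reads"

# Keep header lines only, take the part after '=' as the read length, and tabulate
gzip -dc < "$fq" | grep "^@" | cut -d "=" -f 2 | sort -n | uniq -c

# Print each matching header plus the one line after it (-A 1), capped with head
gzip -dc < "$fq" | grep -A 1 "length=3" | head
```

Counting lines in the compressed file itself would count lines of binary data, which is why the decompressed stream is piped into `wc -l` here.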
Bonus
21. Name at least two concepts or commands that we have gone over in the course so far, but which you feel you do not (fully) understand. It helps to be specific about what you struggle with.