Graded Assignment 4: Slurm batch jobs
Overview
This graded assignment is worth 10 points and is due on Monday Oct 20th. You will practice primarily with Slurm batch jobs.
This assignment assumes that you have done the week 7 exercises, which introduced and used TrimGalore. If you didn’t do these already, we strongly recommend you do so before attempting this assignment.
Directions and grading
Submission expectations
- Deadline: Monday Oct 20th at noon (12:00 pm). (You are being given a bit of extra time because of fall break.)
- Submission: You will submit your assignment by tagging the instructors in an Issue in your GitHub repository, as described in the last step below.
Academic integrity
For this assignment, you are allowed to search the internet but not to use generative AI (so you cannot use, e.g., Google AI Mode output either).
If you use commands or code constructs we did not cover in class, you must provide the source: webpage or book page, etc. You may also argue that you already knew the command in question from previous self-study, but should then be prepared to answer live questions about it over Zoom. If you don’t provide a source or otherwise explain your usage of such commands, your answer will be considered wrong and not give you any points.
- Use of generative AI Tools (e.g. ChatGPT, Microsoft Copilot, Google Gemini) is not permitted.
- Getting help on the assignment is not permitted.
- Collaborating, or completing the assignment with others, is not permitted.
- Copying or reusing previous work is not permitted.
- Open-book research for the assignment is permitted.
- APA Citations and/or formatting for this assignment are not required.
Rubric
You can earn a total of 10 points: one point each for questions 6-14 and the Git/GitHub part. The Bonus question can earn you up to an additional point if you lost points elsewhere in the assignment.
Detailed steps
Part A: Setting up & Git
1. Start a VS Code session at OSC in the folder /fs/ess/PAS2880/users/$USER. Create a new dir for this assignment, /fs/ess/PAS2880/users/$USER/GA4, and switch to that folder in VS Code using the “Open Folder” option. This should be your working dir for the entire assignment.
2. Initialize a Git repository. Commit to the repo throughout the assignment as you see fit, but at least once for each “Part” of this assignment. Use and commit a .gitignore file as appropriate. (Tip: Many of you would do well to recall the feedback you got on your Git repo from the previous assignment!)
3. Inside dir GA4, create a README.md file and open it in the VS Code editor. Use this file throughout the assignment to add your answers where appropriate.
4. Inside dir GA4, also create dirs scripts and results. We’ll make your dir self-contained again (like a typical situation for a research project) by copying some data into it:

    # (This command will work as intended if you *don't* have a data dir yet)
    cp -rv ../garrigos-data/fastq data

5. As a starting script, store the code below in a shell script scripts/trimgalore.sh. This is a script to run TrimGalore, which should be very similar to your final script from last week’s exercises.

    #!/bin/bash
    set -euo pipefail

    # Constants
    TRIMGALORE_CONTAINER=oras://community.wave.seqera.io/library/trim-galore:0.6.10--bc38c9238980c80e

    # Copy the placeholder variables
    R1="$1"
    R2="$2"
    outdir="$3"

    # Report
    echo "# Starting script trimgalore.sh"
    date
    echo "# Input R1 FASTQ file: $R1"
    echo "# Input R2 FASTQ file: $R2"
    echo "# Output dir: $outdir"
    echo

    # Create the output dir
    mkdir -p "$outdir"

    # Run TrimGalore
    apptainer exec "$TRIMGALORE_CONTAINER" \
        trim_galore \
        --paired \
        --fastqc \
        --output_dir "$outdir" \
        "$R1" \
        "$R2"

    # Report
    echo
    echo "# TrimGalore version:"
    apptainer exec "$TRIMGALORE_CONTAINER" \
        trim_galore -v
    echo "# Successfully finished script trimgalore.sh"
    date
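The note on the cp command above reflects how cp -r treats a destination that already exists. The sketch below illustrates this in a throwaway directory; the source/fastq dir and the sampleA file name are hypothetical stand-ins, not the assignment's actual data.

```shell
#!/bin/bash
set -euo pipefail

# Work in a throwaway directory so nothing real is touched
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Hypothetical stand-in for the source FASTQ dir
mkdir -p source/fastq
touch source/fastq/sampleA_R1.fastq.gz

# Case 1: 'data' does not exist yet, so 'fastq' is copied *as* 'data'
cp -r source/fastq data
ls data

# Case 2: 'data' already exists, so 'fastq' is copied *into* 'data'
cp -r source/fastq data
ls data
```

In the second case, the files end up one level deeper (data/fastq/) than intended, which is why the command only works as intended when no data dir exists yet.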
Part B: A TrimGalore batch job
You’ll start with the TrimGalore script you just created, which should be much like that from last week’s exercises. But this time, instead of running the TrimGalore script “directly” with bash, you will submit it as a batch job. Then, in the next section, you will submit many batch jobs at the same time: one for each sample.
6. Add Sbatch options to the top of the TrimGalore shell script to specify:
- The account/project you want to use
- The number of cores you want to reserve: use 8
- The amount of time you want to reserve: use 30 minutes
- The desired file name of the Slurm log file
- That Slurm should email you upon job failure
- Optional: you can try other Sbatch options you’d like to test
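As a rough illustration of the general shape such a header can take, here is a minimal sketch using standard sbatch options. The account name is an assumption inferred from the project path used in Part A, and the log file name is hypothetical; the options covered in class may differ, so adapt rather than copy.

```shell
#!/bin/bash
# Assumed account, inferred from the /fs/ess/PAS2880 path in Part A
#SBATCH --account=PAS2880
#SBATCH --cpus-per-task=8
#SBATCH --time=00:30:00
# Hypothetical log file name; %j expands to the job ID
#SBATCH --output=slurm-trimgalore-%j.out
#SBATCH --mail-type=FAIL
set -euo pipefail

# Slurm sets this variable inside a batch job when --cpus-per-task is given;
# fall back to 1 when the script runs outside of Slurm
n_cores="${SLURM_CPUS_PER_TASK:-1}"
echo "# Number of cores available: $n_cores"
```

Because lines starting with #SBATCH are ordinary comments to bash, a script like this still runs when called directly with bash; the options only take effect when it is submitted with sbatch.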
7. By printing and scanning through the TrimGalore help info once again (see last week’s exercises), find the TrimGalore option that specifies how many cores it can use – add the relevant line(s) from the TrimGalore help info to your README.md. In the script, change the trim_galore command accordingly to use the available number of cores.
8. To test the script and batch job submission, submit the script as a batch job only for sample ERR10802863.
9. Monitor the job, and when it’s done, check that everything went well (if it didn’t, redo until you get it right). In your README.md, explain your monitoring and checking process. Then, remove all outputs (Slurm log files and TrimGalore output files) produced by this test-run.
10. Illumina sequencing uses colors to distinguish between nucleotides as they are being added during the sequencing-by-synthesis process. However, newer Illumina machines (NextSeq and NovaSeq) use a different color chemistry than older ones, and this newer chemistry suffers from an artefact that can erroneously produce strings of Gs (“poly-G”) with high quality scores. Those Gs should really be Ns instead, and occur especially at the end of reverse (R2) reads.

    In the FastQC outputs for the R2 file that you just produced with TrimGalore (recall that it runs FastQC after trimming!), do you see any evidence for this problem? Explain.
11. We’ll assume that the data was indeed produced with the newer Illumina color chemistry. In the TrimGalore help info, find the relevant TrimGalore option to deal with the poly-G problem, and again add the relevant line(s) from the help info to your README.md. Then, use the TrimGalore option you found, but don’t change the quality score threshold from the default.
12. Rerun TrimGalore with the added color-chemistry option. Check all outputs and confirm that usage of this option made a difference. Then, remove all outputs produced by this test-run again.
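For the monitoring and checking steps above, one possible approach is to list your jobs with squeue while the job runs and, once it is done, look for the script's final report line in the Slurm log. The sketch below is only an illustration: it writes a mock log file with a hypothetical name to a throwaway dir, and it skips the squeue call when Slurm is not available.

```shell
#!/bin/bash
set -euo pipefail

# Stand-in for a real Slurm log, written to a throwaway dir (hypothetical name)
log_dir=$(mktemp -d)
log_file="$log_dir/slurm-trimgalore-12345.out"
printf '%s\n' "# Starting script trimgalore.sh" \
    "# Successfully finished script trimgalore.sh" > "$log_file"

# On the cluster, you could first list your queued and running jobs
if command -v squeue > /dev/null 2>&1; then
    squeue -u "$(id -un)" || true
fi

# The script prints this line near its end, and 'set -e' aborts the script
# on errors, so the line's presence suggests that the run completed
if grep -q "Successfully finished script" "$log_file"; then
    job_status="finished"
else
    job_status="incomplete"
fi
echo "# Log check result: $job_status"
```

A log check like this complements, but does not replace, looking at the TrimGalore output files themselves.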
Bonus: Modify the script to rename the output FASTQ files
If you choose not to do this Bonus part, you can simply move on to Part C.
The TrimGalore output FASTQ files are oddly named, ending in _R1_val_1.fq.gz and _R2_val_2.fq.gz – check the output files from your initial run to see this. This is not necessarily a problem, but could trip you up in a later step with these files.
Therefore, modify your TrimGalore script to rename the output files after running TrimGalore, giving them the same names as the input files. Then, rerun the script to check that your changes were successful.
Hints

To rename the files, you first need to extract and store the part of the file name that is basically a sample ID: the part before _R1.fastq.gz/_R2.fastq.gz. To do that, you will need to use “command substitution” and the basename command. Those were explained on the “Unix shell tips and tricks” (week 6 - Lecture C) page starting here but we did not cover them together in class.

Here is a skeleton of the code to extract that part of the file name:

    # Replace " ... " with your code.
    sample_id=$( ... )

Based on that ID, you can build up the names of the TrimGalore output files and those of the desired final file names. For example:

    # The name of the initial ('init') TrimGalore R1 output file will be:
    R1_out_init="$outdir"/"$sample_id"_R1_val_1.fq.gz

See also this newly added section on renaming files with loops, though note that this situation is different, because you will do the renaming inside the script without a loop.
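As a generic reminder of how basename and command substitution work together, here is a small sketch on a hypothetical text file path that is unrelated to the assignment's FASTQ files. It does not fill in the skeleton above; it only shows the two behaviors of basename that are relevant.

```shell
#!/bin/bash
set -euo pipefail

# Hypothetical file path, unrelated to the assignment's FASTQ files
report_path="results/2024/summary_report.txt"

# basename strips the directory part; a second argument also strips that suffix
file_name=$(basename "$report_path")
file_stem=$(basename "$report_path" .txt)

echo "# File name: $file_name"
echo "# File stem: $file_stem"
```

The $( ... ) construct captures the command's output so that it can be stored in a variable and reused when building new file names.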
Part C: TrimGalore batch jobs in a loop
After your test-run, it is time to run TrimGalore on all samples by submitting batch jobs using a loop.
13. Write a for loop in your README.md to submit a TrimGalore batch job for each pair of FASTQ files that you have in your data dir.
14. Monitor the batch jobs and when they are done, check that everything went well (if it didn’t, redo until you get it right). In your README.md, explain your monitoring and checking process. In this case, it is appropriate to keep the Slurm log files: move them into a dir logs within the TrimGalore output dir.
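A loop of this kind can be tested safely as a dry run before submitting anything. The sketch below is one possible shape, not the expected answer: it uses mock FASTQ files with hypothetical sample names in a throwaway dir, assumes a hypothetical output dir results/trimgalore, and prints each sbatch command with echo instead of running it.

```shell
#!/bin/bash
set -euo pipefail

# Mock FASTQ files in a throwaway dir (hypothetical sample names)
work_dir=$(mktemp -d)
mkdir -p "$work_dir/data"
touch "$work_dir"/data/sampleA_R1.fastq.gz "$work_dir"/data/sampleA_R2.fastq.gz
touch "$work_dir"/data/sampleB_R1.fastq.gz "$work_dir"/data/sampleB_R2.fastq.gz
cd "$work_dir"

n_jobs=0
for R1 in data/*_R1.fastq.gz; do
    # Derive the R2 file name by swapping the read-direction tag
    R2="${R1/_R1.fastq.gz/_R2.fastq.gz}"
    # Dry run: print the command; remove 'echo' to actually submit
    echo sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgalore
    n_jobs=$((n_jobs + 1))
done
echo "# Number of jobs that would be submitted: $n_jobs"
```

Looping over the R1 files only, and deriving each R2 name from its R1 name, ensures that one job is submitted per sample rather than one per file.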
Part D: Publish your repo on GitHub
You’ll publish your Git repo on GitHub and “hand in” your assignment by creating an Issue.
15. Create a repository on GitHub, connect it to your local repo, and push your local repo to GitHub.
16. Create a new issue and tag GitHub users menukabh and jelmerp, asking us to take a look at your assignment.