Graded Assignment 4: Slurm batch jobs

Author: Jelmer Poelstra

Published: October 11, 2025



Overview

This graded assignment is worth 10 points and is due on Monday Oct 20th. You will practice primarily with Slurm batch jobs.

Assumed knowledge

This assignment assumes that you have done the week 7 exercises, which introduced and used TrimGalore. If you haven’t done these yet, we strongly recommend that you do so before attempting this assignment.

Directions and grading

Submission expectations

  • Deadline: Monday Oct 20th at noon (12:00 pm). (You are being given a bit of extra time because of fall break.)
  • Submission: You will submit your assignment by tagging the instructors in an Issue in your GitHub repository, as described in the last step below.

Academic integrity

Gen-AI check

For this assignment, you are allowed to search the internet but not to use generative AI (so you cannot use, e.g., Google AI Mode output either).

If you use commands or code constructs we did not cover in class, you must provide the source: webpage or book page, etc. You may also argue that you already knew the command in question from previous self-study, but should then be prepared to answer live questions about it over Zoom. If you don’t provide a source or otherwise explain your usage of such commands, your answer will be considered wrong and will not earn any points.


Use of generative AI Tools (e.g. ChatGPT, Microsoft Copilot, Google Gemini) is not permitted.
Getting help on the assignment is not permitted
Collaborating, or completing the assignment with others, is not permitted
Copying or reusing previous work is not permitted
Open-book research for the assignment is permitted
APA Citations and/or formatting for this assignment are not required

Rubric

You can earn a total of 10 points: one point each for questions 6-14 and the Git/GitHub part. The Bonus question can earn you up to an additional point if you lost points elsewhere in the assignment.

Detailed steps

Part A: Setting up & Git

  1. Start a VS Code session at OSC in the folder /fs/ess/PAS2880/users/$USER. Create a new dir for this assignment, /fs/ess/PAS2880/users/$USER/GA4, and switch to that folder in VS Code using the “Open Folder” option. This should be your working dir for the entire assignment.

  2. Initialize a Git repository. Commit to the repo throughout the assignment as you see fit, but at least once for each “Part” of this assignment. Use and commit a .gitignore file as appropriate. (Tip: Many of you would do well to recall the feedback you got on your Git repo from the previous assignment!)

  3. Inside dir GA4, create a README.md file and open it in the VS Code editor. Use this file throughout the assignment to add your answers where appropriate.

  4. Inside dir GA4, also create dirs scripts and results. We’ll make your dir self-contained again (like a typical situation for a research project) by copying some data into it:

    # (This command will work as intended if you *don't* have a data dir yet) 
    cp -rv ../garrigos-data/fastq data
  5. As a starting script, store the code below in a shell script scripts/trimgalore.sh. This is a script to run TrimGalore, which should be very similar to your final script from last week’s exercises.

    #!/bin/bash
    set -euo pipefail
    
    # Constants
    TRIMGALORE_CONTAINER=oras://community.wave.seqera.io/library/trim-galore:0.6.10--bc38c9238980c80e
    
    # Copy the placeholder variables
    R1="$1"
    R2="$2"
    outdir="$3"
    
    # Report
    echo "# Starting script trimgalore.sh"
    date
    echo "# Input R1 FASTQ file:      $R1"
    echo "# Input R2 FASTQ file:      $R2"
    echo "# Output dir:               $outdir"
    echo
    
    # Create the output dir
    mkdir -p "$outdir"
    
    # Run TrimGalore
    apptainer exec "$TRIMGALORE_CONTAINER" \
        trim_galore \
        --paired \
        --fastqc \
        --output_dir "$outdir" \
        "$R1" \
        "$R2"
    
    # Report
    echo
    echo "# TrimGalore version:"
    apptainer exec "$TRIMGALORE_CONTAINER" \
      trim_galore -v
    echo "# Successfully finished script trimgalore.sh"
    date
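
As a rough sketch of the setup in steps 1-4, the commands below create the dir layout in a temporary stand-in location rather than in the real project dir. The temp dir and the three .gitignore patterns are assumptions for illustration only; what your own repo should ignore is for you to decide.

```shell
# Hypothetical stand-in for /fs/ess/PAS2880/users/$USER/GA4, so that this
# sketch can be tried anywhere without touching the real project dir
proj_dir="$(mktemp -d)/GA4"
mkdir -p "$proj_dir"/scripts "$proj_dir"/results
cd "$proj_dir"

# Empty README.md that will hold the answers
touch README.md

# Assumed ignore patterns (illustrative): copied data, generated results,
# and Slurm log files
printf '%s\n' 'data/' 'results/' 'slurm*.out' > .gitignore

# Show the resulting layout, including hidden files
ls -A
```

In the real GA4 dir, `git init` followed by `git add` and `git commit` would then put README.md, .gitignore, and the scripts dir under version control.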

Part B: A TrimGalore batch job

You’ll start with the TrimGalore script you just created, which should be much like that from last week’s exercises. But this time, instead of running the TrimGalore script “directly” with bash, you will submit it as a batch job. Then, in the next section, you will submit many batch jobs at the same time: one for each sample.
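
To make the difference between the two ways of running the script concrete, here is a minimal dry-run sketch that only prints the two commands. The sample ID, the FASTQ file names under data, and the output dir results/trimgalore are assumptions for illustration.

```shell
# Arguments for one sample (file names and output dir are assumed)
sample=ERR10802863
script_args="data/${sample}_R1.fastq.gz data/${sample}_R2.fastq.gz results/trimgalore"

# Running "directly": the script starts right away on the node you are on
direct_cmd="bash scripts/trimgalore.sh $script_args"

# Submitting as a batch job: Slurm queues the script and runs it on a
# compute node once the requested resources are available
batch_cmd="sbatch scripts/trimgalore.sh $script_args"

# Dry run: print both commands instead of executing them
echo "$direct_cmd"
echo "$batch_cmd"
```

The script and its arguments are identical in both cases; only the command in front of the script changes.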

  1. Add Sbatch options to the top of the TrimGalore shell script to specify:

    • The account/project you want to use
    • The number of cores you want to reserve: use 8
    • The amount of time you want to reserve: use 30 minutes
    • The desired file name of the Slurm log file
    • That Slurm should email you upon job failure
    • Optional: you can try other Sbatch options you’d like to test
  2. By printing and scanning through the TrimGalore help info once again (see last week’s exercises), find the TrimGalore option that specifies how many cores it can use – add the relevant line(s) from the TrimGalore help info to your README.md. In the script, change the trim_galore command accordingly to use the available number of cores.

  3. To test the script and batch job submission, submit the script as a batch job only for sample ERR10802863.

  4. Monitor the job, and when it’s done, check that everything went well (if it didn’t, redo until you get it right). In your README.md, explain your monitoring and checking process. Then, remove all outputs (Slurm log files and TrimGalore output files) produced by this test-run.

  5. Illumina sequencing uses colors to distinguish between nucleotides as they are being added during the sequencing-by-synthesis process. However, newer Illumina machines (NextSeq and NovaSeq) use a different color chemistry than older ones, and this newer chemistry suffers from an artefact that can erroneously produce strings of Gs (“poly-G”) with high quality scores. Those Gs should really be Ns instead, and they occur especially at the end of reverse (R2) reads.

    In the FastQC outputs for the R2 file that you just produced with TrimGalore (recall that it runs FastQC after trimming!), do you see any evidence for this problem? Explain.

  6. We’ll assume that the data was indeed produced with the newer Illumina color chemistry. In the TrimGalore help info, find the relevant TrimGalore option to deal with the poly-G problem, and again add the relevant line(s) from the help info to your README.md. Then, use the TrimGalore option you found, but don’t change the quality score threshold from the default.

  7. Rerun TrimGalore with the added color-chemistry option. Check all outputs and confirm that usage of this option made a difference. Then, remove all outputs produced by this test-run again.
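
As one possible shape for the top of the script after step 1, the sketch below places the Sbatch options as `#SBATCH` comment lines directly under the shebang line. The account name PAS2880 (inferred from the project dir path) and the log file name are assumptions. The `n_cores` variable is a hypothetical helper that picks up the number of reserved cores; the TrimGalore option to pass it to is deliberately left out.

```shell
#!/bin/bash
#SBATCH --account=PAS2880
#SBATCH --cpus-per-task=8
#SBATCH --time=00:30:00
#SBATCH --output=slurm-trimgalore-%j.out
#SBATCH --mail-type=FAIL

# (The rest of the existing script would follow from here.)

# Inside a job, Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# the fallback value of 1 only applies when the script runs outside Slurm
n_cores="${SLURM_CPUS_PER_TASK:-1}"
echo "# Number of cores available:  $n_cores"
```

In the log file name, `%j` is replaced by the job ID, so each run gets its own log file. After submission, `squeue -u $USER` lists your queued and running jobs, and the log file collects everything the script printed.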

Bonus: Modify the script to rename the output FASTQ files

If you choose not to do this Bonus part, you can simply move on to Part C.

The TrimGalore output FASTQ files are oddly named, ending in _R1_val_1.fq.gz and _R2_val_2.fq.gz – check the output files from your initial run to see this. This is not necessarily a problem, but could trip you up in a next step with these files.

Therefore, modify your TrimGalore script to rename the output files after running TrimGalore, giving them the same names as the input files. Then, rerun the script to check that your changes were successful.

Hints

To rename the files, you first need to extract and store the part of the file name that is basically a sample ID: the part before _R1.fastq.gz/_R2.fastq.gz. To do that, you will need to use “command substitution” and the basename command. Those are explained on the “Unix shell tips and tricks” page (week 6 - Lecture C), but we did not cover them together in class.

Here is a skeleton of the code to extract that part of the file name:

# Replace " ... " with your code.
sample_id=$( ... )

Based on that ID, you can build up the names of the TrimGalore output files and those of the desired final file names. For example:

# The name of the initial ('init') TrimGalore R1 output file will be:
R1_out_init="$outdir"/"$sample_id"_R1_val_1.fq.gz

See also the newly added section on renaming files with loops, but note that this situation is different, because you will do the renaming inside the script without a loop.
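
To illustrate the two building blocks named in these hints, the sketch below applies command substitution and basename to an empty placeholder file and then renames it with mv. The file name sampleA and the temp dir are hypothetical, and the suffix that is stripped here differs from the one you need for the sample ID, so treat this as an illustration of the technique rather than as the solution.

```shell
# Hypothetical placeholder file in a temp dir (empty, not real data)
demo_dir=$(mktemp -d)
file_in="$demo_dir"/sampleA_R1_val_1.fq.gz
touch "$file_in"

# basename removes the leading dir and, given a second argument, also that
# suffix; command substitution $( ... ) stores the result in a variable
prefix=$(basename "$file_in" _val_1.fq.gz)
echo "$prefix"

# Build the new file name from the stored prefix and rename the file
file_out="$demo_dir"/"$prefix".fastq.gz
mv "$file_in" "$file_out"
ls "$demo_dir"
```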

Part C: TrimGalore batch jobs in a loop

After your test-run, it is time to run TrimGalore on all samples by submitting batch jobs using a loop.

  1. Write a for loop in your README.md to submit a TrimGalore batch job for each pair of FASTQ files that you have in your data dir.

  2. Monitor the batch jobs and when they are done, check that everything went well (if it didn’t, redo until you get it right). In your README.md, explain your monitoring and checking process. In this case, it is appropriate to keep the Slurm log files: move them into a dir logs within the TrimGalore output dir.
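
The general pattern for such a loop can be sketched as a dry run on empty placeholder files. The temp dir, the sampleA/sampleB file names, and the output dir results/trimgalore are hypothetical, and deriving the R2 name with `${R1%...}` is only one of several ways to pair the files.

```shell
# Hypothetical demo dir with empty placeholder files (not real data)
demo_dir=$(mktemp -d)
touch "$demo_dir"/sampleA_R1.fastq.gz "$demo_dir"/sampleA_R2.fastq.gz \
      "$demo_dir"/sampleB_R1.fastq.gz "$demo_dir"/sampleB_R2.fastq.gz

n_submitted=0
# Loop over the R1 files only, so that each pair is handled exactly once
for R1 in "$demo_dir"/*_R1.fastq.gz; do
    # Derive the matching R2 name by swapping the file name suffix
    R2="${R1%_R1.fastq.gz}_R2.fastq.gz"

    # 'echo' makes this a dry run: the command is printed, not executed
    echo sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgalore
    n_submitted=$((n_submitted + 1))
done
echo "# Number of jobs that would be submitted: $n_submitted"
```

Printing the commands first is a cheap way to check the file pairing before any jobs are submitted; removing `echo` in front of `sbatch` would turn the dry run into real submissions.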

Part D: Publish your repo on GitHub

You’ll publish your Git repo on GitHub and “hand in” your assignment by creating an Issue.

  1. Create a repository on GitHub, connect it to your local repo, and push your local repo to GitHub.

  2. Create a new issue and tag GitHub users menukabh and jelmerp, asking us to take a look at your assignment.
