Solutions for Graded Assignment 4

Author
Affiliation

Jelmer Poelstra

Published

October 20, 2025



Part A: Setting up & Git

  1. Setup:

    mkdir /fs/ess/PAS2880/users/$USER/GA4

    Then, click “File” => “Open Folder” in VS Code to open the above folder. When you re-open the terminal, that dir will be your working dir. Avoid navigating anywhere else during the assignment.

  2. Initialize and maintain a Git repo:

    # Initialize a repo
    git init
    
    # For the first commit, may create a Gitignore file:
    echo "results/" > .gitignore
    echo "data/" >> .gitignore
    echo "slurm*" >> .gitignore  # Can be helpful, so these don't keep showing up
    git add .gitignore
    git commit -m "Add a Gitignore file"
    
    # After Part B:
    git add scripts/trimgalore.sh README.md
    git commit -m "Part B"
    
    # After the Bonus section:
    git add scripts/trimgalore.sh README.md
    git commit -m "Bonus"
    
    # After Part C:
    git add scripts/trimgalore.sh README.md
    git commit -m "Part C"
  3. Create a README file:

    touch README.md
    # The open it in the editor
  4. Create two dirs and copy the data:

    mkdir scripts results
    cp -rv ../garrigos-data/fastq data
  5. Create the TrimGalore script:

    touch scripts/trimgalore.sh
    # Then, copy-and-paste the provided code in there

Part B: A TrimGalore batch job

  1. Add Sbatch options to the script:

    #!/bin/bash
    #SBATCH --account=PAS2880
    #SBATCH --cpus-per-task=8
    #SBATCH --time=30
    #SBATCH --output=slurm-trimgalore-%j.out
    #SBATCH --mail-type=FAIL
    set -euo pipefail
  2. Make TrimGalore use the reserved 8 cores:

    • The relevant line from the help info:

      -j/--cores INT          Number of cores to be used for trimming [default: 1]. For Cutadapt to work with multiple cores, it
                         requires Python 3 as well as parallel gzip (pigz) installed on the system. Trim Galore attempts to detect
                         the version of Python used by calling Cutadapt. If Python 2 is detected, --cores is set to 1. If the Python
                         version cannot be detected, Python 3 is assumed and we let Cutadapt handle potential issues itself.
      
                         If pigz cannot be detected on your system, Trim Galore reverts to using gzip compression. Please note
                         that gzip compression will slow down multi-core processes so much that it is hardly worthwhile, please 
                         see: https://github.com/FelixKrueger/TrimGalore/issues/16#issuecomment-458557103 for more info).
      
                         Actual core usage: It should be mentioned that the actual number of cores used is a little convoluted.
                         Assuming that Python 3 is used and pigz is installed, --cores 2 would use 2 cores to read the input
                         (probably not at a high usage though), 2 cores to write to the output (at moderately high usage), and 
                         2 cores for Cutadapt itself + 2 additional cores for Cutadapt (not sure what they are used for) + 1 core
                         for Trim Galore itself. So this can be up to 9 cores, even though most of them won't be used at 100% for
                         most of the time. Paired-end processing uses twice as many cores for the validation (= writing out) step.
                         --cores 4 would then be: 4 (read) + 4 (write) + 4 (Cutadapt) + 2 (extra Cutadapt) +     1 (Trim Galore) = 15.
      
                         It seems that --cores 4 could be a sweet spot, anything above has diminishing returns.
    • The option to add to you TrimGalore command: --cores 8. Your complete TrimGalore command should then look something like this:

      apptainer exec "$TRIMGALORE_CONTAINER" \
         trim_galore \
         --cores 8 \
         --paired \
         --fastqc \
         --output_dir "$outdir" \
         "$R1" \
         "$R2"
  3. Submit the TrimGalore script for one sample:

    R1=data/ERR10802863_R1.fastq.gz
    R2=${R1/_R1/_R2}
    sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgalore
  4. Monitor the batch job and clean up:

    • Check the Sbatch queue to see if the job started running and then to check that it is still running at any given point:

      squeue -u $USER -l
      Mon Oct 20 13:42:59 2025
              JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
           37904320       cpu trimgalo   jelmer  RUNNING       0:05     30:00      1 p0016
    • Once the script is no longer listed by squeue, it must have finished. Check for the presence of a Slurm log file, and open and read that file:

      ls
      data  results  scripts  slurm-trimgalore-37904320.out
      less slurm-trimgalore*
      # Starting script trimgalore.sh
      Mon Oct 20 01:42:55 PM EDT 2025
      # Input R1 FASTQ file:      data/ERR10802863_R1.fastq.gz
      # Input R2 FASTQ file:      data/ERR10802863_R2.fastq.gz
      # Output dir:               results/trimgalore
      
      INFO:    Using cached SIF image
      INFO:    gocryptfs not found, will not be able to use gocryptfs
      Path to Cutadapt set as: 'cutadapt' (default)
      Cutadapt seems to be working fine (tested command 'cutadapt --version')
      Cutadapt version: 4.9
      # [...output truncated...]
    • Also check the TrimGalore output files in the specified output dir:

      ls -lh results/trimgalore
      total 43M
      -rw-rw----+ 1 jelmer PAS0471 2.4K Oct 20 13:43 ERR10802863_R1.fastq.gz_trimming_report.txt
      -rw-rw----+ 1 jelmer PAS0471 675K Oct 20 13:43 ERR10802863_R1_val_1_fastqc.html
      -rw-rw----+ 1 jelmer PAS0471 348K Oct 20 13:43 ERR10802863_R1_val_1_fastqc.zip
      -rw-rw----+ 1 jelmer PAS0471  20M Oct 20 13:43 ERR10802863_R1_val_1.fq.gz
      -rw-rw----+ 1 jelmer PAS0471 2.3K Oct 20 13:43 ERR10802863_R2.fastq.gz_trimming_report.txt
      -rw-rw----+ 1 jelmer PAS0471 676K Oct 20 13:43 ERR10802863_R2_val_2_fastqc.html
      -rw-rw----+ 1 jelmer PAS0471 341K Oct 20 13:43 ERR10802863_R2_val_2_fastqc.zip
      -rw-rw----+ 1 jelmer PAS0471  21M Oct 20 13:43 ERR10802863_R2_val_2.fq.gz
    • Clean up:

      rm -r results/trimgalore slurm-trimgalore*
  5. In the FastQC outputs for the trimmed R2 file, do you see any evidence for the “poly-G problem”?

    • See the FastQC output HTML here

    • Yes, in the “Overrepresented sequences” section, a sequence only containing Gs is listed:

  6. Find and use the option to deal with two-color chemistry:

    • The relevant lines in the help info:

      --2colour/--nextseq INT This enables the option '--nextseq-trim=3'CUTOFF' within Cutadapt, which will set a quality
                        cutoff (that is normally given with -q instead), but qualities of G bases are ignored.
                        This trimming is in common for the NextSeq- and NovaSeq-platforms, where basecalls without
                        any signal are called as high-quality G bases. This is mutually exlusive with '-q INT'.
    • The option to add to your TrimGalore command: --2colour 20. We’re using 20 as the quality-score threshold because this is also the default TrimGalore threshold used for “non-two-color” chemistry.

      Your full TrimGalore command should now look something like this:

      apptainer exec "$TRIMGALORE_CONTAINER" \
          trim_galore \
          --2colour 20 \
          --cores 8 \
          --paired \
          --fastqc \
          --output_dir "$outdir" \
          "$R1" \
          "$R2"
  7. Rerun the script:

    • Same code as 8. and 9.

    • Now, no poly-G sequence should be listed in the “Overrepresented sequences” section of the FastQC HTML file for the trimmed R2 FASTQ file.

Bonus: Modify the script to rename the output FASTQ files

  • A section like this should be added to your TrimGalore script, after running TrimGalore:

    # Build the output file names:
    sample_id=$(basename "$R1" _R1.fastq.gz)
    R1_out_init="$outdir"/"$sample_id"_R1_val_1.fq.gz
    R2_out_init="$outdir"/"$sample_id"_R2_val_2.fq.gz
    R1_out_final="$outdir"/"$sample_id"_R1.fastq.gz
    R2_out_final="$outdir"/"$sample_id"_R2.fastq.gz
    
    # Rename the output files:
    echo
    echo "# Renaming the output files:"
    mv -v "$R1_out_init" "$R1_out_final"
    mv -v "$R2_out_init" "$R2_out_final"
  • Rerun the script:

    R1=data/ERR10802863_R1.fastq.gz
    R2=${R1/_R1/_R2}
    sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgalore
  • Check the output file names:

    ls -lh
    total 42M
    -rw-rw----+ 1 jelmer PAS0471  20M Oct 20 13:57 ERR10802863_R1.fastq.gz
    -rw-rw----+ 1 jelmer PAS0471 2.4K Oct 20 13:57 ERR10802863_R1.fastq.gz_trimming_report.txt
    -rw-rw----+ 1 jelmer PAS0471 679K Oct 20 13:57 ERR10802863_R1_val_1_fastqc.html
    -rw-rw----+ 1 jelmer PAS0471 356K Oct 20 13:57 ERR10802863_R1_val_1_fastqc.zip
    -rw-rw----+ 1 jelmer PAS0471  20M Oct 20 13:57 ERR10802863_R2.fastq.gz
    -rw-rw----+ 1 jelmer PAS0471 2.3K Oct 20 13:57 ERR10802863_R2.fastq.gz_trimming_report.txt
    -rw-rw----+ 1 jelmer PAS0471 681K Oct 20 13:57 ERR10802863_R2_val_2_fastqc.html
    -rw-rw----+ 1 jelmer PAS0471 345K Oct 20 13:57 ERR10802863_R2_val_2_fastqc.zip

Part C: TrimGalore batch jobs in a loop

  1. Use a for loop to submit a batch job for each sample:

    for R1 in data/*R1.fastq.gz; do
        R2=${R1/_R1/_R2}
        sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgalore
    done
  2. Monitor the batch jobs – mostly the same procedure as for 9., except:

    • You won’t want to read every single Slurm log file. Instead, you could e.g. read one or two, and then print and check the last few lines of all files with tail:

      tail slurm-trimgalore*
    • Checking your email for any failure emails from Slurm is especially a good idea when you’re running many jobs and might otherwise overlook the failure of one of them.

    • This time, move (don’t remove) the Slurm log files:

      mkdir results/trimgalore/logs
      mv slurm-trimgalore* results/trimgalore/logs

Part D: Publish your repo on Github

  1. Connect your remote repo and push to it:

    # After creating the repo on the GitHub website, connect it, e.g.:
    git remote add <URL>
    # Then push the local repo to the remote:
    git push -u origin main

The final TrimGalore script (including Bonus question)

#!/bin/bash
#SBATCH --account=PAS2880
#SBATCH --cpus-per-task=8
#SBATCH --time=30
#SBATCH --output=slurm-trimgalore-%j.out
#SBATCH --mail-type=FAIL
set -euo pipefail
   
# Constants
TRIMGALORE_CONTAINER=oras://community.wave.seqera.io/library/trim-galore:0.6.10--bc38c9238980c80e
   
# Copy the placeholder variables
R1="$1"
R2="$2"
outdir="$3"
   
# Report
echo "# Starting script trimgalore.sh"
date
echo "# Input R1 FASTQ file:      $R1"
echo "# Input R2 FASTQ file:      $R2"
echo "# Output dir:               $outdir"
echo
   
# Create the output dir
mkdir -p "$outdir"
   
# Run TrimGalore
apptainer exec "$TRIMGALORE_CONTAINER" \
    trim_galore \
    --2colour 20 \
    --cores 8 \
    --paired \
    --fastqc \
    --output_dir "$outdir" \
    "$R1" \
    "$R2"

# Build the output file names:
sample_id=$(basename "$R1" _R1.fastq.gz)
R1_out_init="$outdir"/"$sample_id"_R1_val_1.fq.gz
R2_out_init="$outdir"/"$sample_id"_R2_val_2.fq.gz
R1_out_final="$outdir"/"$sample_id"_R1.fastq.gz
R2_out_final="$outdir"/"$sample_id"_R2.fastq.gz
  
# Rename the output files:
echo
echo "# Renaming the output files:"
mv -v "$R1_out_init" "$R1_out_final"
mv -v "$R2_out_init" "$R2_out_final"
  
# Report
echo
echo "# TrimGalore version:"
apptainer exec "$TRIMGALORE_CONTAINER" \
    trim_galore -v
echo "# Successfully finished script trimgalore.sh"
date
Back to top