Solutions for Graded Assignment 4

Author

Affiliation

Jelmer Poelstra

CFAES Bioinformatics Core, Ohio State University

Published

October 20, 2025

Part A: Setting up & Git

Setup:
```
mkdir /fs/ess/PAS2880/users/$USER/GA4
```
Then, click “File” => “Open Folder” in VS Code to open the above folder. When you re-open the terminal, that dir will be your working dir. Avoid navigating anywhere else during the assignment.

Initialize and maintain a Git repo:

# Initialize a repo
git init

# For the first commit, may create a Gitignore file:
echo "results/" > .gitignore
echo "data/" >> .gitignore
echo "slurm*" >> .gitignore  # Can be helpful, so these don't keep showing up
git add .gitignore
git commit -m "Add a Gitignore file"

# After Part B:
git add scripts/trimgalore.sh README.md
git commit -m "Part B"

# After the Bonus section:
git add scripts/trimgalore.sh README.md
git commit -m "Bonus"

# After Part C:
git add scripts/trimgalore.sh README.md
git commit -m "Part C"

Create a README file:

touch README.md
# The open it in the editor

Create two dirs and copy the data:

mkdir scripts results
cp -rv ../garrigos-data/fastq data

Create the TrimGalore script:

touch scripts/trimgalore.sh
# Then, copy-and-paste the provided code in there

Part B: A TrimGalore batch job

Add Sbatch options to the script:

#!/bin/bash
#SBATCH --account=PAS2880
#SBATCH --cpus-per-task=8
#SBATCH --time=30
#SBATCH --output=slurm-trimgalore-%j.out
#SBATCH --mail-type=FAIL
set -euo pipefail

Make TrimGalore use the reserved 8 cores:

The relevant line from the help info:

-j/--cores INT          Number of cores to be used for trimming [default: 1]. For Cutadapt to work with multiple cores, it
                   requires Python 3 as well as parallel gzip (pigz) installed on the system. Trim Galore attempts to detect
                   the version of Python used by calling Cutadapt. If Python 2 is detected, --cores is set to 1. If the Python
                   version cannot be detected, Python 3 is assumed and we let Cutadapt handle potential issues itself.

                   If pigz cannot be detected on your system, Trim Galore reverts to using gzip compression. Please note
                   that gzip compression will slow down multi-core processes so much that it is hardly worthwhile, please 
                   see: https://github.com/FelixKrueger/TrimGalore/issues/16#issuecomment-458557103 for more info).

                   Actual core usage: It should be mentioned that the actual number of cores used is a little convoluted.
                   Assuming that Python 3 is used and pigz is installed, --cores 2 would use 2 cores to read the input
                   (probably not at a high usage though), 2 cores to write to the output (at moderately high usage), and 
                   2 cores for Cutadapt itself + 2 additional cores for Cutadapt (not sure what they are used for) + 1 core
                   for Trim Galore itself. So this can be up to 9 cores, even though most of them won't be used at 100% for
                   most of the time. Paired-end processing uses twice as many cores for the validation (= writing out) step.
                   --cores 4 would then be: 4 (read) + 4 (write) + 4 (Cutadapt) + 2 (extra Cutadapt) +     1 (Trim Galore) = 15.

                   It seems that --cores 4 could be a sweet spot, anything above has diminishing returns.

The option to add to you TrimGalore command: --cores 8. Your complete TrimGalore command should then look something like this:

apptainer exec "$TRIMGALORE_CONTAINER" \
   trim_galore \
   --cores 8 \
   --paired \
   --fastqc \
   --output_dir "$outdir" \
   "$R1" \
   "$R2"

Submit the TrimGalore script for one sample:

R1=data/ERR10802863_R1.fastq.gz
R2=${R1/_R1/_R2}
sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgalore

Monitor the batch job and clean up:

Check the Sbatch queue to see if the job started running and then to check that it is still running at any given point:

squeue -u $USER -l

Mon Oct 20 13:42:59 2025
        JOBID PARTITION     NAME     USER    STATE       TIME TIME_LIMI  NODES NODELIST(REASON)
     37904320       cpu trimgalo   jelmer  RUNNING       0:05     30:00      1 p0016

Once the script is no longer listed by squeue, it must have finished. Check for the presence of a Slurm log file, and open and read that file:

ls

data  results  scripts  slurm-trimgalore-37904320.out

less slurm-trimgalore*

# Starting script trimgalore.sh
Mon Oct 20 01:42:55 PM EDT 2025
# Input R1 FASTQ file:      data/ERR10802863_R1.fastq.gz
# Input R2 FASTQ file:      data/ERR10802863_R2.fastq.gz
# Output dir:               results/trimgalore

INFO:    Using cached SIF image
INFO:    gocryptfs not found, will not be able to use gocryptfs
Path to Cutadapt set as: 'cutadapt' (default)
Cutadapt seems to be working fine (tested command 'cutadapt --version')
Cutadapt version: 4.9
# [...output truncated...]

Also check the TrimGalore output files in the specified output dir:

ls -lh results/trimgalore

total 43M
-rw-rw----+ 1 jelmer PAS0471 2.4K Oct 20 13:43 ERR10802863_R1.fastq.gz_trimming_report.txt
-rw-rw----+ 1 jelmer PAS0471 675K Oct 20 13:43 ERR10802863_R1_val_1_fastqc.html
-rw-rw----+ 1 jelmer PAS0471 348K Oct 20 13:43 ERR10802863_R1_val_1_fastqc.zip
-rw-rw----+ 1 jelmer PAS0471  20M Oct 20 13:43 ERR10802863_R1_val_1.fq.gz
-rw-rw----+ 1 jelmer PAS0471 2.3K Oct 20 13:43 ERR10802863_R2.fastq.gz_trimming_report.txt
-rw-rw----+ 1 jelmer PAS0471 676K Oct 20 13:43 ERR10802863_R2_val_2_fastqc.html
-rw-rw----+ 1 jelmer PAS0471 341K Oct 20 13:43 ERR10802863_R2_val_2_fastqc.zip
-rw-rw----+ 1 jelmer PAS0471  21M Oct 20 13:43 ERR10802863_R2_val_2.fq.gz

Clean up:

rm -r results/trimgalore slurm-trimgalore*

In the FastQC outputs for the trimmed R2 file, do you see any evidence for the “poly-G problem”?
- See the FastQC output HTML here
- Yes, in the “Overrepresented sequences” section, a sequence only containing Gs is listed:

Find and use the option to deal with two-color chemistry:

The relevant lines in the help info:

--2colour/--nextseq INT This enables the option '--nextseq-trim=3'CUTOFF' within Cutadapt, which will set a quality
                  cutoff (that is normally given with -q instead), but qualities of G bases are ignored.
                  This trimming is in common for the NextSeq- and NovaSeq-platforms, where basecalls without
                  any signal are called as high-quality G bases. This is mutually exlusive with '-q INT'.

The option to add to your TrimGalore command: --2colour 20. We’re using 20 as the quality-score threshold because this is also the default TrimGalore threshold used for “non-two-color” chemistry.

Your full TrimGalore command should now look something like this:
```
apptainer exec "$TRIMGALORE_CONTAINER" \
    trim_galore \
    --2colour 20 \
    --cores 8 \
    --paired \
    --fastqc \
    --output_dir "$outdir" \
    "$R1" \
    "$R2"
```

Rerun the script:
- Same code as 8. and 9.
- Now, no poly-G sequence should be listed in the “Overrepresented sequences” section of the FastQC HTML file for the trimmed R2 FASTQ file.

Bonus: Modify the script to rename the output FASTQ files

A section like this should be added to your TrimGalore script, after running TrimGalore:

# Build the output file names:
sample_id=$(basename "$R1" _R1.fastq.gz)
R1_out_init="$outdir"/"$sample_id"_R1_val_1.fq.gz
R2_out_init="$outdir"/"$sample_id"_R2_val_2.fq.gz
R1_out_final="$outdir"/"$sample_id"_R1.fastq.gz
R2_out_final="$outdir"/"$sample_id"_R2.fastq.gz

# Rename the output files:
echo
echo "# Renaming the output files:"
mv -v "$R1_out_init" "$R1_out_final"
mv -v "$R2_out_init" "$R2_out_final"

Rerun the script:

R1=data/ERR10802863_R1.fastq.gz
R2=${R1/_R1/_R2}
sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgalore

Check the output file names:

ls -lh

total 42M
-rw-rw----+ 1 jelmer PAS0471  20M Oct 20 13:57 ERR10802863_R1.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 2.4K Oct 20 13:57 ERR10802863_R1.fastq.gz_trimming_report.txt
-rw-rw----+ 1 jelmer PAS0471 679K Oct 20 13:57 ERR10802863_R1_val_1_fastqc.html
-rw-rw----+ 1 jelmer PAS0471 356K Oct 20 13:57 ERR10802863_R1_val_1_fastqc.zip
-rw-rw----+ 1 jelmer PAS0471  20M Oct 20 13:57 ERR10802863_R2.fastq.gz
-rw-rw----+ 1 jelmer PAS0471 2.3K Oct 20 13:57 ERR10802863_R2.fastq.gz_trimming_report.txt
-rw-rw----+ 1 jelmer PAS0471 681K Oct 20 13:57 ERR10802863_R2_val_2_fastqc.html
-rw-rw----+ 1 jelmer PAS0471 345K Oct 20 13:57 ERR10802863_R2_val_2_fastqc.zip

Part C: TrimGalore batch jobs in a loop

Use a for loop to submit a batch job for each sample:

for R1 in data/*R1.fastq.gz; do
    R2=${R1/_R1/_R2}
    sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgalore
done

Monitor the batch jobs – mostly the same procedure as for 9., except:
- You won’t want to read every single Slurm log file. Instead, you could e.g. read one or two, and then print and check the last few lines of all files with tail:
```
tail slurm-trimgalore*
```
- Checking your email for any failure emails from Slurm is especially a good idea when you’re running many jobs and might otherwise overlook the failure of one of them.
- This time, move (don’t remove) the Slurm log files:
```
mkdir results/trimgalore/logs
mv slurm-trimgalore* results/trimgalore/logs
```

Part D: Publish your repo on Github

Connect your remote repo and push to it:

# After creating the repo on the GitHub website, connect it, e.g.:
git remote add <URL>
# Then push the local repo to the remote:
git push -u origin main

The final TrimGalore script (including Bonus question)

#!/bin/bash
#SBATCH --account=PAS2880
#SBATCH --cpus-per-task=8
#SBATCH --time=30
#SBATCH --output=slurm-trimgalore-%j.out
#SBATCH --mail-type=FAIL
set -euo pipefail
   
# Constants
TRIMGALORE_CONTAINER=oras://community.wave.seqera.io/library/trim-galore:0.6.10--bc38c9238980c80e
   
# Copy the placeholder variables
R1="$1"
R2="$2"
outdir="$3"
   
# Report
echo "# Starting script trimgalore.sh"
date
echo "# Input R1 FASTQ file:      $R1"
echo "# Input R2 FASTQ file:      $R2"
echo "# Output dir:               $outdir"
echo
   
# Create the output dir
mkdir -p "$outdir"
   
# Run TrimGalore
apptainer exec "$TRIMGALORE_CONTAINER" \
    trim_galore \
    --2colour 20 \
    --cores 8 \
    --paired \
    --fastqc \
    --output_dir "$outdir" \
    "$R1" \
    "$R2"

# Build the output file names:
sample_id=$(basename "$R1" _R1.fastq.gz)
R1_out_init="$outdir"/"$sample_id"_R1_val_1.fq.gz
R2_out_init="$outdir"/"$sample_id"_R2_val_2.fq.gz
R1_out_final="$outdir"/"$sample_id"_R1.fastq.gz
R2_out_final="$outdir"/"$sample_id"_R2.fastq.gz
  
# Rename the output files:
echo
echo "# Renaming the output files:"
mv -v "$R1_out_init" "$R1_out_final"
mv -v "$R2_out_init" "$R2_out_final"
  
# Report
echo
echo "# TrimGalore version:"
apptainer exec "$TRIMGALORE_CONTAINER" \
    trim_galore -v
echo "# Successfully finished script trimgalore.sh"
date