Solutions for Graded Assignment 4
Part A: Setting up & Git
Setup:
mkdir /fs/ess/PAS2880/users/$USER/GA4Then, click “File” => “Open Folder” in VS Code to open the above folder. When you re-open the terminal, that dir will be your working dir. Avoid navigating anywhere else during the assignment.
Initialize and maintain a Git repo:
# Initialize a repo git init # For the first commit, may create a Gitignore file: echo "results/" > .gitignore echo "data/" >> .gitignore echo "slurm*" >> .gitignore # Can be helpful, so these don't keep showing up git add .gitignore git commit -m "Add a Gitignore file" # After Part B: git add scripts/trimgalore.sh README.md git commit -m "Part B" # After the Bonus section: git add scripts/trimgalore.sh README.md git commit -m "Bonus" # After Part C: git add scripts/trimgalore.sh README.md git commit -m "Part C"Create a README file:
touch README.md # The open it in the editorCreate two dirs and copy the data:
mkdir scripts results cp -rv ../garrigos-data/fastq dataCreate the TrimGalore script:
touch scripts/trimgalore.sh # Then, copy-and-paste the provided code in there
Part B: A TrimGalore batch job
Add Sbatch options to the script:
#!/bin/bash #SBATCH --account=PAS2880 #SBATCH --cpus-per-task=8 #SBATCH --time=30 #SBATCH --output=slurm-trimgalore-%j.out #SBATCH --mail-type=FAIL set -euo pipefailMake TrimGalore use the reserved 8 cores:
The relevant line from the help info:
-j/--cores INT Number of cores to be used for trimming [default: 1]. For Cutadapt to work with multiple cores, it requires Python 3 as well as parallel gzip (pigz) installed on the system. Trim Galore attempts to detect the version of Python used by calling Cutadapt. If Python 2 is detected, --cores is set to 1. If the Python version cannot be detected, Python 3 is assumed and we let Cutadapt handle potential issues itself. If pigz cannot be detected on your system, Trim Galore reverts to using gzip compression. Please note that gzip compression will slow down multi-core processes so much that it is hardly worthwhile, please see: https://github.com/FelixKrueger/TrimGalore/issues/16#issuecomment-458557103 for more info). Actual core usage: It should be mentioned that the actual number of cores used is a little convoluted. Assuming that Python 3 is used and pigz is installed, --cores 2 would use 2 cores to read the input (probably not at a high usage though), 2 cores to write to the output (at moderately high usage), and 2 cores for Cutadapt itself + 2 additional cores for Cutadapt (not sure what they are used for) + 1 core for Trim Galore itself. So this can be up to 9 cores, even though most of them won't be used at 100% for most of the time. Paired-end processing uses twice as many cores for the validation (= writing out) step. --cores 4 would then be: 4 (read) + 4 (write) + 4 (Cutadapt) + 2 (extra Cutadapt) + 1 (Trim Galore) = 15. It seems that --cores 4 could be a sweet spot, anything above has diminishing returns.The option to add to you TrimGalore command:
--cores 8. Your complete TrimGalore command should then look something like this:apptainer exec "$TRIMGALORE_CONTAINER" \ trim_galore \ --cores 8 \ --paired \ --fastqc \ --output_dir "$outdir" \ "$R1" \ "$R2"
Submit the TrimGalore script for one sample:
R1=data/ERR10802863_R1.fastq.gz R2=${R1/_R1/_R2} sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgaloreMonitor the batch job and clean up:
Check the Sbatch queue to see if the job started running and then to check that it is still running at any given point:
squeue -u $USER -lMon Oct 20 13:42:59 2025 JOBID PARTITION NAME USER STATE TIME TIME_LIMI NODES NODELIST(REASON) 37904320 cpu trimgalo jelmer RUNNING 0:05 30:00 1 p0016Once the script is no longer listed by
squeue, it must have finished. Check for the presence of a Slurm log file, and open and read that file:lsdata results scripts slurm-trimgalore-37904320.outless slurm-trimgalore*# Starting script trimgalore.sh Mon Oct 20 01:42:55 PM EDT 2025 # Input R1 FASTQ file: data/ERR10802863_R1.fastq.gz # Input R2 FASTQ file: data/ERR10802863_R2.fastq.gz # Output dir: results/trimgalore INFO: Using cached SIF image INFO: gocryptfs not found, will not be able to use gocryptfs Path to Cutadapt set as: 'cutadapt' (default) Cutadapt seems to be working fine (tested command 'cutadapt --version') Cutadapt version: 4.9 # [...output truncated...]Also check the TrimGalore output files in the specified output dir:
ls -lh results/trimgaloretotal 43M -rw-rw----+ 1 jelmer PAS0471 2.4K Oct 20 13:43 ERR10802863_R1.fastq.gz_trimming_report.txt -rw-rw----+ 1 jelmer PAS0471 675K Oct 20 13:43 ERR10802863_R1_val_1_fastqc.html -rw-rw----+ 1 jelmer PAS0471 348K Oct 20 13:43 ERR10802863_R1_val_1_fastqc.zip -rw-rw----+ 1 jelmer PAS0471 20M Oct 20 13:43 ERR10802863_R1_val_1.fq.gz -rw-rw----+ 1 jelmer PAS0471 2.3K Oct 20 13:43 ERR10802863_R2.fastq.gz_trimming_report.txt -rw-rw----+ 1 jelmer PAS0471 676K Oct 20 13:43 ERR10802863_R2_val_2_fastqc.html -rw-rw----+ 1 jelmer PAS0471 341K Oct 20 13:43 ERR10802863_R2_val_2_fastqc.zip -rw-rw----+ 1 jelmer PAS0471 21M Oct 20 13:43 ERR10802863_R2_val_2.fq.gzClean up:
rm -r results/trimgalore slurm-trimgalore*
In the FastQC outputs for the trimmed R2 file, do you see any evidence for the “poly-G problem”?
See the FastQC output HTML here
Yes, in the “Overrepresented sequences” section, a sequence only containing
Gs is listed:
Find and use the option to deal with two-color chemistry:
The relevant lines in the help info:
--2colour/--nextseq INT This enables the option '--nextseq-trim=3'CUTOFF' within Cutadapt, which will set a quality cutoff (that is normally given with -q instead), but qualities of G bases are ignored. This trimming is in common for the NextSeq- and NovaSeq-platforms, where basecalls without any signal are called as high-quality G bases. This is mutually exlusive with '-q INT'.The option to add to your TrimGalore command:
--2colour 20. We’re using 20 as the quality-score threshold because this is also the default TrimGalore threshold used for “non-two-color” chemistry.Your full TrimGalore command should now look something like this:
apptainer exec "$TRIMGALORE_CONTAINER" \ trim_galore \ --2colour 20 \ --cores 8 \ --paired \ --fastqc \ --output_dir "$outdir" \ "$R1" \ "$R2"
Rerun the script:
Same code as 8. and 9.
Now, no poly-G sequence should be listed in the “Overrepresented sequences” section of the FastQC HTML file for the trimmed R2 FASTQ file.
Bonus: Modify the script to rename the output FASTQ files
A section like this should be added to your TrimGalore script, after running TrimGalore:
# Build the output file names: sample_id=$(basename "$R1" _R1.fastq.gz) R1_out_init="$outdir"/"$sample_id"_R1_val_1.fq.gz R2_out_init="$outdir"/"$sample_id"_R2_val_2.fq.gz R1_out_final="$outdir"/"$sample_id"_R1.fastq.gz R2_out_final="$outdir"/"$sample_id"_R2.fastq.gz # Rename the output files: echo echo "# Renaming the output files:" mv -v "$R1_out_init" "$R1_out_final" mv -v "$R2_out_init" "$R2_out_final"Rerun the script:
R1=data/ERR10802863_R1.fastq.gz R2=${R1/_R1/_R2} sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgaloreCheck the output file names:
ls -lhtotal 42M -rw-rw----+ 1 jelmer PAS0471 20M Oct 20 13:57 ERR10802863_R1.fastq.gz -rw-rw----+ 1 jelmer PAS0471 2.4K Oct 20 13:57 ERR10802863_R1.fastq.gz_trimming_report.txt -rw-rw----+ 1 jelmer PAS0471 679K Oct 20 13:57 ERR10802863_R1_val_1_fastqc.html -rw-rw----+ 1 jelmer PAS0471 356K Oct 20 13:57 ERR10802863_R1_val_1_fastqc.zip -rw-rw----+ 1 jelmer PAS0471 20M Oct 20 13:57 ERR10802863_R2.fastq.gz -rw-rw----+ 1 jelmer PAS0471 2.3K Oct 20 13:57 ERR10802863_R2.fastq.gz_trimming_report.txt -rw-rw----+ 1 jelmer PAS0471 681K Oct 20 13:57 ERR10802863_R2_val_2_fastqc.html -rw-rw----+ 1 jelmer PAS0471 345K Oct 20 13:57 ERR10802863_R2_val_2_fastqc.zip
Part C: TrimGalore batch jobs in a loop
Use a
forloop to submit a batch job for each sample:for R1 in data/*R1.fastq.gz; do R2=${R1/_R1/_R2} sbatch scripts/trimgalore.sh "$R1" "$R2" results/trimgalore doneMonitor the batch jobs – mostly the same procedure as for 9., except:
You won’t want to read every single Slurm log file. Instead, you could e.g. read one or two, and then print and check the last few lines of all files with
tail:tail slurm-trimgalore*Checking your email for any failure emails from Slurm is especially a good idea when you’re running many jobs and might otherwise overlook the failure of one of them.
This time, move (don’t remove) the Slurm log files:
mkdir results/trimgalore/logs mv slurm-trimgalore* results/trimgalore/logs
Part D: Publish your repo on Github
Connect your remote repo and push to it:
# After creating the repo on the GitHub website, connect it, e.g.: git remote add <URL> # Then push the local repo to the remote: git push -u origin main
The final TrimGalore script (including Bonus question)
#!/bin/bash
#SBATCH --account=PAS2880
#SBATCH --cpus-per-task=8
#SBATCH --time=30
#SBATCH --output=slurm-trimgalore-%j.out
#SBATCH --mail-type=FAIL
set -euo pipefail
# Constants
TRIMGALORE_CONTAINER=oras://community.wave.seqera.io/library/trim-galore:0.6.10--bc38c9238980c80e
# Copy the placeholder variables
R1="$1"
R2="$2"
outdir="$3"
# Report
echo "# Starting script trimgalore.sh"
date
echo "# Input R1 FASTQ file: $R1"
echo "# Input R2 FASTQ file: $R2"
echo "# Output dir: $outdir"
echo
# Create the output dir
mkdir -p "$outdir"
# Run TrimGalore
apptainer exec "$TRIMGALORE_CONTAINER" \
trim_galore \
--2colour 20 \
--cores 8 \
--paired \
--fastqc \
--output_dir "$outdir" \
"$R1" \
"$R2"
# Build the output file names:
sample_id=$(basename "$R1" _R1.fastq.gz)
R1_out_init="$outdir"/"$sample_id"_R1_val_1.fq.gz
R2_out_init="$outdir"/"$sample_id"_R2_val_2.fq.gz
R1_out_final="$outdir"/"$sample_id"_R1.fastq.gz
R2_out_final="$outdir"/"$sample_id"_R2.fastq.gz
# Rename the output files:
echo
echo "# Renaming the output files:"
mv -v "$R1_out_init" "$R1_out_final"
mv -v "$R2_out_init" "$R2_out_final"
# Report
echo
echo "# TrimGalore version:"
apptainer exec "$TRIMGALORE_CONTAINER" \
trim_galore -v
echo "# Successfully finished script trimgalore.sh"
date