If you did last week’s exercises, much of the following introduction will be familiar to you, as we will be working with the same FASTQ files and looking at the same primers. This time around, you will actually remove the primer sequences using the software Cutadapt.
This assignment will work with 6 FASTQ files with sequences from the V4 region of 16S rRNA, generated in a metabarcoding experiment.
The FASTQ files come in pairs: for every sample, there is a FASTQ file with forward reads (or “read 1” reads) that contains _R1_
in its file name, and a FASTQ file with corresponding reverse reads (or “read 2” reads) that contains _R2_
in its file name. So, our 6 FASTQ files consist of 3 pairs of R1
-R2
files for 3 biological samples.
The sequences were generated by first amplifying environmental samples with a pair of universal 16S primers, and these primer sequences are expected to be present in the FASTQ sequences. You will remove these primer sequences with the program Cutadapt, and there are two things to be aware of:
R
means that the site can be either an A
or a G
, and an N
means that the site can be any of the four bases. See here for a complete overview of these ambiguity codes.These are the primer sequences1:
GAGTGYCAGCMGCCGCGGTAA
.ACGGACTACNVGGGTWTCTAAT
.For each numbered step below, you should create at least one Git commit.
All files should be added to your repository, unless mentioned otherwise.
Create a new directory for this assignment, and inside it, initialize a Git repository. Add a very brief README.md
describing that this is a repository for such-and-such assignment. (You can also use this README to further document your workflow, but you don’t have to.) [0.5]
Copy the FASTQ files from /fs/ess/PAS1855/data/week05/fastq
into a directory data/fastq/
inside your assignment’s directory. [0.5]
Create a .gitignore
file and add a line to make Git ignore all .fastq
files. [0.5]
Load the Conda module at OSC and create a Conda environment for Cutadapt following these instructions (i.e. just the section “Installation with Conda”). [1]
To be able to install Cutadapt with the command provided in that link, you first need to do the following general Conda setup steps, which we will (have) do(ne) together in class:
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge
(Adding the above lines in that order will result in the different channels having the proper priority, with conda-forge
having the highest priority.)
If you run into installation problems that you can’t seem to solve, you can can also use my Conda environment in Step 7, below, as follows
$ source activate users/PAS0471/jelmer/.conda/envs/cutadaptenv
But do this only if needed, and if you do, include a description of your errors in your README file.
.yml
) file. [0.5]Now, you will write a script called cutadapt_single.sh
that runs Cutadapt for one pair of FASTQ files: a file with forward (R1
) reads and a file with reverse (R2
) reads for the same sample.
The following instructions all refer to what you should write inside the script:
Start with the shebang line followed by SLURM directives. Specify at least the following SLURM directives [0.5]
PAS1855
.Next, include the familiar set
settings for robust bash scripts, load OSC’s Conda module and then activate your own Cutadapt Conda environment. [0.5]
Use source activate
and not conda activate
to activate the Conda environment, otherwise Conda will complain about your shell not being properly set up.
Let the script take 4 arguments that can be passed to it on the command-line: [1]
data/fastq/201-S4-V4-V5_S53_L001_R1_001.fastq
).results/trim
).GAGTGYCAGCMGCCGCGGTAA
in this case).TTACCGCGGCKGCTGRCACTC
in this case).Give each of these variables a descriptive name rather than directly using the placeholder variables ($1
, etc) – that will make your life easier when writing the rest of the script.
Compute the reverse complement for each primer. The hint below in fact has the answer, but please take a moment to think how you might do this with tr
and a new command called rev
that simply reverses a string.
Adjust your variable names as necessary!
primer_f_rc=$(echo "$primer_f" | tr ATCGYRKMBVDH TAGCRYMKVBHD | rev)
primer_r_rc=$(echo "$primer_r" | tr ATCGYRKMBVDH TAGCRYMKVBHD | rev)
Note that we needed to not only translate ACTG
but also all IUPAC ambiguity codes.
From the file name of the input FASTQ file with forward reads (which is one of the arguments to the script), infer the name of the corresponding FASTQ file with reverse reads, which will have an identical name except that _R1_
is replaced by _R2_
. [0.5]
Assign output file paths (output dir + file name) for the R1
and R2
output file, inserting _trimmed
before the .fastq
file extension (that is, the output file paths should be along the lines of <output-dir>/<old-file-basename>_trimmed.fastq
). [1]
You’ll do yourself a favor by having the script echo
the file names that you have assigned, so you can easily check if you’re doing this right.
Moreover, I recommend that you test interactively whether you’re getting all the file names right: assign one of the paths to the actual FASTQ files to the variable name you’re using for that, and ditto with an output dir. Just take care that these assignments don’t end up in your (final) script.
Create the output directory if it doesn’t already exist. [0.5]
The actual call to the Cutadapt program should be as follows – just change any variable names as needed:
$ cutadapt -a "$primer_f"..."$primer_r_revcomp" \
-A "$primer_r"..."$primer_f_revcomp" \
--discard-untrimmed --pair-filter=any \
-o "$R1_out" -p "$R2_out" "$R1_in" "$R2_in"
Optional (ungraded): Touch up the scripts with additional echo
statements, date
commands, and tests such as whether 4 arguments were provided.
Submit the script as a SLURM job for one pair of FASTQ files – don’t forget to provide it with the appropriate arguments. Check the SLURM log file and the output files. If it didn’t work, troubleshoot until you get it working. [2]
Do any necessary cleaning up of files, e.g. move your SLURM log file to an appropriate place, and make sure everything is committed to the Git repository. (I’ll need to see the SLURM log file in your repository to see if the script worked.) [0.5]
Create a GitHub repository and push your local Git repository to GitHub. Like last time, start an issue and in the issue, tag @jelmerp
. [0.5]
Creating a script like we did above is worth the trouble mostly if we plan to run it for multiple/many samples. Now, you will create a second script cutadapt_submit.sh
that loops over all FASTQ files in a specified directory. It doesn’t need to be a “proper” script with a robust header and so on, and shouldn’t contain any SLURM directives: this script merely functions to submit SLURM jobs and can be run interactively.
.fastq
files with R1
in the name in the input directory (recall: we don’t want to loop over the R2
files explicitly, because they will be automatically included in our previous script).cutadapt_single.sh
script should be submitted as a SLURM job, similar to your single submission of the script above.
Initially, there was an error in the primer sequences provided in last week’s exercises, which has been corrected on Friday, Feb 12.↩︎
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".