You can download this page as an .Rmd (R Markdown) file by clicking the Code button in the top-right of this page. To convert an Rmd file to an R script, type the following in an R console: knitr::purl(input="<filename>.Rmd").
Log in to OSC at https://ondemand.osc.edu. Then, open a terminal in your browser by clicking Clusters > Pitzer Shell Access. (Or use this direct link.)
Go to your directory for this workshop at OSC:
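For example, using the workshop directory path shown further down this page ($USER stands in for your OSC username):
cd /fs/project/PAS0471/workshops/2020-12_micro/$USER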
cutadapt is a software package for removing adapters and primers that may be present in raw sequence data.
We will use the output from cutadapt as input for the next step in our workflow, ASV Inference and Taxon Assignment.
These are the primers that were used in this amplicon sequencing experiment. We will specify these primers when running cutadapt, in order to remove them from the fastq files.
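For reference, here are the primer sequences as they will be defined in the script below:
primer_f=GAGTGYCAGCMGCCGCGGTAA   # Forward primer
primer_r=ACGGACTACNVGGGTWTCTAAT  # Reverse primer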
For this tutorial, the original fastq files have been subsampled to about 1/3 of their original size to shorten the time needed for computation.
The conda environment with cutadapt
cutadapt is not available as a module at OSC. However, it can easily be installed there through conda; see this webpage. We have already done this for you to save some time during this tutorial, so we just need to load the conda environment.
To work with conda at OSC, we first need to load the conda module:
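This is the same module load command that also appears in the full script further down this page:
module load python/3.6-conda5.2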
Assuming that you haven’t worked with conda at OSC before, we also need to do a bit of additional one-time setup:
# This will add a line to your bash config file, which runs every time you start
# a shell, to run the conda setup script:
echo ". /apps/python/3.6-conda5.2/etc/profile.d/conda.sh" >> ~/.bashrc
# Next, we source (run) the config file in our current shell to also run the conda
# setup script right now:
source ~/.bashrc
Then, we activate our conda environment for cutadapt:
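Using the environment path from the full script below:
conda activate /users/PAS0471/osu5685/.conda/envs/cutadaptenv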
Now, cutadapt should be in $PATH, i.e. we can simply call it by its name to run it. To test this:
cutadapt --version # Will just print the version number
cutadapt --help # Will print a whole lot of documentation
A script to run cutadapt
Our first line, #!/bin/bash, is the “shebang” line, which tells the system what program (language) should run our script: bash in this case.
We’ll provide the SLURM directives one line at a time, using the #SBATCH keyword followed by an option: nodes for the number of nodes, ntasks-per-node for the number of cores, time for the maximum walltime of the job (here specified in minutes), and account for the OSC project that should be billed.
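These are the directives as they appear in the full script below:
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=60
#SBATCH --account=PAS0471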
We’ll set a couple of options with set that make bash run more safely, by making it stop the script whenever an error is encountered (by default, the script would just keep running).
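From the full script below:
set -e -u -o pipefail # Run bash in "safe mode", basically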
We’ll use an echo statement to print to screen what script we are running, and the date command to keep track of the script’s runtime. (The -e option to echo allows us to print newlines using \n.)
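From the full script below:
echo -e "\n## Starting cutadapt script."
date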
So-called “positional” arguments, which are passed to a script using the syntax ./script.sh arg1 arg2, are represented inside the script as the variables $1, $2, etc. First, we will give these variables more informative names.
We’ll create the output directory if it doesn’t already exist. With the -p option, mkdir can create multiple directory levels at once, and it will not complain if the directory already exists.
We’ll report the directory variables to screen, which will be helpful if, for example, we need to debug our script.
We’ll save the primer sequences as variables.
# SETUP --------------------------------------------------------
# 1. Command-line arguments:
indir=$1
outdir=$2
# 2. Create output directory if it doesn't already exist:
mkdir -p $outdir
# 3. Report:
echo "## Input dir: $indir"
echo "## Output dir: $outdir"
# 4. Define primers:
primer_f=GAGTGYCAGCMGCCGCGGTAA
primer_r=ACGGACTACNVGGGTWTCTAAT
primer_f_rc=TTACCGCGGCKGCTGRCACTC
primer_r_rc=ATTAGAWACCCBNGTAGTCCGT
Loop through the fastq files and run cutadapt for each pair
We will run cutadapt separately for each sample, using a for loop to cycle through all the files. This is made a little more complicated because we have two files for each sample: one with forward reads (which has “R1” in its filename) and one with reverse reads (“R2”).
We’ll initialize the for loop, which is done using for x in ... syntax. We’ll loop over all the “R1” fastq files, i.e. the ones with the forward reads. We can list all of those by taking advantage of the * wildcard.
Now, $R1 (defined using R1 in the loop initialization line) contains, in each iteration of the loop, a single file name. Because it also contains the directory name, we will extract just the file name using the basename command. The $() notation is called command substitution, and simply allows us to assign the output of a command to a variable.
To get the corresponding file with reverse reads (which should always have the exact same name, except containing “R2” rather than “R1”), we simply substitute “R1” with “R2” in the file name using a little parameter substitution trick with the syntax ${variable_name/pattern/replacement}.
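A quick illustration of this syntax, again with a made-up file name:
file=sampleA_R1_001.fastq.gz
echo ${file/_R1_/_R2_}   # Prints: sampleA_R2_001.fastq.gz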
To make sure all is well, we list the R1 and R2 input files.
Now, we’ll do the primer removal using cutadapt. We use \ just to allow the command to continue across multiple lines for ease of reading (the \ “escapes” the return/newline).
We indicate the end of the loop with the keyword done.
# RUN CUTADAPT --------------------------------------------------------
# 1. Initialize the loop:
for R1 in $indir/*_R1_*.fastq.gz
do
# 2. Get the filenames for the fastq files with forward (R1) and reverse (R2) reads:
R1=$(basename $R1) # `basename` will strip the directory from the name
# 3. Get the R2 filename by substituting "_R1_" with "_R2_" in the R1 filename:
R2=${R1/_R1_/_R2_}
# 4. Report input files:
echo "## R1 input file:"
ls -lh $indir/$R1
echo "## R2 input file:"
ls -lh $indir/$R2
# 5. Run cutadapt:
echo -e "\n\n## Running cutadapt..."
cutadapt -a $primer_f...$primer_r_rc -A $primer_r...$primer_f_rc \
--discard-untrimmed --pair-filter=any \
-o $outdir/$R1 -p $outdir/$R2 $indir/$R1 $indir/$R2
# 6. Exit the loop using the `done` keyword
done
Finally, we’ll list our output files, as another way to check whether the script ran as it was supposed to, and we’ll also print the date again to check the (date and) time at which our script finished running.
The entire script is shown below.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --time=60
#SBATCH --account=PAS0471
set -e -u -o pipefail # Run bash in "safe mode", basically
# Report:
echo -e "\n## Starting cutadapt script."
date
# SETUP --------------------------------------------------------
# Command-line arguments:
indir=$1
outdir=$2
# Create output directory if it doesn't already exist:
mkdir -p $outdir
# Report:
echo "## Input dir: $indir"
echo "## Output dir: $outdir"
# Primers:
primer_f=GAGTGYCAGCMGCCGCGGTAA
primer_r=ACGGACTACNVGGGTWTCTAAT
primer_f_rc=TTACCGCGGCKGCTGRCACTC
primer_r_rc=ATTAGAWACCCBNGTAGTCCGT
# Software - making doubly sure this is all loaded:
module load python/3.6-conda5.2 # Load conda module
. /apps/python/3.6-conda5.2/etc/profile.d/conda.sh # Setup conda (should be superfluous)
conda activate /users/PAS0471/osu5685/.conda/envs/cutadaptenv # Activate cutadapt environment
# RUN CUTADAPT --------------------------------------------------------
echo -e "\n## Looping through input files...\n"
for R1 in $indir/*_R1_*.fastq.gz
do
R1=$(basename $R1)
R2=${R1/_R1_/_R2_}
# Report input files:
echo "## R1 input file:"
ls -lh $indir/$R1
echo "## R2 input file:"
ls -lh $indir/$R2
# Trim:
echo -e "\n\n## Running cutadapt..."
cutadapt -a $primer_f...$primer_r_rc -A $primer_r...$primer_f_rc \
--discard-untrimmed --pair-filter=any \
-o $outdir/$R1 -p $outdir/$R2 $indir/$R1 $indir/$R2
# Options:
# "-a"/"-A": Primers for R1/R2
# "--discard-untrimmed": Remove pairs with no primer found
# "--pair-filter=any": Remove pair if one read is filtered (=Default)
echo -e "\n\n ------------------------------------------------------------\n"
done
# REPORT AND FINALIZE --------------------------------------------------------
echo -e "\n## Listing output files:"
ls -lh $outdir
echo -e "\n## Done with cutadapt script."
date
We’ll first check whether we are in the correct directory, so that our relative paths will work:
# Let's check we are in the correct directory:
pwd
# Should return: /fs/project/PAS0471/workshops/2020-12_micro/<your-username>
# If not, run: "cd /fs/project/PAS0471/workshops/2020-12_micro/$USER"
Then we assign our input and output directories to the variables indir and outdir, respectively.
We’ll also assign the name we want to give to the SLURM log file to a variable. Here, we use the %j SLURM keyword to include the SLURM job number in the file name.
indir=data/raw/fastq_subsample
outdir=data/processed/fastq_trimmed
# Give a name to the SLURM output file:
# (The -o flag sets the SLURM log file name, and `%j` represents the job ID)
slurm_file=slurm-cutadapt-$USER-%j.out
Finally, we’re ready to submit the script!
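The exact submission command isn’t shown in this excerpt, but given the variables we just defined and the script path visible in the scontrol output below, it should look something like this (treat the scripts/01-cutadapt.sh path as an assumption):
sbatch -o $slurm_file scripts/01-cutadapt.sh $indir $outdir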
We can check what’s going on with our job using squeue:
# Initially the job may be queued ("PD" in the "ST" (status) column):
squeue -u $USER
# JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
# 2526085 serial-40 01-cutad jelmer PD 0:00 1 (None)
# Then the job should be running ("R" in the "ST" column):
squeue -u $USER
# JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
# 2526085 serial-40 01-cutad jelmer R 0:02 1 p0002
# When the job has finished, squeue will return nothing (!):
squeue -u $USER
# JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
We could also get more statistics about our job using scontrol:
scontrol show job 2526085 # Replace the number with your JOBID
# UserId=jelmer(33227) GroupId=PAS0471(3773) MCS_label=N/A
# Priority=200005206 Nice=0 Account=pas0471 QOS=pitzer-default
# JobState=RUNNING Reason=None Dependency=(null)
# Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
# RunTime=00:02:00 TimeLimit=01:00:00 TimeMin=N/A
# SubmitTime=2020-12-14T14:32:44 EligibleTime=2020-12-14T14:32:44
# AccrueTime=2020-12-14T14:32:44
# StartTime=2020-12-14T14:32:47 EndTime=2020-12-14T15:32:47 Deadline=N/A
# SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-12-14T14:32:47
# Partition=serial-40core AllocNode:Sid=pitzer-login01:57954
# ReqNodeList=(null) ExcNodeList=(null)
# NodeList=p0002
# BatchHost=p0002
# NumNodes=1 NumCPUs=1 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
# TRES=cpu=1,mem=4556M,node=1,billing=1,gres/gpfs:project=0
# Socks/Node=* NtasksPerN:B:S:C=1:0:*:1 CoreSpec=*
# MinCPUsNode=1 MinMemoryCPU=4556M MinTmpDiskNode=0
# Features=(null) DelayBoot=00:00:00
# OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
# Command=/fs/project/PAS0471/workshops/2020-12_micro/master/scripts/01-cutadapt.sh data/raw/fastq_subsample data/processed/fastq_trimmed
# WorkDir=/fs/project/PAS0471/workshops/2020-12_micro/master
# Comment=stdout=/fs/project/PAS0471/workshops/2020-12_micro/master/slurm-cutadapt-jelmer-2526085.out
# StdErr=/fs/project/PAS0471/workshops/2020-12_micro/master/slurm-cutadapt-jelmer-2526085.out
# StdIn=/dev/null
# StdOut=/fs/project/PAS0471/workshops/2020-12_micro/master/slurm-cutadapt-jelmer-2526085.out
# Power=
# TresPerNode=gpfs:project:1
# MailUser=jelmer MailType=NONE
Let’s see if we have a log file:
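For instance, using the file name pattern we chose above:
ls -lh slurm-cutadapt-$USER-*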
Let’s look at the start of the SLURM log file:
head -n 20 slurm-cutadapt-jelmer-2526085.out
# ## Starting cutadapt script.
# Mon Dec 14 14:32:48 EST 2020
# ## Input dir: data/raw/fastq_subsample
# ## Output dir: data/processed/fastq_trimmed
#
# ## Looping through input files...
#
# ## R1 input file:
# -rw-r--r-- 1 jelmer PAS0471 3.1M Dec 13 14:10 data/raw/fastq_subsample/101-S1-V4-V5_S1_L001_R1_001.fastq.gz
# ## R2 input file:
# -rw-r--r-- 1 jelmer PAS0471 3.8M Dec 13 14:10 data/raw/fastq_subsample/101-S1-V4-V5_S1_L001_R2_001.fastq.gz
#
#
# ## Running cutadapt...
# This is cutadapt 3.1 with Python 3.8.6
# Command line parameters: -a GAGTGYCAGCMGCCGCGGTAA...ATTAGAWACCCBNGTAGTCCGT -A ACGGACTACNVGGGTWTCTAAT...TTACCGCGGCKGCTGRCACTC --discard-untrimmed --pair-filter=any -o data/processed/fastq_trimmed/101-S1-V4-V5_S1_L001_R1_001.fastq.gz -p data/processed/fastq_trimmed/101-S1-V4-V5_S1_L001_R2_001.fastq.gz data/raw/fastq_subsample/101-S1-V4-V5_S1_L001_R1_001.fastq.gz data/raw/fastq_subsample/101-S1-V4-V5_S1_L001_R2_001.fastq.gz
# Processing reads on 1 core in paired-end mode ...
# Finished in 2.60 s (113 µs/read; 0.53 M reads/minute).
Let’s also look at the end of the file:
tail slurm-cutadapt-jelmer-2526085.out
# -rw-r--r-- 1 jelmer PAS0471 2.2M Dec 14 14:35 604-S4-V4-V5_S72_L001_R2_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 398K Dec 14 14:35 blankM-V4-V5_S73_L001_R1_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 486K Dec 14 14:35 blankM-V4-V5_S73_L001_R2_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 23K Dec 14 14:35 H20-V4-V5_S88_L001_R1_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 26K Dec 14 14:35 H20-V4-V5_S88_L001_R2_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 1.4M Dec 14 14:35 Zymo-V4-V5_S82_L001_R1_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 1.7M Dec 14 14:35 Zymo-V4-V5_S82_L001_R2_001.fastq.gz
#
# ## Done with cutadapt script.
# Mon Dec 14 14:35:19 EST 2020
Finally, we can directly check the output dir:
ls -lh data/processed/fastq_trimmed/ # Or use $outdir
# total 178M
# -rw-r--r-- 1 jelmer PAS0471 3.0M Dec 14 14:32 101-S1-V4-V5_S1_L001_R1_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 3.6M Dec 14 14:32 101-S1-V4-V5_S1_L001_R2_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 3.7M Dec 14 14:33 101-S4-V4-V5_S49_L001_R1_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 4.5M Dec 14 14:33 101-S4-V4-V5_S49_L001_R2_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 2.6M Dec 14 14:33 102-S4-V4-V5_S50_L001_R1_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 3.1M Dec 14 14:33 102-S4-V4-V5_S50_L001_R2_001.fastq.gz
# -rw-r--r-- 1 jelmer PAS0471 2.4M Dec 14 14:33 103-S4-V4-V5_S51_L001_R1_001.fastq.gz
# ....[other files not shown]
All done! Our next step will be to further process these primer-trimmed fastq files in the ASV Inference and Taxon Assignment step of our workflow.