Using Software at OSC
Loading existing modules and installing and using software with conda
So far, we have only used commands that are available in any Unix shell. But to actually analyze genomics data sets, we also need to use specialized bioinformatics software.
Most software that is already installed at OSC must nevertheless be “loaded” (“activated”) before we can use it, and if our software of choice is not installed, we have to install it ourselves. We will cover those topics in this module.
1 Setup
2 Running command-line programs
As pointed out in the introduction to the workshop, the bioinformatics programs that we use to analyze genomic data are typically run from the command line. That is, they have “command-line interfaces” (CLIs) rather than “graphical user interfaces” (GUIs), and are run using commands that are structurally very similar to the basic Unix commands we have been using.
For instance, we can run the program FastQC as follows, instructing it to process the FASTQ file sampleA.fastq.gz with default options:
fastqc sampleA.fastq.gz # Don't run
So, with all the scaffolding we have learned in the previous modules, we only need to make small modifications to have our scripts run command-line programs. But, we first need to load and/or install these programs.
3 Software at OSC with Lmod
OSC administrators manage software with the Lmod system of software modules. For us users, this means that even though a lot of software is installed, most of it can only be used after we explicitly load it.
(That may seem like a drag, but on the upside, this practice enables the use of different versions of the same software, and of mutually incompatible software on a single system.)
3.1 Checking for available software
The OSC website has a list of software that has been installed at OSC. You can also search for available software in the shell using two subtly different commands:
- module spider lists modules that are installed.
- module avail lists modules that can be directly loaded, given the current environment (i.e., depending on which other software has been loaded).
Simply running module spider or module avail would spit out complete lists. More usefully, we can add search terms as arguments to these commands:
module spider python
python: |
Versions:
python/2.7-conda5.2
python/3.6-conda5.2
python/3.7-2019.10
module avail python
python/2.7-conda5.2 python/3.6-conda5.2 (D) python/3.7-2019.10
3.2 Loading software
All other Lmod software functionality is also accessed using module “subcommands” (we call module the command and e.g. spider the subcommand). For instance, to load a module:
# Load a module:
module load python # Load the default version
module load python/3.7-2019.10 # Load a specific version (copy from module spider output)
To check which modules have been loaded (the list includes automatically loaded modules):
module list
Currently Loaded Modules:
1) xalt/latest 2) gcc-compatibility/8.4.0 3) intel/19.0.5 4) mvapich2/2.3.3 5) modules/sp2020
3.3 A practical example
Let’s load a very commonly used bioinformatics program that we will also use in examples later on: FastQC. FastQC performs quality control (hence: “QC”) on FASTQ files.
First, let’s test that we indeed cannot currently use fastqc by running fastqc with the --help flag:
fastqc --help
bash: fastqc: command not found
Next, let’s check whether FastQC is available at OSC, and if so, in which versions:
module avail fastqc
fastqc/0.11.8
There is only one version available (0.11.8), which means that module load fastqc and module load fastqc/0.11.8 would each load that same version.
Let’s load the FastQC module:
module load fastqc/0.11.8
After we have loaded the module, we can retry our --help attempt:
fastqc --help | head # I'm piping into head to avoid pages worth of output
FastQC - A high throughput sequence QC analysis tool
SYNOPSIS
fastqc seqfile1 seqfile2 .. seqfileN
fastqc [-o output dir] [--(no)extract] [-f fastq|bam|sam]
[-c contaminant file] seqfile1 .. seqfileN
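The before-and-after check in this example (first “command not found”, then a working --help) can also be done without triggering an error. Below is a minimal sketch that relies only on the standard shell built-in command -v; the function names is_available and report_program are made up for illustration and are not part of Lmod.

```shell
# Return success (exit status 0) if the named program can be found on the PATH
is_available() {
    command -v "$1" >/dev/null 2>&1
}

# Print a one-line report for a program; the function name is hypothetical
report_program() {
    if is_available "$1"; then
        echo "$1: found at $(command -v "$1")"
    else
        echo "$1: not found (is the module loaded?)"
    fi
}

# "fastqc" is simply the example program from the text
report_program fastqc
```

Before module load fastqc/0.11.8, this should print the “not found” message; after loading the module, it should print the location of the fastqc executable.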
4 When software isn’t installed at OSC
It’s not too uncommon that software you need for your project is not installed at OSC, or that you need a more recent version of the software than is available. The main options available to you in such a case are to:
- “Manually” install the software, which in the best case involves downloading a directly functioning binary (executable), but more commonly requires you to “compile” (build) the program. This is sometimes straightforward but can also become extremely tricky, especially at OSC, where you don’t have “administrator privileges” [1] and will often have difficulties with “dependencies” [2].
- Send an email to OSC Help. They might be able to help you with your installation, or in case of commonly used software, might be willing to perform a system-wide installation (that is, making it available through module).
- Use Apptainer/Singularity “containers”. Containers are self-contained software environments that include operating systems, akin to mini virtual machines.
- Use conda, which creates software environments that are activated much like modules in the module system.
Conda and containers are useful not only at OSC, where they bypass issues with dependencies and administrator privileges, but more generally, for reproducible and portable software environments. They also allow you to easily maintain distinct “environments”, each with a different version of the same software, or with mutually incompatible software.
We will teach conda here because it is easier to learn and use than containers, and because nearly all open-source bioinformatics software is available as a conda package.
5 Using conda
Conda creates so-called environments in which you can install one or more software packages. As mentioned above, these environments are activated and deactivated in a similar manner as with the Lmod system, but the key difference is that we can create and manage these environments ourselves.
5.1 Loading the (mini)conda module
While it is also fairly straightforward to install conda for yourself [3], we will use OSC’s system-wide installation of conda in this workshop. Therefore, we first need to use a module load command to make it available:
# (The most common installation of conda is actually called "miniconda")
module load miniconda3
5.2 One-time conda configuration
We will also do some one-time configuration, which will set the conda “channels” (basically, software repositories) that we want to use when we install software. This config also includes setting relative priorities among channels, since one software package may be available from multiple channels.
Like with module commands, conda commands consist of two parts: the conda command itself and a subcommand, such as config:
conda config --add channels defaults # Added first => lowest priority
conda config --add channels bioconda
conda config --add channels conda-forge # Added last => highest priority
Let’s check whether this configuration step worked:
conda config --get channels
--add channels 'defaults'   # lowest priority
--add channels 'bioconda'
--add channels 'conda-forge'   # highest priority
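The “added last => highest priority” behavior follows from the fact that each conda config --add call puts the new channel at the front of the channel list. The sketch below is only a toy model of that ordering in plain shell. It does not call conda, and the variable and function names (channels, add_channel) are made up for illustration.

```shell
# Toy model: a space-separated channel list, highest priority first
channels=""

# Put a new channel at the FRONT of the list, like `conda config --add channels`
add_channel() {
    channels="$1${channels:+ $channels}"
}

add_channel defaults      # added first => ends up last (lowest priority)
add_channel bioconda
add_channel conda-forge   # added last => ends up first (highest priority)

echo "$channels"          # prints: conda-forge bioconda defaults
```

This is why the order of the three conda config commands matters: swapping them would change which channel conda prefers when a package is available from more than one.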
5.3 Example: Creating an environment for cutadapt
To practice using conda, we will now create a conda environment with the program cutadapt installed.
cutadapt is a commonly used program to remove adapters or primers from sequence reads in FASTQ files. In particular, it is ubiquitous for primer removal in (e.g. 16S rRNA) microbiome metabarcoding studies. But there is no Lmod module on OSC for it, so if we want to use it, our best option is to resort to conda.
Here is the command to create a new environment and install cutadapt into that environment:
conda create -y -n cutadapt -c bioconda cutadapt # Don't run this
Let’s break the above command down:
- create is the conda subcommand to create a new environment.
- -y is a flag that prevents us from being asked to confirm installation.
- Following the -n option, we can specify the name of the environment, so -n cutadapt means that we want our environment to be called cutadapt. We can use whatever name we like for the environment, but of course a descriptive yet concise name is a good idea. Since we are making a single-program environment, it makes sense to simply name it after the program.
- Following the -c option, we can specify a channel from which we want to install, so -c bioconda indicates we want to use the bioconda channel. (Given that we’ve done some config above, this is not always necessary, but it can be good to be explicit.)
- The cutadapt at the end of the line simply tells conda to install the package of that name. This is a “positional” argument to the command (note that there’s no option like -s before it): we put any software package(s) we want to install at the end of the command.
Specifying a version
If we want to be explicit about the version we want to install, we can add the version after = following the package name. We do that below, and we also include the version in the environment name.
Let’s run the command above, now with the version specified, and see if we can install cutadapt:
conda create -y -n cutadapt-4.1 -c bioconda cutadapt=4.1
Collecting package metadata (current_repodata.json): done
Solving environment:
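The naming convention used here (package name plus version as the environment name) is easy to keep consistent if the command is assembled from variables. The following is a small sketch: the variable names are my own, and it only prints the command rather than running it, so it is safe to try anywhere.

```shell
# Values taken from the example in the text
pkg="cutadapt"
version="4.1"
channel="bioconda"

# Environment name = package name + version, e.g. "cutadapt-4.1"
env_name="${pkg}-${version}"

# Assemble the full command as text; nothing is installed here
create_cmd="conda create -y -n ${env_name} -c ${channel} ${pkg}=${version}"
echo "$create_cmd"
```

The printed line is the same conda create command as shown above, and changing pkg or version updates the environment name and the version specification together.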
5.4 Activating conda environments
Whereas we use the term “load” for Lmod modules, we use “activate” for conda environments; it means the same thing. Oddly enough, the most foolproof way to activate a conda environment is to use source activate rather than the expected conda activate. For instance:
source activate cutadapt-4.1
(cutadapt-4.1) [jelmer@pitzer-login03 PAS2250]$
After we have activated the cutadapt-4.1 environment, we should be able to actually use the program. To test this, we’ll again simply run it with the --help option:
cutadapt --help | head # I'm piping into head to avoid pages worth of output
cutadapt version 4.1
Copyright (C) 2010-2022 Marcel Martin marcel.martin@scilifelab.se
cutadapt removes adapter sequences from high-throughput sequencing reads.
Usage:
cutadapt -a ADAPTER [options] [-o output.fastq] input.fastq
For paired-end reads:
5.5 Creating an environment for any program
Minor variations on the conda create command above can be used to install almost any program for which a conda package is available. However, you may be wondering how we would know:
- Whether the program is available and what its conda package’s name is
- Which conda channel we should use
- Which versions are available
My strategy for finding these things out is to simply Google the program name together with “conda”, e.g. cutadapt conda.
Let’s see that in action:
We click on that first link (it should always be the first Google hit):
I always take the top of the two example installation commands as a template, here: conda install -c bioconda cutadapt
You may notice the install subcommand, which we haven’t yet seen. This would install Cutadapt into the currently activated conda environment. Since our strategy here (and my general strategy) is to create a new environment each time you’re installing a program, the all-in-one command is to replace install by create -y -n <env-name>.
Then, our full command (without version specification) again becomes:
conda create -y -n cutadapt -c bioconda cutadapt
To see which version will be installed by default, and to see which older versions are available:
For almost any other program, this works exactly the same!
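As an alternative to looking up versions in the browser, conda’s own search subcommand can list the versions of a package that a channel offers. This is not covered in the text above, so treat the sketch below as a suggestion; the wrapper function list_versions is a made-up name, and running it requires conda to be loaded as well as network access.

```shell
# List available versions of a package (argument 1) in a channel
# (argument 2, defaulting to bioconda). The function name is hypothetical.
list_versions() {
    if command -v conda >/dev/null 2>&1; then
        conda search -c "${2:-bioconda}" "$1"
    else
        echo "conda not found; run 'module load miniconda3' first" >&2
        return 1
    fi
}
```

For example, after module load miniconda3, calling list_versions cutadapt should print one line per available build, including its version number.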
5.6 Lines to add to your Bash script
While you’ll typically want to do installation interactively and only need to do it once (see note below), you should always include the necessary code to load/activate your programs in your shell scripts.
When your program is in an Lmod module, this simply entails a module load call, e.g. for fastqc:
#!/bin/bash
set -ueo pipefail
# Load software
module load fastqc
When your program is available in a conda environment, this entails a module load command to load conda itself, followed by a source activate command to load the relevant conda environment:
#!/bin/bash
# Load software
module load miniconda3
source activate cutadapt-4.1
# Strict/safe Bash settings
set -ueo pipefail
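Note that in the conda version of the script, the strict settings come after source activate, whereas in the Lmod version they come first. The text does not say why. One plausible reason, which is an assumption on my part, is that activation scripts may read variables that have not been set, and that is an error under set -u. The sketch below imitates that situation with a made-up function (fake_activate) and a made-up variable name; it does not call conda at all.

```shell
# Hypothetical stand-in for an activation script that reads an unset variable
unset MADE_UP_UNSET_VARIABLE
fake_activate() {
    echo "prefix: ${MADE_UP_UNSET_VARIABLE}"
}

# With `set -u` (in a subshell, so this shell's settings stay untouched),
# expanding the unset variable is an error and the subshell exits non-zero
if (set -u; fake_activate) >/dev/null 2>&1; then
    strict_result="ok"
else
    strict_result="failed"
fi

# Without `set -u`, the unset variable just expands to an empty string
if (set +u; fake_activate) >/dev/null 2>&1; then
    lenient_result="ok"
else
    lenient_result="failed"
fi

echo "with set -u: ${strict_result}; without set -u: ${lenient_result}"
```

Run as a script, this reports that the call fails with set -u and succeeds without it, which is consistent with activating the environment first and turning on the strict settings afterwards.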
6 Addendum: a few other useful conda commands
Deactivate the currently active conda environment:
conda deactivate
Activate one environment and then “stack” an additional environment (a regular conda activate command would switch environments):
source activate cutadapt # Now, the env "cutadapt" is active
conda activate --stack multiqc # Now, both "cutadapt" and "multiqc" are active
Remove an environment entirely:
conda env remove -n cutadapt
List all your conda environments:
conda env list
List all packages (programs) installed in an environment:
conda list -n cutadapt
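The output of conda env list can also be used inside a script, for instance to check whether an environment already exists before creating it. Below is a minimal sketch, assuming the usual layout of that output (comment lines starting with #, then one line per environment with the name in the first column). The function name env_in_list and the sample listing, including its paths, are made up for illustration.

```shell
# Succeed if an environment name (argument 1) appears in `conda env list`
# output that is read from standard input
env_in_list() {
    awk -v name="$1" '
        /^#/ { next }                  # skip the comment and header lines
        $1 == name { found = 1 }       # the first column holds the env name
        END { exit !found }
    '
}

# Made-up sample output, for illustration only (the paths are hypothetical)
sample_list='# conda environments:
#
base                  *  /hypothetical/path/miniconda3
cutadapt-4.1             /hypothetical/path/envs/cutadapt-4.1'

if printf '%s\n' "$sample_list" | env_in_list cutadapt-4.1; then
    echo "cutadapt-4.1 exists"
fi
```

On a real system, the sample text would be replaced by the actual command, as in: conda env list | env_in_list cutadapt-4.1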
Footnotes
[1] When your personal computer asks you to “authenticate” while you are installing something, you are authenticating yourself as a user with administrator privileges. At OSC (and for OSU-managed personal computers, too!), you don’t have such privileges.
[2] Other software upon which the software that you are trying to install depends.
[3] And this is certainly worth considering if you find yourself using conda a lot, because the conda version at OSC is quite out-of-date.