class:inverse middle center ## *Week 6: OSC jobs with SLURM* ---- # Part I: <br> An introduction to using OSC <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/02/16 --- ## Overview of this week Now that you have a solid basis in working in the shell and in shell scripting, it's time to learn about OSC (beyond OnDemand) to enable you to run analyses on the cluster. - Today: An introduction to using OSC, including software usage and the SLURM resource manager. - Thursday: SLURM jobs in practice. --- class: inverse middle center # Overview of this presentation ---- .left[ - ### [Introduction](#introduction) - ### [Connecting to OSC](#connect) - ### [OSC file systems](#filesystems) - ### [Using software at OSC](#software) - ### [Compute job options](#jobs) ] --- class: inverse middle center name: introduction # Introduction ----- <br><br><br><br> --- ## What is the OSC? - The **Ohio Supercomputer Center (OSC)** provides computing resources across the state of Ohio (it is not a part of OSU). - OSC has two supercomputers and lots of infrastructure for their usage. -- <br> - Research usage is charged but via institutions at [heavily subsidized rates](https://www.osc.edu/content/academic_fee_model_faq). - Educational usage, like our `PAS1855` project, is free! --- ## Supercomputer? - A highly interconnected set of many processors and storage units. - Also known as: - A **cluster** - A High-Performance Computing cluster (**HPC cluster**) <br> -- ### When do we benefit from using a supercomputer? - Our dataset is too large to be handled efficiently, or even at all, by our computer. - Long-running analyses can be sped up by using more computing power. - We need to repeat a computation many times. --- ## Learning about OSC - OSC regularly has online introductory sessions, both overviews and more hands-on sessions – see the [OSC Events page](https://www.osc.edu/events). -- - OSC provides excellent introductory material at their [Getting Started Page](https://www.osc.edu/resources/getting_started). <p align="center"> <img src=img/gettingstarted.png width="560"> </p> --- ## Learning about OSC (cont.) - Also have a look at all the ["HOWTO" pages](https://www.osc.edu/resources/getting_started/howto) – very specific and includes more advanced material. <br> <p align="center"> <img src=img/HOWTOs.png width="500"> </p> --- ## Terminology: cluster and node #### Cluster - Many processors and storage units tied together. - OSC currently has two clusters: **_Pitzer_** and **_Owens_**. (Largely separate, but same long-term storage space is accessed by both.) <br> -- #### Node - Essentially a powerful computer – one of many in a cluster. - **Login node**: A node you end up on after logging in, which is not meant for computing. *(A few per cluster.)* - **Compute node**: A node for compute jobs. *(Hundreds per cluster.)* --- ## Terminology: core #### Core - One of often many *processors* in a node – e.g. standard *Pitzer* nodes have 40-48 cores. -- - Each core can run and be reserved independently for jobs. - But, using multiple cores for a job is also common, to: - Parallelize computations across cores. - Access more memory. --- ## Terminology: cluster > node > core <br> <p align="center"> <img src=img/cluster-node-core.png width="100%"> </p> --- class:inverse middle center name:connect # Connecting to OSC ----- <br><br><br><br> --- ## Connecting to OSC with "OnDemand" - You can make use of OSC not only through `ssh` at the command line, but also through the web browser: from https://ondemand.osc.edu. <p align="center"> <img src=img/ondemand1.png width="900"> </p> -- <br> - At OnDemand, you can, for example: - Access a terminal directly in your browser. - Run "Interactive Apps" like RStudio, Jupyter Notebooks, and VS Code. - Browse and upload files. .content-box-diy[ **Interface Demo** ] --- ## Connecting to OSC with SSH (Secure Shell) - If you want to log in from your local terminal, you need to **connect through SSH.** - Basic `ssh` usage: ```sh $ ssh <username>@pitzer.osc.edu # e.g. jelmer@pitzer.osc.edu $ ssh <username>@owens.osc.edu # e.g. jelmer@owens.osc.edu ``` You will be prompted for your OSC password. -- .content-box-info[ See [this optional ungraded assignment](https://mcic-osu.github.io/pracs-sp21/) for some SSH setup you can do to make this quicker: - Avoid being prompted for your password. - Use a *shortcut* for `<username>@<cluster>.osc.edu`. There are also a few steps you can take to SSH tunnel to OSC with you local VS Code installation. ] --- ## Login nodes: best practices After logging in to OSC, you're on a **login node**. - Use login nodes for *navigation*, *housekeeping*, and *job submission*. - Any process that is active for >20 minutes or uses >1GB **will be killed**. - It is good practice to avoid even getting close to these limits: login nodes are shared by everyone, and can get clogged. (See [this OSC page]((https://www.osc.edu/supercomputing/login-environment-at-osc)) for more info.) --- ## Transferring files to and from OSC with GUIs #### For small transfers (<1 GB): use OnDemand <p align="center"> <img src=img/ondemand2_circle.png width="85%"> </p> ---- <p align="center"> <img src=img/ondemand3_circle.png width="85%"> </p> -- <br> #### For large transfers (>1 GB): use Globus - Especially useful for very large and/or complex transfers. - Does need a [local installation](https://www.osc.edu/resources/getting_started/howto/howto_transfer_files_using_globus_connect) and some set-up. .content-box-info[ Next week, we'll talk about transferring data from the shell. ] --- class: inverse center middle name: filesystems # File systems at OSC ---- <br><br><br><br> --- ## File systems at OSC OSC has [several file systems]((https://www.osc.edu/supercomputing/storage-environment-at-osc/available-file-systems) with different typical use cases: we'll discuss this in terms of short-term versus long-terms storage. --- ## File systems: long-term storage with daily back-ups **Home dir** - Can always be accessed with shortcuts **`$HOME`** and **`~`**. - 500 GB limit per user. - Your home dir will include your first project ID, e.g. `/users/PAS0471/jelmer`. (However, this is not the project dir!) -- <br> **Project dir** - `/fs/project/<project-ID>` (MCIC project: `/fs/project/PAS0471`) **OR** - `/fs/ess/<project-ID>` (Course project: `/fs/ess/PAS1855`) - The total storage limit is whatever the project's PI sets this to (and pays for...). Note that *charges are for the reserved space*, not the actual usage. --- ## File systems: short-term storage locations **scratch** - `/fs/scratch/<projectID>` or `/fs/ess/scratch/<projectID>`. - Fast I/O – good when working with large input and output files. - **Temporary**: deleted after 120 days, and not backed up. -- <br> **`$TMPDIR`** - Represents temporary storage space *on your compute node*. 1 TB max. - Available in the job script through the environment variable `$TMPDIR`. - **Deleted after job ends** – so copy to/from in job scripts! - See [this bonus slide](#tmpdir) for some example code for working with `$TMPDIR`. --- class: inverse middle center name: software # Using software at OSC ----- <br><br><br><br> --- ## Software at OSC with Lmod (modules) - At OSC, software is managed with the Lmod system. While a lot of software is installed, **much is only available to use after we explicitly load it.** This enables the use of different versions of the same software and of mutually incompatible software one a single system. - For a list of software that has been installed at OSC, see: <https://www.osc.edu/resources/available_software>. --- ## Software at OSC with Lmod (modules) (cont.) - You can also search for available software ("modules") in the shell: - `module spider` list all/matching modules that are installed. - `module avail` lists all/matching modules that *can be directly loaded*, given the current environment. ```sh $ module spider #> [long list ...] $ module spider python #> Versions: #> python/2.7-conda5.2 #> python/3.6-conda5.2 #> python/3.7-2019.10 $ module avail #> [long list ...] $ module avail python #> python/2.7-conda5.2 python/3.6-conda5.2 (D) python/3.7-2019.10 ``` --- ## Software at OSC with Lmod (modules) (cont.) - *Loading* and unloading installed software is also done with `module` commands: ```bash # Load a module: $ module load python # Load default version $ module load python/3.7-2019.10 # Load a specific version # Unload a module: $ module unload python ``` - To check which modules have been loaded: ```sh $ module list ``` *(The list will include some modules that have been loaded automatically.)* --- ## What if software isn't installed at OSC? **Install it yourself or get others to install it:** - Installing yourself is possible but can be tricky (no admin rights...). - For commonly used software, send an email to <oschelp@osc.edu>. - People like me can help with (more niche) bioinformatics software. -- <br> **Alternative approaches (recommended):** - The **Conda** software management system (next slides). - `Singularity` **containers**. - These alternatives are useful not only at OSC – but more generally, for: - Reproducible environments. - Portable environments (especially containers). - Avoidance of dependency hell. --- ## Conda: a software environment manager Conda creates *environments* with one (best-practice) or more software packages, which are activated and deactivated like with the `module` system. When creating environments, Conda handles dependencies yet it does not require admin rights. --- ## Conda: a software environment manager **Benefits:** - Can easily maintain distinct "environments" each with a different version of the same software, or with mutually incompatible software. - Not much of a learning curve. - Can save and share environment instructions as very simple text files. - Lots of bioinformatics software is available as a Conda package. <br> -- **Limits – Conda does not:** - Create an isolated environment that can be exported and used anywhere. - Allow you to run software that requires a different OS / OS version. .content-box-info[ **Containers** address these limitations – see [these bonus slides](#containers). ] --- ## Conda in practice, briefly (more on Thursday) - Load the Conda module at OSC (comes with the Python module): ```bash $ module load python/3.6-conda5.2 ``` - Example: **create** a new environment with "`cutadapt`" installed: ```bash $ conda create -y -n cutadapt-env cutadapt ``` -- - **Activate** the `cutadapt-env` environment and see if we can run Cutadapt: ```bash $ source activate cutadapt-env (cutadapt-env) $ cutadapt --version #> 3.2 ``` *(Note above that Conda will indicate the active environment in the prompt!)* --- class: inverse middle center name: jobs # Compute jobs ----- <br><br><br> --- ## Starting a compute job: background To do analyses at OSC, you need access to a **compute node**. - Automated scheduling software allows 100s of people with different requirements to access compute nodes effectively and fairly. - OSC uses the **SLURM** scheduler (**S**imple **L**inux **U**tility for **R**esource **M**anagement). <br> .content-box-info[ Last year, OSC switched from PBS to SLURM – see [here](https://www.osc.edu/supercomputing/knowledge-base/slurm_migration) for migration info and a comparison of commands and options between the two systems. ] --- ## Starting a compute job: how? Three main options: - In OnDemand, start an "**Interactive App**" like Jupyter Notebook or RStudio. - Run an **interactive shell job**. - Provide *SLURM* directives in or along with **scripts**. <br> <br> .content-box-info[ Both interactive and non-interactive jobs start in the directory that they were submitted from: i.e., your working directory will remain the same. ] --- ## Passing on requests/instructions to SLURM You can provide instructions to the *SLURM* scheduler to specify the resources (cores/memory/time/etc) that you need. These can be provided at the command-line: ```sh $ sinteractive -A PAS1855 -t 60 $ sbatch -A PAS1855 -t 60 myscript.sh ``` As well as **at the start of** scripts in lines with the `#SBATCH` keyword: ```sh #!/bin/bash #SBATCH --account=PAS1855 #SBATCH --time=60 ``` -- .content-box-info[ Many SLURM options have a long format (`--account=PAS1855`) and a short format (`-A PAS1855`), which can generally be used interchangeably. ] --- ## Starting compute jobs: Interactive shell jobs Interactive shell jobs will get you interactive shell access on a compute node. - For short interactive shell jobs, use OSC's custom [`sinteractive`](https://www.osc.edu/supercomputing/knowledge-base/slurm_migration/how_to_monitor_and_manage_jobs) wrapper: ```sh $ sinteractive -A PAS1855 # Default: 30-mins, 1-core, 1-node $ sinteractive -A PAS1855 -t 60 # 60 mins - max for sinteractive ``` .content-box-info[ `sinteractive` only accepts a selection of SLURM options and they have to be specified in short format. ] -- <br> - You can also use `srun` / `salloc`, which support all SLURM options: ```sh $ srun -A PAS1855 -t 60 --pty /bin/bash # Equivalent -- syntax that OSC uses in their docs: $ salloc -A PAS1855 -t 60 srun --pty /bin/bash ``` --- ## Starting compute jobs: Shell scripts - To submit a job, we simply preface the call to our script with `sbatch`: ```sh # Run a script on whatever node we are on: $ ./myscript.sh # Same, but now submit it to the queue: $ sbatch myscript.sh # Submitted batch job 2526085 ``` -- <br> - We can also provide `sbatch` options and/or script arguments: ```sh $ sbatch [sbatch-options] <script> [script-arguments] $ sbatch myscript.sh sampleA.fastq.gz # Submitted batch job 2526086 $ sbatch -t 60 -A PAS1855 --mem=20G myscript.sh # Submitted batch job 2526087 ``` --- ## Starting compute jobs: Shell scripts (cont.) .content-box-info[ Any `sbatch` options provided on the command-line will override the equivalent options provided in the script itself. Our script, `myscript.sh`: ```sh #!/bin/bash #SBATCH --account=PAS1855 #SBATCH --time=00:45:00 #SBATCH --mem=8G ``` Submitting the script: ```sh $ sbatch --mem=20G myscript.sh ``` In this case, the memory request will be `20G` and not `8G`. ] --- ## Compute job options: Project - Your jobs need to be billed to a project ("account"), and you *always* need to specify one. <br> <br> <br> <br> <br> <br> <br> <br> | Resource/use | short | long | default |------------------------------|----------|-----------|:--------:| | Project to be billed | `-A PAS1855` | `--account=PAS1855` | N/A ```sh $ sbatch -A PAS1855 myscript.sh ``` ```sh #!/bin/bash #SBATCH --account=PAS1855 ``` --- ## Compute job options: Time You need to specify a **time limit** for your job (= "wall time"), and your job gets killed as soon as it hits the time limit! - You will only be charged for the time your job *actually used*. - Shorter jobs are likely to start sooner. **If you are uncertain about the time your job will take, ask for (much) more time than you think you will need.** -- <br> .content-box-info[ For single-node jobs, up to 168 hours (7 days) can be requested. If that's not enough, you can request access to the `longserial` queue for jobs of up to 336 hours (14 days). ] --- ## Compute job options: Time (cont.) Acceptable time formats include: - `minutes:seconds` - `hours:minutes:seconds` - `days-hours` - `days-hours:minutes` - `days-hours:minutes:seconds` <br> <br> | Resource/use | short | long | default |------------------------------|----------|--------------------|:-----: | Time limit (4 h) | `-t 4:00:00` | `--time=4:00:00` | 1:00:00 ```sh $ sbatch -t 30 myscript.sh # 30 minutes ``` ```sh #!/bin/bash #SBATCH --time=1:00:00 # 1 hour ``` --- ## Compute job options: Memory Specify a maximum amount of RAM (Random Access Memory) that your job can use. - Like with the time limit, your job **gets killed** when it hits the memory limit. But this is not that common – as the [OSC documentation](https://www.osc.edu/supercomputing/batch-processing-at-osc/job-scripts) says: > There is no need to specify a memory limit unless your memory requirements > are disproportionate to the number of cores you are requesting or you need a large-memory node. - The default unit is MB (MegaBytes) – use "G" for GB. <br> -- | Resource/use | short | long | default | |------------------------------|----------|-----------|:-----| | Memory limit per node (8 GB) | - | `--mem=8G` | *(usually 4G)* ```sh $ sbatch --mem=20G myscript.sh # 20 GB ``` ```sh #!/bin/bash #SBATCH --mem=20G ``` --- ## Compute job options: Nodes, cores, and tasks | Resource/use | short | long | default |-------------------------------|----------|-------------------------|:--------:| | Number of nodes | `-N 1` | `--nodes=1` | 1 | Number of cores (per task) | `-c 1` | `--cpus-per-task=1` | 1 | Number of "tasks" (processes) | `-n 1` | `--ntasks=1` | 1 | Number of tasks per node | - | `--ntasks-per-node=1` | 1 <br> -- .content-box-warning[ - SLURM for the most part uses "**core**" and "**CPU**" (which often technically contains multiple cores) interchangeably. - In this context, "**thread**" (a single core may contain multiple threads) is *also* used in the same way, so when we talk about *multi-threading*, think about using multiple cores. Similarly, when a program asks for `nThreads` or similar, this should be equal to the number of *cores* you have asked for (`-c`). ] --- ## Compute job options: Nodes, cores, and tasks | Resource/use | short | long | default |-------------------------------|----------|-------------------------|:--------:| | Number of nodes | `-N 1` | `--nodes=1` | 1 | Number of cores | `-c 1` | `--cpus-per-task=1` | 1 | Number of "tasks" (processes) | `-n 1` | `--ntasks=1` | 1 | Number of tasks per node | - | `--ntasks-per-node=1` | 1 .content-box-info[ - Only ask for **>1 node** when you have explicit parallelization with something like MPI (uncommon in bioinformatics). - Multi-threading, i.e. a single program run using multiple threads/cores, is much more common than having multiple processes in a single job. - **For jobs with multi-threading, use `--cpus-per-task=n` (and keep `--nodes` and `ntasks` at their defaults of 1).** - For jobs with multiple processes (tasks), use `--ntasks=n` or `--ntasks-per-node=n` (but keep `--nodes=1` unless you use MPI or similar). ] --- ## Compute job options: Nodes, cores, and tasks | Resource/use | short | long | default |-------------------------------|----------|-------------------------|:--------:| | Number of nodes | `-N 1` | `--nodes=1` | 1 | Number of cores | `-c 1` | `--cpus-per-task=1` | 1 | Number of "tasks" (processes) | `-n 1` | `--ntasks=1` | 1 | Number of tasks per node | - | `--ntasks-per-node=1` | 1 ```sh #!/bin/bash #SBATCH --cpus-per-task=2 # Preferred for multi-threading ``` ```sh #!/bin/bash #SBATCH --ntasks=2 # Preferred for multi-process # Or --ntasks-per-node=2, which is # equivalent for single-node jobs ``` --- class: center middle inverse # Questions? ----- <br> <br> <br> <br> --- class: inverse center middle name:bonus # Bonus material ----- <br> .left[ - ### [Working with $TMPDIR in SLURM scripts](#tmpdir) - ### [Working with Singularity containers](#containers) ] <br><br> --- background-color: #f2f5eb name: tmpdir ## Working with `$TMPDIR` in scripts Copy input to `$TMPDIR`, copy output to home: ```bash cp -R $HOME/my/data/ $TMPDIR/ #[...analyze data with I/O happening in $TMPDIR....] cp -R $TMPDIR/* $HOME/my/data/ ``` <br> One potential problem with this approach is that your script may not reach the final line, where the data is copied back: perhaps it encountered an error or was killed. But you may still be interesting in getting the output data. The following trick with `trap` will copy the data even if the job is killed: ```bash trap "cd $PBS_O_WORKDIR;mkdir $PBS_JOBID;cp -R $TMPDIR/* $PBS_JOBID;exit" TERM ``` --- class: inverse middle center # Working with Singularity containers ---- <br> <br> <br> <br> --- background-color: #f2f5eb name: containers ## Containers versus Virtual Machines (VMs) - Containers don't virtualize entire Operating Systems: - Containers are smaller. - Containers are more lightweight to run. <p align="center"> <img src=img/containers.png width="65%"> </p> .content-box-info[ - Use a VM for long interactive sessions with multiple programs. - Use a container to run one or a few programs non-interactively. ] --- background-color: #f2f5eb ## Docker versus Singularity containers Docker and Singularity are two different container platforms; Docker is by far the most widely used. However, Singularity: - Was specifically designed for HPCs & is available at OSC - Needs no admin rights to run containers - Can work with Docker images but not vice versa - Integration with several host directories by default - Has single-file container images - Is open-source --- background-color: #f2f5ebd ## Getting a container Container images for most software can be found at these registries: - Docker Hub: <https://hub.docker.com/> - Sylabs Cloud Library: <https://cloud.sylabs.io/> - Biocontainers (for bioinformatics software): <https://biocontainers.pro/> If no suitable image is available, you can build your own! <br> -- .center[ **Container terminology** ] | Term | Meaning |---------------|--------- | Image | File(s) containing container application | Container | A *running* container image | Definition file (Singularity) <br> / Dockerfile (Docker) | Recipe to build a container image --- background-color: #f2f5eb ## Downloading a container image - Download (`pull`) a container with MultiQC: ```sh $ singularity pull <registry>://<user>/<project>/<name>:<tag> $ singularity pull docker://quay.io/biocontainers/multiqc:1.9--py_1 #> INFO: Converting OCI blobs to SIF format ``` (Note that it pulls a *Docker* image and converts it automatically.) - Now we have `.sif` container image: ```sh $ ls multiqc* #> multiqc_1.9--py_1.sif ``` .content-box-info[ An example of downloading a container from Singularity hub ("shub"): ```sh singularity pull shub://singularityhub/hello-world #> INFO: Downloading shub image ``` ] --- background-color: #f2f5eb ## Interacting with a container image - To "run" the container, i.e. run whatever is defined as the default action: ```sh $ singularity run multiqc_1.9--py_1.sif Singularity> # In this case, opens terminal inside container Singularity> multiqc --help # In which we can run multiqc #> Usage: multiqc [OPTIONS] <analysis directory> ``` - The `shell` sub-command *always* opens a terminal inside the container: ```sh $ singularity shell multiqc_1.9--py_1.sif Singularity> ``` - **Most useful is the the `exec` sub-command**, which lets you directly execute a shell command – and then exits: ```sh $ singularity exec multiqc_1.9--py_1.sif multiqc --help ``` --- background-color: #f2f5eb ## "Containerizing" an analysis or script - Running a locally installed program: ```sh $ multiqc --help ``` - Run the same command inside a container: ```sh $ singularity exec multiqc_1.9--py_1.sif \ multiqc --help ```