class: inverse middle center

## *Week 6: OSC jobs with SLURM*

----

# Part II: <br> SLURM jobs in practice

<br> <br> <br> <br> <br>

### Jelmer Poelstra

### 2021/02/18

---
class: inverse middle center

# Overview

----

.left[
- ### [More on SLURM directives](#slurm)
- ### [Example I: Submitting a script running FastQC](#fastqc)
- ### [Example II: Running MultiQC with conda](#multiqc)
- ### [More Conda](#more-conda)
- ### [SLURM environment variables](#slurm-env)
]

---
name: slurm

## More on SLURM directives

**We saw the following directives earlier:**

| Resource/use                  | Short        | Long                 | Default  |
|-------------------------------|--------------|----------------------|:--------:|
| Project to be billed          | `-A PAS0471` | `--account=PAS0471`  | _N/A_    |
| Time limit                    | `-t 4:00:00` | `--time=4:00:00`     | 1:00:00  |
| Nr of nodes                   | `-N 1`       | `--nodes=1`          | 1        |
| Nr of cores                   | `-c 1`       | `--cpus-per-task=1`  | 1        |
| Nr of "tasks" (processes)     | `-n 1`       | `--ntasks=1`         | 1        |
| Nr of tasks per node          | -            | `--ntasks-per-node`  | 1        |
| Memory limit per node         | -            | `--mem=4G`           | *(4G)*   |

---

## More on SLURM directives (cont.)

**Some other options** (for more, see the [SLURM documentation](https://slurm.schedmd.com/sbatch.html)):

| Resource/use                        | Short | Long                           |
|-------------------------------------|-------|--------------------------------|
| Log output file (`%j` = job number) | `-o`  | `--output=slurm-fastqc-%j.out` |
| Error output (*stderr*)             | `-e`  | `--error=slurm-fastqc-%j.err`  |

<br> <br> <br> <br> <br> <br>

.content-box-info[
By default, standard output **and** standard error are sent to the same output file (`-o` / `--output`).

To send standard error to its own output file, use the `-e` / `--error` option.
]

---

## More on SLURM directives (cont.)

**Some other options** (for more, see the [SLURM documentation](https://slurm.schedmd.com/sbatch.html)):

| Resource/use                          | Short | Long                           |
|---------------------------------------|-------|--------------------------------|
| Log output file (`%j` = job number)   | `-o`  | `--output=slurm-fastqc-%j.out` |
| Error output (*stderr*)               | `-e`  | `--error=slurm-fastqc-%j.err`  |
| Job name (displayed in the queue)     | -     | `--job-name=fastqc`            |
| Partition (= queue type)              | -     | `--partition=longserial`       |
| Get email when job starts/ends/fails  | -     | `--mail-type=ALL` <br> `--mail-type=START` <br> `--mail-type=END` <br> `--mail-type=FAIL` |
| Let job begin at/after specific time  | -     | `--begin=2021-02-01T12:00:00`  |
| Let job begin after other job is done | -     | `--dependency=afterany:123456` |

---

## More on SLURM directives (cont.)

In scripts, SLURM directives must be provided before any executable lines:

.pull-left[
**Good:**

```sh
#!/bin/bash
#SBATCH --account=PAS1855

set -e -u -o pipefail
```
]

.pull-right[
**Bad:**

```sh
#!/bin/bash
set -e -u -o pipefail

#SBATCH --account=PAS1855
```
]

---
class: inverse middle center
name: fastqc

# Example I: <br> Submitting a script running FastQC

----

<br> <br> <br> <br>

---

## FastQC: A program for quality control of FASTQ files

FastQC produces visualizations and assessments of FASTQ files for statistics such as per-base quality (below) and adapter content.
.pull-left[
<p align="center">
<img src=img/fastqc_good.png width="80%">
</p>
]

.pull-right[
<p align="center">
<img src=img/fastqc_bad.png width="80%">
</p>
]

--

<br>

- FastQC analyzes one (optionally gzipped) FASTQ file at a time:

```sh
$ fastqc --outdir=<output-dir> <fastq-file>
```

---

## Our basic FastQC script

Here is what a script to run FastQC non-interactively could look like:

```sh
#!/bin/bash
set -e -u -o pipefail

fastq_file="$1"
output_dir="$2"

mkdir -p "$output_dir"

echo "Running fastqc for file: $fastq_file"
echo "Output dir: $output_dir"

fastqc --outdir="$output_dir" "$fastq_file"
```

--

<br>

**To run this at OSC, we need to do two additional things:**

1. Add SLURM directives to request the appropriate cluster resources.

1. Load OSC's module for FastQC.

---

## Adding SLURM directives to our FastQC script

In this case, we're happy with most of the defaults, which include 1 node and 1 core, so we only specify the following options:

- The job should be billed to the `PAS1855` project (recall that we always need to provide the project number):

```sh
#SBATCH --account=PAS1855
```

--

- FastQC does not take long to run, so we can use a time limit of 15 minutes:

```sh
#SBATCH --time=15
```

---

## Adding SLURM directives to our FastQC script

- We add "*fastqc*" to the default log file name, which would otherwise be `slurm-<job-number>.out`. This is useful to keep track of what your different log files are about.

  `%j` is the job number: it is part of the default file name, but not once we specify our own name such as `slurm-fastqc.out`. By keeping `%j` in the name, we avoid having multiple jobs write to the same log file.

```sh
#SBATCH --output=slurm-fastqc-%j.out
```

--

<br>

.content-box-info[
Other useful patterns include `%u` (user name) and `%x` (job name).
]

---

## Loading the module

To be able to run FastQC at OSC, we need to load the module, which we can simply do by adding the following line in the script:

```sh
module load fastqc/0.11.8
```

This line needs to occur before we call FastQC, and it's best placed at the start of the script.

---

## The full script

```sh
#!/bin/bash
#SBATCH --account=PAS1855
#SBATCH --time=15
#SBATCH --output=slurm-fastqc-%j.out

set -e -u -o pipefail

module load fastqc/0.11.8

fastq_file="$1"
output_dir="$2"

mkdir -p "$output_dir"

echo "Running fastqc for file: $fastq_file"
echo "Output dir: $output_dir"

fastqc --outdir="$output_dir" "$fastq_file"
```

---

## The full script – touched up a bit more

```sh
#!/bin/bash
#SBATCH --account=PAS1855
#SBATCH --time=15
#SBATCH --output=slurm-fastqc-%j.out

set -e -u -o pipefail

module load fastqc/0.11.8

fastq_file="$1"
output_dir="$2"

mkdir -p "$output_dir"

date                              # Report date+time so we can time the script
echo "Starting FastQC script..."  # Report what script is being run
echo "Running fastqc for file: $fastq_file"
echo "Output dir: $output_dir"
echo -e "---------\n\n"           # Separate from program output

fastqc --outdir="$output_dir" "$fastq_file"

echo -e "\n---------\nAll done!"  # Separate from program output
date                              # Report date+time so we can time the script
```

---

## Submit the job

.content-box-diy[
Go to your personal dir and create a dir for this week:

```sh
$ cd /fs/ess/PAS1855/users/$USER
$ mkdir week06
```

Now create a new file, paste in the code from the previous slide, and save it as `fastqc_single.sh`.
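
Optionally, you can do a quick syntax check of the script before submitting it (`bash -n` only parses the script, it does not run anything):

```sh
$ bash -n fastqc_single.sh   # No output means no syntax errors were found
```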
]

- We need to provide the script with an output dir and a FASTQ file, whose names we will store in variables:

```sh
$ output_dir=QC_fastq
$ fastq_file=/fs/ess/PAS1855/data/fastq_16S/201-S4-V4-V5_S53_L001_R1_001.fastq
```

- Now we can submit the job:

```sh
$ sbatch fastqc_single.sh "$fastq_file" "$output_dir"
# Submitted batch job 2451088
```

---

## Monitoring the SLURM job

We can check in on the job using the SLURM command `squeue`:

- Initially the job may be queued ("pending") – `PD` in the `ST` (**st**atus) column:

```sh
$ squeue -u $USER
#   JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
# 2526085 serial-40 fastqc_s   jelmer PD   0:00      1 (None)
```

--

- Then the job should be running – `R` in the `ST` column:

```sh
$ squeue -u $USER
#   JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
# 2526085 serial-40 fastqc_s   jelmer  R   0:02      1 p0002
```

--

- You can also use the `-l` option for more info:

```sh
$ squeue -u $USER -l
#   JOBID PARTITION     NAME     USER    STATE   TIME TIME_LIMI  NODES NODELIST(REASON)
# 2979968 serial-40 fastqc_s   jelmer  RUNNING   0:05     15:00      1 p0002
```

---

## Check the log and the output files

- When the job has finished, `squeue` will return nothing but the header (!):

```sh
$ squeue -u $USER
#   JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
```

--

- Let's find the SLURM log file and look at its contents:

```sh
$ ls
# slurm-fastqc-2451088.out

$ less slurm-fastqc-2451088.out
```

- Check the output dir:

```sh
$ ls -lh $output_dir
```

--

<br>

.content-box-info[
[Bonus slides](#loop): a second script that loops over many FASTQ files.
]

---
class: inverse middle center
name: multiqc

# Example II: Running MultiQC with conda

----

<br> <br> <br> <br>

---

## MultiQC

MultiQC is a program that *summarizes multiple FastQC results* into a single nicely formatted HTML page:

<br>

----

<p align="center">
<img src=img/multiqc_front.png width="100%">
</p>

---

## MultiQC

- For instance, this figure shows mean quality scores across 132 samples:

<p align="center">
<img src=img/multiqc_qual.png width="100%">
</p>

---

## MultiQC and Conda at OSC

- Like FastQC, running MultiQC is straightforward:

```sh
# This will put the MultiQC report in the working dir:
$ multiqc <dir-with-fastqc-reports>

# You can also specify the output dir for the MultiQC report:
$ multiqc <dir-with-fastqc-reports> -o <output-dir>
```

--

- But MultiQC is *not* available as a module at OSC (i.e., no `module load`). Therefore, we will install it for ourselves with Conda.

---

## Setting up Conda

- To use Conda at OSC, we first need to load the module (every session!):

```sh
$ module load python/3.6-conda5.2
```

- We will also do some one-time configuration, to set the Conda "channels" (repositories) and their priorities:

```sh
$ conda config --add channels defaults
$ conda config --add channels bioconda
$ conda config --add channels conda-forge   # Highest priority
```

--

- Let's also check whether that worked:

```sh
$ conda config --get channels
# --add channels 'defaults'      # lowest priority
# --add channels 'bioconda'
# --add channels 'conda-forge'   # highest priority
```

---

## One-time installation of MultiQC <br> into a Conda environment

- Now, we are ready to create a new `conda` environment with Python:

```sh
$ conda create -y -n multiqc-env python=3.7
```

  - `create` is the `conda` command to create a new environment.

  - `-y` avoids a confirmation prompt.

  - `-n multiqc-env` is the name for the new environment, which is arbitrary.

  - `python=3.7` is how we specify that we want to install Python, and specifically Python 3.7.
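
To confirm that the environment was indeed created, you can list your environments (a quick, optional check; the environment paths shown will differ per user):

```sh
# 'multiqc-env' should appear in the list of environments:
$ conda env list
```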
---

## One-time installation of MultiQC <br> into a Conda environment

- Next, we *activate* the environment and install MultiQC into it:

```sh
$ source activate multiqc-env
$ conda install -y -c bioconda -c conda-forge multiqc
```

  `-c bioconda -c conda-forge` is how we indicate which channels we want to use and prioritize for this particular installation. (These sorts of details are provided in software installation instructions.)

- Having done that, we should be able to run MultiQC:

```sh
$ multiqc --help
```

.content-box-diy[
If you didn't manage to create your own conda environment, you can also use mine:

```sh
$ source activate /users/PAS0471/jelmer/.conda/envs/multiqc
```
]

---

## Side notes

.content-box-info[
Every time you want to run MultiQC (and equivalent for other software in Conda environments), you need to include these two lines:

```sh
$ module load python/3.6-conda5.2
$ source activate multiqc-env
```
]

.content-box-info[
Here, we use `source activate` instead of the more standard `conda activate`, because `conda activate` would require us to run a setup script first.
]

.content-box-info[
Usually, you can create a Conda environment and install software into it in a single `conda create` step, rather than having to activate the environment and then use `conda install`, but the two-step procedure seems necessary in this case.
]

---

## Running MultiQC in an interactive job

Because MultiQC also runs quickly and only needs to be run once for all of our files, we will run it in an interactive job.

To start an interactive job and run MultiQC there:

```sh
# Start an interactive job at OSC:
$ sinteractive -A PAS0471   # No time specified: will be 30 minutes

$ module load python/3.6-conda5.2
$ source activate multiqc-env

$ multiqc QC_fastq/ -o QC_fastq/
```

---
class: inverse middle center
name: more-conda

# More Conda

----

<br> <br> <br> <br>

---

## More Conda

One very nice aspect of Conda with respect to reproducibility is that its software environments can be fully described in simple YAML text files, which consist of key:value pairs.

- Export a YAML file with the environment description:

```sh
# Export the active environment:
$ conda env export > environment.yml

# Export any environment:
$ conda env export -n multiqc-env > multiqc-env.yml
```

- Now, anyone with access to your environment YAML file can easily recreate your environment for themselves!

```sh
$ conda env create --file environment.yml
```

---

## More Conda (cont.)

Some other Conda commands:

```sh
# Deactivate the active conda environment:
$ conda deactivate

# Remove an environment entirely:
$ conda env remove -n multiqc-env

# List all packages in an environment:
$ conda list -n multiqc-env

# List all your Conda environments:
$ conda env list

# Clone an environment:
$ conda create --name multiqc-env-copy --clone multiqc-env

# Search for a package (can use wildcards):
$ conda search 'bwa*'

# Activate multiple environments (stack a 2nd env):
$ conda activate --stack <second-env-name>
```

---

## More Conda (cont.)

It's worth considering using Conda (and/or containers) for external software whenever possible, i.e. even when there is an OSC module available.

This will increase the portability of your project (e.g., you can provide Conda environment YAML files for all software), and it also lets you use whichever version you want – OSC, for example, rarely has the most recent version of a program.

.content-box-info[
Because the SLURM directives are special comments (`#SBATCH`), they will not interfere with running the script elsewhere, e.g. locally.
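
As a small illustration (using the `fastqc_single.sh` script and the variables defined earlier – and assuming FastQC is available where you run it), the same file can be run directly with `bash`, in which case the `#SBATCH` lines are treated as ordinary comments:

```sh
# Run the script without submitting it to SLURM; the #SBATCH lines are ignored:
$ bash fastqc_single.sh "$fastq_file" "$output_dir"
```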
]

---
class: inverse middle center
name: slurm-env

# SLURM environment variables

----

<br> <br> <br> <br>

---

## SLURM environment variables

- Inside the script, SLURM environment variables will be available, such as:

| Variable                | Description                                |
|-------------------------|--------------------------------------------|
| `$SLURM_SUBMIT_DIR`     | Path to dir from which job was started     |
| `$TMPDIR`               | Path to dir for job (fast I/O)             |
| `$SLURM_JOB_ID`         | Job ID assigned by SLURM                   |
| `$SLURM_JOB_NAME`       | Job name supplied by the user              |
| `$SLURM_CPUS_ON_NODE`   | Number of CPUs (cores) available on 1 node |
| `$SLURM_MEM_PER_NODE`   | Memory per node (as specified by `--mem`)  |
| `$SLURMD_NODENAME`      | Name of the node running the job           |

---

## Using SLURM environment variables

- For example, you may want to use `$SLURM_CPUS_ON_NODE` in the call to a program inside the script. This way, you only have to specify the number of cores once, and you don't run the risk of using more cores than you have available – from:

```sh
#SBATCH --cpus-per-task=6

STAR --runThreadN 6 --genomeDir ...
```

  ...to:

```sh
#SBATCH --cpus-per-task=6

STAR --runThreadN "$SLURM_CPUS_ON_NODE" --genomeDir ...
```

--

- Alternatively, if you specify the number of cores in the `sbatch` call on the command line, you don't need to hard-code it at all inside the script:

```sh
$ sbatch --cpus-per-task=6 star.sh
```

```sh
STAR --runThreadN "$SLURM_CPUS_ON_NODE" --genomeDir ...
```

---
class: center middle inverse

# Questions?

-----

<br> <br> <br> <br>

---
class: center middle inverse

# Bonus material

----

.left[
- ### [A script to loop over FASTQ files](#loop)
- ### [More options to monitor and manage SLURM jobs](#monitor)
- ### [SLURM job arrays](#slurm-arrays)
- ### [Conda setup to enable "conda activate"](#conda-setup)
]

<br> <br> <br> <br>

---
class: inverse middle center
name: loop

# A script to loop over FASTQ files

----

<br> <br> <br> <br>

---
background-color: #f2f5eb

## Now loop over all files...

The script that we wrote above will run FastQC for a single FASTQ file.

**Next, we can write a second script that loops over many files and submits the first script for each of them.**

In this case, we will do so by simply looping over all FASTQ files in a specified directory:

```sh
$ for fastq_file in "$fastq_dir"/*fastq; do
      sbatch scripts/qc/fastqc_single.sh "$fastq_file" "$output_dir"
  done
```

---
background-color: #f2f5eb

## The full script to loop over a directory

Script `fastqc_dir.sh`:

```sh
#!/bin/bash
set -e -u -o pipefail

fastq_dir="$1"
output_dir="$2"

echo "Submitting FastQC jobs..."
echo "Input dir: $fastq_dir"
echo "Output dir: $output_dir"
echo "Number of files: $(ls $fastq_dir/*fastq | wc -l)"

for fastq_file in "$fastq_dir"/*fastq; do
    sbatch scripts/qc/fastqc_single.sh "$fastq_file" "$output_dir"
done

echo -e "\n---------\nAll done!"
```

---
background-color: #f2f5eb

## Run the loop script

- Make sure the script is executable:

```sh
$ chmod u+x scripts/*
```

- Run the script:

```sh
$ fastq_dir=fastq/
$ output_dir=analysis/QC_fastq/

$ scripts/fastqc_dir.sh "$fastq_dir" "$output_dir"
#> Submitting FastQC jobs...
#> Input dir: fastq/
#> Output dir: analysis/QC_fastq/
#> Number of files: 8
#> Submitted batch job 2452187
#> Submitted batch job 2452188
#> Submitted batch job 2452189
#> ...
```

.content-box-info[
We didn't need to submit *this* script to the queue: all it does is submit jobs, which can be done just fine on a login node.
]

---
background-color: #f2f5eb

## Monitor the loop script

- Check the queue:

```sh
$ squeue -u $USER
```

- Peek at all the log files:

```sh
$ tail slurm-fastqc*

$ less slurm-fastqc*   # Press ":n" to go to the next file
```

<br>

--

.content-box-info[
We could also "follow" the output in a log file as it is being written:

```sh
tail -f <file>
```
]

---
class: inverse middle center
name: monitor

# More options to monitor <br> and manage SLURM jobs

----

<br> <br> <br> <br>

---
background-color: #f2f5eb

## More options to monitor and manage SLURM jobs: overview

- Log into a compute node that is running a job, and use `top`, `htop`, or `ps` to see CPU and memory usage in real time.

- Several commands, such as `sstat` (for running jobs), `sacct` (for finished jobs), and `scontrol`, can show useful statistics for jobs.

- You can use the `time` utility to track and print statistics for individual commands, e.g. `/usr/bin/time fastqc ...` ([bonus slides](#time)).

- **XDMoD on OnDemand** – demo?

---
background-color: #f2f5eb
name: more-monitor

## Additional `squeue` options

- Check only a specific job by specifying the job ID:

```sh
$ squeue -j <job-ID>
```

- You can also filter your jobs by status:

```sh
$ squeue -u $USER -t RUNNING
```

---
background-color: #f2f5eb

## "Modifying" SLURM jobs

- Update SLURM directives for a job that has already been submitted:

```sh
$ scontrol update job=<jobID> timeLimit=5:00:00
```

- Hold and release a pending (queued) job, e.g. when you need to update an input file before the job starts running:

```sh
$ scontrol hold <jobID>      # Job won't start running until released
$ scontrol release <jobID>   # Job is free to start
```

- You can *cancel* pending or running jobs using `scancel`:

```sh
$ scancel <jobID>      # Cancel a specific job
$ scancel -u $USER     # Cancel all your jobs
```

---
background-color: #f2f5eb

## Print job statistics using `scontrol show`

- You can see more details about any running or finished job, *including the amount of time it ran for*, using `scontrol show job`:

```sh
scontrol show job 2526085   # Replace the number with your JOBID
# UserId=jelmer(33227) GroupId=PAS0471(3773) MCS_label=N/A
# Priority=200005206 Nice=0 Account=pas0471 QOS=pitzer-default
# JobState=RUNNING Reason=None Dependency=(null)
# Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
# RunTime=00:02:00 TimeLimit=01:00:00 TimeMin=N/A
# SubmitTime=2020-12-14T14:32:44 EligibleTime=2020-12-14T14:32:44
# AccrueTime=2020-12-14T14:32:44
# StartTime=2020-12-14T14:32:47 EndTime=2020-12-14T15:32:47 Deadline=N/A
# SuspendTime=None SecsPreSuspend=0 LastSchedEval=2020-12-14T14:32:47
# Partition=serial-40core AllocNode:Sid=pitzer-login01:57954
# [...]
```

- To get this information for every job you run, you can include this at the end of your script:

```sh
scontrol show job $SLURM_JOB_ID
```

---
background-color: #f2f5eb

## SSH into a compute node <br> for information about a running job

- When you run `squeue`, you can see which node(s) your job is using:

```sh
#   JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
# 2526085 serial-40 fastqc_s   jelmer  R   0:02      1 p0002
```

- You can then SSH into this node:

```sh
ssh p0002
```

- Once on the node, you can use commands like `ps`, `top`, and `htop` to see memory and CPU usage, and the like. Just be aware that you may not be the only person running jobs on this node, so not all of the processes will be yours.

---
background-color: #f2f5eb

## SSH into a compute node <br> for information about a running job (cont.)
- For instance, to see current memory (`RSS`, in KB) and CPU usage, you can run:

```sh
$ ps -u $USER -o %cpu,rss,args
```

<br>

- For live (constantly updated) usage statistics, you can use `top` or `htop`, which are also interactive.

  For `top`, for instance, simply type `top` to start. Then, press <kbd>u</kbd> and enter your user name to only see your processes. For memory usage, pay attention to the `RES` column. Like with `less`, press <kbd>q</kbd> to quit.

<br>

- To exit the compute node and go back to where you were, type `exit` or press <kbd>Ctrl</kbd>+<kbd>D</kbd>.

---
background-color: #f2f5eb

## `sstat` for information about running jobs

To get information about things like memory usage, the `sstat` (for running jobs) and `sacct` (for finished jobs) commands are useful.

The output of these commands is tabular, and by default too many columns are printed, so rather verbose calls are needed. You may want to use an `sstat` call like the following **at the end of your script**:

```sh
sstat -j $SLURM_JOB_ID --format=jobid,avecpu,averss,maxrss,ntasks
```

This will get you output like:

```sh
JobID            AveCPU     MaxRSS   NTasks
------------ ---------- ---------- --------
2979576.bat+  01:59.000    244368K        1
```

.content-box-info[
`MaxRSS` is the maximum amount of memory used by the job.
]

---
background-color: #f2f5eb

## `sacct` for info (mostly) about finished jobs

- Get information about a specific job:

```sh
$ sacct -j 2978487 --format=JobID,JobName,NodeList,Elapsed,State,ExitCode,MaxRSS,AllocTRES%32 | grep -v ".extern"
#        JobID    JobName    NodeList    Elapsed      State ExitCode     MaxRSS                        AllocTRES
#------------ ---------- ----------- ---------- ---------- -------- ---------- --------------------------------
#2978487        fqtest.sh       p0224   00:01:54  COMPLETED      0:0            billing=1,cpu=1,gres/gpfs:proje+
#2978487.bat+       batch       p0224   00:01:54  COMPLETED      0:0    203332K           cpu=1,mem=4556M,node=1
```

- Get information about all jobs – by default those run in the past day:

```sh
$ sacct -u $USER --format=JobID,JobName,NodeList,Elapsed,State,ExitCode,MaxRSS,AllocTRES%32 | grep -v ".extern"
#                JobID    JobName    NodeList    Elapsed      State ExitCode     MaxRSS                        AllocTRES
#-------------------- ---------- ----------- ---------- ---------- -------- ---------- --------------------------------
#              2978362 fastqc_si+       p0224   00:01:51  COMPLETED      0:0            billing=1,cpu=1,gres/gpfs:proje+
#        2978362.batch      batch       p0224   00:01:51  COMPLETED      0:0    200004K           cpu=1,mem=4556M,node=1
#              2978487  fqtest.sh       p0224   00:01:54  COMPLETED      0:0            billing=1,cpu=1,gres/gpfs:proje+
#        2978487.batch      batch       p0224   00:01:54  COMPLETED      0:0    203332K           cpu=1,mem=4556M,node=1
#              2978503  fqtest.sh       p0224   00:01:55  COMPLETED      0:0            billing=1,cpu=1,gres/gpfs:proje+
```

---
background-color: #f2f5eb

## Aliases for long job monitoring commands

Because you won't remember all of these formatting options – and even if you did, you wouldn't want to type them out – this is a case where aliases are really handy. For instance:

```sh
alias sact='sacct --format=JobID,JobName,NodeList,Elapsed,State,ExitCode,MaxRSS,AllocTRES%32'
alias stat='sstat --format=jobid,avecpu,averss,maxrss,ntasks'
```

You can then add other options when you use an alias, for example:

```sh
$ sact -j 123456
```

<br>

.content-box-info[
To make these aliases always available in interactive sessions, add lines such as these to your Bash configuration file `~/.bashrc`.

(Note that this file won't be read by job scripts, so make sure to use the full commands in scripts.)
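
For example, one way to do that (appending the `sacct` alias shown above; editing `~/.bashrc` in a text editor works just as well) is:

```sh
# Append the alias to ~/.bashrc and reload it in the current session:
$ echo "alias sact='sacct --format=JobID,JobName,NodeList,Elapsed,State,ExitCode,MaxRSS,AllocTRES%32'" >> ~/.bashrc
$ source ~/.bashrc
```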
]

---
background-color: #f2f5eb
name: time

## `time`

To get statistics on a single command, like a memory-intensive program you may be running, use the `time` utility.

To use it, prepend `/usr/bin/time` to the beginning of a command:

```sh
/usr/bin/time fastqc "$fastq_file"
```

If you use this in a script, the result will be printed to the SLURM log file, e.g.:

```sh
106.79user 4.10system 1:52.34elapsed 98%CPU (0avgtext+0avgdata 246444maxresident)k
0inputs+1760outputs (0major+35697minor)pagefaults 0swaps
```

In order, this reports:

1. User time (CPU time spent running your program)
1. System time (CPU time spent by your program in system calls)
1. Elapsed time (wallclock)
1. % CPU used
1. Memory, pagefault and swap statistics
1. I/O statistics

.content-box-info[
You should use the full path `/usr/bin/time` rather than just `time`: the latter invokes the Bash built-in, which reports less information.
]

---
class: inverse middle center
name: bashrc-path

# SLURM job arrays

----

<br> <br> <br> <br> <br>

---
background-color: #f2f5eb
name: slurm-arrays

## SLURM Job arrays

As an alternative to writing a loop script, you can also turn a script that processes a single file into a job "array", i.e. a series of jobs.

Each job in the array gets an ID, which you can use inside the script so that each job runs with a different value of a certain parameter.

Most commonly, you may have a file with sample or file names, and then use the array IDs to subset lines from that file – for instance, for our FastQC script:

```sh
#SBATCH --array=1-25
#SBATCH --output=slurm-fastqc-%A-%a.out

fastq_file=$(sed -n "${SLURM_ARRAY_TASK_ID}p" fastq_files.txt)

fastqc "$fastq_file"
```

---
background-color: #f2f5eb

## SLURM Job arrays

```sh
#SBATCH --array=1-25
#SBATCH --output=slurm-fastqc-%A-%a.out

fastq_file=$(sed -n "${SLURM_ARRAY_TASK_ID}p" fastq_files.txt)

fastqc "$fastq_file"
```

- Above, the array was configured using the `--array` option, simply creating 25 jobs with array IDs from 1 to 25 (assuming 25 FASTQ files).

- Then, the SLURM environment variable `$SLURM_ARRAY_TASK_ID` will contain the task ID for each job in the array, and we use it, together with `sed`, to pull out one line from a file listing the FASTQ file names.

- We also use `%A` (master job ID) and `%a` (array task ID, which differs for each job in the array) for the log file name.

---
background-color: #f2f5eb

## SLURM Job arrays (cont.)

A full script example is below:

```sh
#!/bin/bash
#SBATCH --account=PAS0471
#SBATCH --time=60
#SBATCH --output=slurm-fastqc-%A-%a.out
#SBATCH --array=1-25

set -e -u -o pipefail

module load fastqc/0.11.8

fastq_files_list="$1"
output_dir="$2"

fastq_file=$(sed -n "${SLURM_ARRAY_TASK_ID}p" "$fastq_files_list")

echo "Running fastqc for file: $fastq_file"
echo "Output dir: $output_dir"

fastqc --outdir="$output_dir" "$fastq_file"
```

---
background-color: #f2f5eb

## SLURM Job arrays (cont.)

- The array values can also be a comma-separated list like `#SBATCH --array=8,10,15`, and can even have a *stride*, with `#SBATCH --array=1-7:2` producing array tasks 1, 3, 5, and 7.

- You can also throttle an array of jobs: only run a subset of the jobs at a time. This can be done as follows:

```sh
$ sbatch --array=1-100%10 fastq.sh
```

  The above command lets at most 10 of the 100 array tasks run at the same time. (This can also be specified inside the script.)

---
class: inverse middle center
name: conda-setup

# Conda setup to enable "conda activate"

----

<br> <br> <br> <br> <br>

---
background-color: #f2f5eb

## One-time Conda setup <br> to enable `conda activate`

- In contrast to `source activate`, `conda activate` can only be used after a setup script is run.
  The two commands do the same thing, though the Conda documentation nowadays recommends using `conda activate`. Should you want to use it, follow these instructions.

<br>

- The setup script needs to be run in every session, but we can *add code to our Bash configuration file* that will make sure the setup script is run whenever we log in at OSC.

  This is unfortunately made a bit more complicated by the fact that the two OSC clusters, Owens and Pitzer, have Conda installed in different locations.

---
background-color: #f2f5eb

## One-time Conda setup <br> to enable `conda activate`

- The following lines should be added to your `~/.bashrc`:

```sh
if [ -f /apps/python/3.6-conda5.2/etc/profile.d/conda.sh ]; then
    source /apps/python/3.6-conda5.2/etc/profile.d/conda.sh
elif [ -f /usr/local/python/3.6-conda5.2/etc/profile.d/conda.sh ]; then
    source /usr/local/python/3.6-conda5.2/etc/profile.d/conda.sh
fi
```

.content-box-diy[
Open your `~/.bashrc` file (with VS Code or Nano) and add these lines to the bottom.
]

---
background-color: #f2f5eb

## One-time Conda setup (cont.)

- The `~/.bashrc` config file is run whenever we open a new shell, but we also need to run the setup script now. To do so, we will *source* the edited Bash config file:

```sh
$ source ~/.bashrc
```

.content-box-info[
The above `~/.bashrc` setup only needs to be done once – it does not need to be repeated every session.
]
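
To check that the setup worked, you could try activating and deactivating an environment you created earlier, e.g. `multiqc-env` (you may first need to `module load python/3.6-conda5.2`):

```sh
# 'conda activate' should now work without errors:
$ conda activate multiqc-env
$ conda deactivate
```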