class:inverse middle center

## *Week 2: <br> Project Organization and Markdown*

----

# Part III: Managing Files in the Shell

<br> <br> <br> <br> <br>

### Jelmer Poelstra
### 2021-01-21 (updated: 2021-04-25)

---
class: center middle inverse

# Overview

----

.pull-left[
### [Wildcard expansion](#wildcard)
### [Brace expansion](#brace)
### [Command substitution](#command-sub)
]

.pull-right[
### [Programmatically renaming files](#renaming)
### [File permissions](#permissions)
]

<br> <br> <br>

---

## VS Code setup

- **Launch VS Code**: Go to <https://ondemand.osc.edu> => `Interactive Apps` => `Code Server` => 1 or 2 hours => `Launch` ... => `Connect to VS Code`.

- **Open your workspace**: On the *Welcome* page is a heading `Recent`, and your workspace should be listed as "`<workspace-name> (Workspace)`".

----

- **Open a new file**: <i class="fa fa-bars"></i> => `File` => `New File`. Save it as an `.md` file inside `week02`.

- **Open a terminal** (<kbd>Ctrl</kbd>+<kbd>backtick</kbd> or <i class="fa fa-bars"></i> => `Terminal` => `New Terminal`), and issue the command `bash` to break out of the Singularity shell.

--

<br>

.content-box-diy[
**Add keyboard shortcut to run shell commands from the editor:**

- Click the <i class="fa fa-cog"></i> (bottom-left) => `Keyboard Shortcuts`.
- Find `Terminal: Run Selected Text in Active Terminal`, click on it, then add a shortcut, e.g. <kbd>Ctrl</kbd>+<kbd>Enter</kbd>.
]

---

## Create a dummy project – following Buffalo

- Go to the dir for this week you created earlier:

```sh
$ cd week02      # Inside /fs/ess/PAS1855/users/$USER/
```

- Create a set of directories representing a dummy project:

```sh
$ mkdir zmays-snps
$ cd zmays-snps
$ mkdir -p data/seqs scripts analysis
```

--

- Create some empty "sequence files":

```sh
$ cd data/seqs
$ touch sample1_F.fastq.gz sample1_R.fastq.gz \
    sample2_F.fastq.gz sample2_R.fastq.gz \
    sample3_F.fastq.gz sample3_R.fastq.gz
```

- For a nice overview of your directory structure:

```sh
$ tree ../..
# "../.." tells tree to start two levels up
```

---
class: inverse middle center
name: wildcard

# Shell expansion I: Wildcard expansion

----

<br><br><br>

---

## Shell expansion I: Wildcard expansion

|Wildcard | Matches |
|-|-|
| * | Any number of any character, including nothing |
| ? | Any single character |
| [] and [^] | Any one character in (`[]`) or not in (`[^]`) the "character set" within the brackets |

--

- With the following files in a directory:

```sh
# sample1_F.fastq.gz sample1_R.fastq.gz
# sample2_F.fastq.gz sample2_R.fastq.gz
# sample3_F.fastq.gz sample3_R.fastq.gz
```

- **To match both "*sample1*" files:**

```sh
$ ls sample1_?.fastq.gz
$ ls sample1*
$ ls sample1*fastq.gz
```

---

## Shell expansion I: Wildcard expansion (cont.)

|Wildcard | Matches |
|-|-|
| * | Any number of any character, including nothing |
| ? | Any single character |
| [] and [^] | Any one character in (`[]`) or not in (`[^]`) the "character set" within the brackets |

- With the following files in a directory:

```sh
# sample1_F.fastq.gz sample1_R.fastq.gz
# sample2_F.fastq.gz sample2_R.fastq.gz
# sample3_F.fastq.gz sample3_R.fastq.gz
```

- **To match only files with forward (*"F"*) reads:**

```sh
$ ls *F*
$ ls *F.fastq.gz
```

---

## Shell expansion I: Wildcard expansion (cont.)

|Wildcard | Matches |
|---------|---------|
| * | Any number of any character, including nothing |
| ? | Any single character |
| [] and [^] | Any one character in (`[]`) or not in (`[^]`) the "character set" within the brackets |

- With the following files in a directory:

```sh
# sample1_F.fastq.gz sample1_R.fastq.gz
# sample2_F.fastq.gz sample2_R.fastq.gz
# sample3_F.fastq.gz sample3_R.fastq.gz
```

- **To match files for sample1 and sample2 using only a character class:**

```sh
$ ls sample[12]*     # Method 1: List all possible characters
$ ls sample[1-2]*    # Method 2: Use ranges - [0-9], [A-Z], [a-z].
$ ls sample[^3]*     # Method 3: *Exclude* something.
```

--

.content-box-warning[
`[]` works on character ranges only: 0-9 works but 10-13 does not.
]

---

## Shell expansion I: Wildcard expansion (cont.)
- Wildcards are only used to *match and expand to* existing file names: this process is also referred to as "**_globbing_**".

.content-box-q[
Why do we call globbing a type of ***shell* expansion**?
]

--

.content-box-answer[
The expansion –to all matching file names– is done *by the shell*, not by `ls` or another command you might be using wildcards with.

**Therefore, `ls` will see the list of files *after* the expansion.**
]

--

- You can make use of this when using shell expansion to *move* or *delete* files – first check what your command will see, then execute:

```sh
$ echo sample[12]*
# sample1_F.fastq.gz sample1_R.fastq.gz sample2_F.fastq.gz sample2_R.fastq.gz
$ rm sample[12]*
```

---

## Shell expansion I: Wildcard expansion (cont.)

.content-box-info[
Examples so far may seem trivial, but you can use these techniques to easily operate on selections among 100s or 1000s of files.
]

<br>

.content-box-warning[
Don't confuse wildcards with **regular expressions**! Regular expressions are useful across the board when coding, and we will have a Python module devoted to them (week 11).
]

---
class: inverse middle center
name: brace

# Shell expansion II: Brace expansion

----

<br><br><br>

---

## Shell expansion II: Brace expansion

```sh
$ cd ../..      # to: zmays-snps
```

<br>

*Wildcard expansion* looks for corresponding files and expands to whichever files are present: it can't be used to create files or dirs.

**_Brace expansion_**, on the other hand, expands to whatever you tell it to:

--

- Use `..` to indicate ranges:

```sh
$ mkdir -p data/obs/2021-01-{01..31}    # Numeric range works!
$ mkdir figs
$ touch figs/fig-1{A..F}                # Character range works
```

- Use a comma-separated list:

```sh
$ mkdir data/obs/treatment-{Kr,Df,Tr}_temp-{lo,med,hi}
```

---

## Shell expansion II: Brace expansion

*Wildcard expansion* looks for corresponding files and expands to whichever files are present: it can't be used to create files or dirs.
**_Brace expansion_**, on the other hand, expands to whatever you tell it to.

.content-box-q[
What will happen in these two cases?

```sh
$ cd figs
$ ls
# fig-1A fig-1B fig-1C fig-1D fig-1E fig-1F
```

```sh
$ ls fig-1[A-G]
```

```sh
$ ls fig-1{A..G}
```
]

---

## Shell expansion II: Brace expansion

*Wildcard expansion* looks for corresponding files and expands to whichever files are present: it can't be used to create files or dirs.

**_Brace expansion_**, on the other hand, expands to whatever you tell it to.

.content-box-answer[
- Wildcard expansion will ignore the missing "*fig-1G*":

```sh
$ ls fig-1[A-G]
# fig-1A fig-1B fig-1C fig-1D fig-1E fig-1F
```

- Brace expansion will complain about the missing "*fig-1G*":

```sh
$ ls fig-1{A..G}
# ls: cannot access 'fig-1G': No such file or directory
# fig-1A fig-1B fig-1C fig-1D fig-1E fig-1F
```
]

---

## <i class="fa fa-user-edit"></i> Your turn: File matching

1. Go into `zmays-snps/data/seqs/` and remove the files in there.

2. Using *brace expansion* and the **`touch`** command, create empty `R1` and `R2` fastq files for 100 samples with IDs from `001` to `100`: `sample<ID>_R1_001.fastq` and `sample<ID>_R2_001.fastq`.

3. Using *wildcard matching* and by piping **`ls`** output into **`wc -l`**, count the number of "`R1`" files (forward reads).

4. **_Bonus:_** Copy all files *except the two for "`sample100`"* into a new directory called `selection`: use a wildcard to do the copy with a single command. (You will first need to create the new dir separately.)

---

## <i class="fa fa-user-edit"></i> Solutions: File matching

1. Go into `zmays-snps/data/seqs/` and remove the files in there.

```sh
$ cd ../data/seqs/      # Assuming you were still in figs
$ rm *fastq.gz
```

2. Create empty `R1` and `R2` fastq files for 100 samples:

```sh
$ touch sample{001..100}_R{1,2}_001.fastq
```

3. Count the number of `R1` files:

```sh
$ ls *R1*fastq | wc -l
```

4. **_Bonus:_** Copy all files except "`sample100`" into a new dir called `selection`:

```sh
$ mkdir selection
# Exclude sample numbers starting with a 1:
$ cp sample[^1]* selection/
# Other way around: Include sample numbers starting with a 0:
$ cp sample0* selection/
```

---
class: inverse middle center
name: command-sub

# Shell expansion III: Command substitution

----

<br><br><br>

---

## Shell expansion III: Command substitution

- **Command substitution** allows you to store and pass the output of a command (or pipeline of commands) to another command. These two examples show how that can be useful, and that a pipe wouldn't work:

```sh
$ echo "I have $(ls *fastq | wc -l) fastq files."
# I have 200 fastq files.

$ mkdir ../../results_$(date +%F)
# results_2021-01-21
```

--

- It's often convenient to save the result in a *variable*:

```sh
$ n_files=$(ls *fastq | wc -l)
$ echo "$n_files"
```

.content-box-info[
Variables don't always have to be quoted like above (`$n_files` vs `"$n_files"`), but it is good practice as failing to do so *can* lead to problems. (Also: a variable can be put inside braces: `${n_files}`.)
]

---
class: inverse middle center
name: renaming

# Programmatically renaming files

----

<br><br><br>

---

## Programmatically renaming files

- There are many different ways to do this in the shell – admittedly none as easy as one might have hoped. Here, we'll use the `basename` command and a `for` loop.

- `basename` strips any dir name that may be present from a file name (path), and optionally, removes any suffix too:

```sh
$ basename selection/sample001_R1_001.fastq
# sample001_R1_001.fastq

# Specify a suffix to remove:
$ basename selection/sample001_R1_001.fastq "_001.fastq"
# sample001_R1
```

---

## Programmatically renaming files (cont.)

`for` loops are a verbose method for tasks like renaming, but are intuitive and good practice.
- The following shows the basic syntax of a for loop, and how the variable that is looped over is *assigned* as `oldname`, but *recalled* as `$oldname`:

```sh
$ for oldname in *.fastq; do
>    echo "Old name: $oldname"
> done
```

--

.content-box-info[
Note how we are using *globbing* (wildcard file name matching) to define what will be looped over.
]

---

## Programmatically renaming files (cont.)

`for` loops are a verbose method for tasks like renaming, but are intuitive and good practice.

- Next, we assign a new name for each file, and `echo` the names as a sanity check:

```sh
$ for oldname in *.fastq; do
>    newname=$(basename "$oldname" _001.fastq).fq
>    echo "Old/new name: $oldname $newname"
> done
```

---

## Programmatically renaming files (cont.)

`for` loops are a verbose method for tasks like renaming, but are intuitive and good practice.

- Finally, we add the actual `mv` command:

```sh
$ for oldname in *.fastq; do
>    newname=$(basename "$oldname" _001.fastq).fq
>    echo "Old/new name: $oldname $newname"
>    mv "$oldname" "$newname"
> done
```

---
class: inverse middle center
name: permissions

# Treating data as read-only:
# Viewing and modifying file permissions

----

<br><br><br>

---

## Showing file permissions

- To show file permissions, use **`ls`** with the **`-l`** (*long* format) option. (The command below also uses the **`-a`** option to show all files, including hidden ones, and **`-h`** to show file sizes in "human-readable" format.)
<p align="center">
<img src=img/long-ls.png width="700">
</p>

---

## File permission notation in `ls -l`

<br>

<p align="center">
<img src=img/permissions.svg width="50%">
</p>

---

## Changing file permissions

This can be done in several ways with the `chmod` command:

```sh
# chmod <who>=<permission-to-set>:
$ chmod a=rwx file.txt   # a: all / u: user / g: group / o: others

# chmod <who>+<permission-to-add>:
$ chmod a+r file.txt

# chmod <who>-<permission-to-remove>:
$ chmod o-w file.txt
```

--

- Using numbers:

.pull-left[
| Nr | Permission     |
|----|----------------|
| 1  | e**x**ecute    |
| 2  | **w**rite      |
| 4  | **r**ead       |
| 5  | r + x          |
| 6  | r + w          |
| 7  | r + w + x      |
]

.pull-right[
Three numbers are needed to specify permissions for user, group, and others:

```sh
# rwx permissions for all:
$ chmod 777 file.txt

# rwx for me, r for others:
$ chmod 744 file.txt
```
]

---

## File permissions – Making your data read-only

**To make your raw data read-only, use any one of:**

```sh
$ chmod a=r data/raw/*   # *a*ll = *r*ead
$ chmod u-w data/raw/*   # Remove write permission for *u*ser
$ chmod 400 data/raw/*   # User: read // group & others: none
```

--

<br>

.content-box-info[
**Permissions for directories**

One tricky aspect of file permissions is that to enter a directory, and to access its files or list them in detail, you need *execute* permissions for the dir! This is something to take into account when you want to grant others access to your project, e.g. at OSC.

To set execute permissions *only for dirs* (and for files that are already executable) throughout a dir hierarchy:

```sh
chmod -R a+X my-project-dir      # Note the *uppercase* X
```
]

---

## <i class="fa fa-user-edit"></i> Changing file permissions: In practice

- First, let's look at the original permissions of the files:

```sh
# Assuming you are still in week02/zmays-snps/data/seqs
$ ls -l
```

- Then, limit permissions for **a**ll to **r**ead, and check what we did:

```sh
$ chmod a=r *fastq*
$ ls -l
```

--

- What happens when we try to remove write-protected files?
```sh
$ rm *fastq*
$ rm -f *fastq*
```

.content-box-info[
While *you* (the file owner) can still remove write-protected files, others really can't read/write/execute files without appropriate permissions.
]

---
class: inverse middle center
name: bonus

# Questions?

----

<br> <br> <br>

.left[
### Bonus slides:
- ### [Special characters and escaping them](#escape)
- ### **[Working with symbolic links](#symlinks)**
- ### [More conditional file processing](#conditional)
- ### [Two more tricks from Buffalo Ch. 3](#tricks)
]

---
background-color: #f2f5eb
name: escape

## Special characters and escaping them

We saw that wildcards like **`*`** have a special meaning. **What if we had terrible file names, and needed to match a *literal* `*`**?

- A **backslash** ("**`\`**") can be used to "escape" the special meaning of the character that follows it:

```sh
$ ls
# 'my bad*file'
$ rm my\ bad\*file      # Tab autocomplete will work and insert \'s
```

- **Quoting** a string of characters will also escape the special meaning of characters within it, with single quotes (**`'...'`**) being stricter than double quotes (**`"..."`**); more on this later.

```sh
$ rm 'my bad*file'
```

.content-box-info[
Backslashes can also *assign* special meaning – e.g. **`\t`** (tab) and **`\n`** (newline). We'll talk more about these when we cover *regular expressions* later.
]

---
class: inverse middle center
name: symlinks

# Using files across projects:
# Using symbolic links

----

<br><br><br>

---
background-color: #f2f5eb

## Creating symbolic links: single files

- A symbolic (or soft) link only links to the *path* of the original file, whereas a hard link directly links to the *contents* of the original file. Modifying a file via either a hard or soft link *will* modify the original file.
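
A minimal sketch of that difference between hard and soft links, run in a throwaway directory created with `mktemp -d`. The file names (`original.txt`, `hardlink.txt`, `symlink.txt`) are made up for illustration:

```sh
# Work in a temporary directory so no real files are touched
demo_dir=$(mktemp -d)
cd "$demo_dir"

echo "line 1" > original.txt
ln original.txt hardlink.txt             # Hard link: a second name for the same contents
ln -s "$PWD/original.txt" symlink.txt    # Symlink: points to the *path* of the original

# Appending via either link modifies the original file
echo "line 2" >> hardlink.txt
echo "line 3" >> symlink.txt
cat original.txt

# After removing the original, the hard link still holds the contents,
# whereas the symlink points to a path that no longer exists
rm original.txt
cat hardlink.txt
ls -l symlink.txt
```

Here, `ls -l` still lists the symlink and the path it points to, even though that path is gone.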
--

- Create a *symlink* to a file using `ln -s <source-file> [<link-name>]`:

```sh
# Only provide source: create link of the same name in the wd:
$ ln -s ~/share/scripts/align.sh

# The link can also be given an arbitrary name/path:
$ ln -s ~/share/scripts/align.sh align_from-shared.sh
```

--

.content-box-info[
It's best to always **use an absolute path for the source file(s)**. The `$PWD` environment variable can come in handy to do so:

```sh
$ ln -s $PWD/shared_scripts/align.sh project1/scripts/
```
]

---
background-color: #f2f5eb

## Creating symbolic links: multiple files

- Link to multiple files in a directory at once:

```sh
$ ln -s $PWD/shared_scripts/* project1/scripts/
```

- Link to a directory:

```sh
$ ln -s $PWD/shared_scripts/ project1/scripts/
$ ln -s $PWD/shared_scripts/ project1/scripts/ln-shared-scripts
```

--

.content-box-warning[
Be careful when linking to *directories*: you are creating a **point of entry to the original dir**. Therefore, even if you enter via the symlink, you are interacting with the original files:

**Do not use `rm -rf symlink-to-dir/`!** With a trailing slash (which Tab autocomplete adds), this can remove the original files. Instead, `rm symlink-to-dir` or `unlink symlink-to-dir` will only remove the link.
]

---
background-color: #f2f5eb

## Creating symbolic links: practice

.content-box-diy[
Create a symbolic link in your `$HOME` dir that points to your personal dir in the project dir (`/fs/ess/PAS1855/users/$USER`).

If you don't provide a name for the link, it will be your username (why?), which is not particularly informative about its destination. Therefore, give it a name that makes sense to you, like `PLNTPTH8300-SP21` or `pracs-sp21`.
]

--

.content-box-answer[
```sh
$ ln -s /fs/ess/PAS1855/users/$USER $HOME/PLNTPTH8300-SP21
```
]

--

.content-box-q[
What will happen if you do `rm -rf $HOME/PLNTPTH8300-SP21`?
]

--

.content-box-answer[
Exactly as written, without a trailing slash, only the link itself is removed. With a trailing slash (`rm -rf $HOME/PLNTPTH8300-SP21/`), the content of the original dir can be removed, so it is safest not to use `rm -rf` on symlinks at all.
]

---
background-color: #f2f5eb
name:conditional

## More conditional file processing

**To remove all empty files in a directory:**

```sh
# `echo` only prints the rm commands as a check; remove it to actually delete
$ for file in *; do
>    if [ ! -s "$file" ]
>    then
>        echo rm "$file"
>    fi
> done
```

An overview of *file test operators* that can be used in this way:

| Operator | Returns true if:
|----------|----------------
| `-s` | file is not zero size
| `-e` | file exists
| `-f` | file is a regular file (e.g. not a directory)
| `-d` | file is a directory
| `-h` | file is a symbolic link
| `-r`/ `-w` / `-x` | file has read/write/execute permissions

---
background-color: #f2f5eb
name:conditional

## More conditional file processing

**To remove all empty files in a directory:**

```sh
$ for file in *; do
>    if [ ! -s "$file" ]; then
>        echo rm "$file"
>    fi
> done
```

<br>

.content-box-info[
Constructions similar to the one above can be applied to many tasks, though this particular task can be achieved much more concisely using `find`:

```sh
$ find . -size 0 -delete
```
]

---
background-color: #f2f5eb
name:tricks

## Two more tricks from Buffalo Chapter 3

### `tee`

Redirect standard out to file *and* to screen / into a pipe:

```sh
$ program1 in.txt | tee intermediate-file.txt | program2 > out.txt
```

<br>

### `tail -f`

"Follow" a file – monitor output of a long-running command:

```sh
$ tail -f <log-file>      # Press `Ctrl + C` to stop
```
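
Returning to `tee`: a small sketch of the pipeline pattern above, with `sort` and `wc -l` standing in for `program1` and `program2`. The file names are made up for illustration, and everything happens in a temporary directory:

```sh
# Work in a temporary directory so no real files are touched
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'sample3\nsample1\nsample2\n' > in.txt

# tee saves the sorted lines to a file *and* passes them on through the pipe
sort in.txt | tee sorted.txt | wc -l > n_lines.txt

cat sorted.txt      # The intermediate (sorted) output
cat n_lines.txt     # The final output: the number of lines
```

This way, the intermediate result is kept on disk without having to run the first command twice.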