Exercises: Week 2

Main exercises

Exercise 1: Course notes in Markdown

Create a Markdown document with course notes. I recommend writing this document in VS Code.

Make notes of this week’s material in some detail. If you have notes from last week in another format, include those too. (And try to keep using this document throughout the course!)

Some pointers:

Exercise 2

While doing this exercise, save the commands you use in a text document – either write in a text document in VS Code and send the commands to the terminal, or copy them into a text document later.

  1. Getting set up
    Create a directory for this exercise and make it your working directory. You can do this either in your $HOME dir (e.g. ~/pracs-sp21/w02/ex2/) or in your personal dir in the course’s project dir (/fs/ess/PAS1855/users/$USER/w02/ex2/).

  2. Create a disorganized mock project
    Using the touch command and brace expansions, create a mock project consisting of hundreds of empty files, either in a single directory or in a disorganized directory structure.

    If you want, you can create file types according to what you typically have in your project – otherwise, a suggestion is to create files with:

    • Raw data (e.g. .fastq.gz)
    • Reference data (e.g. .fasta)
    • Metadata (e.g. .txt or .csv)
    • Processed data and results (e.g. .bam, .out)
    • Scripts (e.g. .sh, .py or .R)
    • Figures (e.g. .png or .eps)
    • Notes (.txt and/or .md)
    • Perhaps some other file type you usually have in your projects.
  3. Organize the mock project
    Organize the mock project according to some of the principles we discussed this week.

    Even while adhering to these principles, there is plenty of wiggle room and no single perfect dir structure: what is optimal will depend on what works for you and on the project size and structure. Therefore, think about what makes sense to you, and what makes sense given the files you find yourself with.

    Try to use as few commands as possible to move the files – use wildcards!

  4. Change file permissions
    Make sure no-one has write permissions for the raw data files. You can also change other permissions to what you think is reasonable or necessary precaution for your fictional project.

Hints

Use the chmod command to change file permissions and recall that you can use wildcard expansion to operate on many files at once.

See this week’s slides for an overview of file permissions and the chmod command.

Alternatively, chmod also has an -R argument to act recursively: that is, to act on dirs and all of their contents (including other dirs and their contents).
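
The difference between the two approaches can be sketched in a throwaway directory, so that nothing real is touched. The directory and file names (raw, batch1, a.fastq.gz, b.fastq.gz) are made up for illustration, and the starting permissions are set explicitly so the result does not depend on your umask.

```shell
# Scratch directory with one file at the top level and one in a subdir
demo=$(mktemp -d)
mkdir -p "$demo/raw/batch1"
touch "$demo/raw/a.fastq.gz" "$demo/raw/batch1/b.fastq.gz"
chmod 644 "$demo/raw/a.fastq.gz" "$demo/raw/batch1/b.fastq.gz"
chmod 755 "$demo/raw" "$demo/raw/batch1"

# Approach 1: a wildcard only changes what it matches,
# so the file inside batch1/ is left alone
chmod a-w "$demo"/raw/*.fastq.gz
perm_file=$(ls -l "$demo/raw/a.fastq.gz" | cut -c 1-10)
perm_nested_before=$(ls -l "$demo/raw/batch1/b.fastq.gz" | cut -c 1-10)

# Approach 2: -R changes the dir itself and everything below it
chmod -R a-w "$demo/raw"
perm_dir=$(ls -ld "$demo/raw" | cut -c 1-10)
perm_nested_after=$(ls -l "$demo/raw/batch1/b.fastq.gz" | cut -c 1-10)

echo "top-level file:        $perm_file"
echo "nested file (before):  $perm_nested_before"
echo "raw dir (after -R):    $perm_dir"
echo "nested file (after):   $perm_nested_after"

# Clean up: give write permission back first, or rm cannot delete the contents
chmod -R u+w "$demo"
rm -rf "$demo"
```

Note that removing write permission from a directory also prevents adding, deleting, or renaming files inside it, which may or may not be what you want for a raw data dir.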

  5. Create mock alignment files
    Create a directory alignment inside an appropriate dir in your project (e.g. analysis, results, or a dir for processed data), and create files for all combinations of 30 samples (01-30), 5 treatments (A-E), and 2 dates (08-14-2020 and 09-16-2020), like so: sample01_A_08-14-2020.sam - sample30_E_09-16-2020.sam.

    These 300 files can be created with a single touch command. (If you already had an alignment dir, first delete its contents or rename it.)

Hints

Use brace expansion three times: to expand sample IDs, treatments, and dates.

Note that a sequence with a zero-padded start, such as {01..30}, will zero-pad the single-digit numbers.
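
A small illustration of how consecutive brace expansions combine, using echo to preview the result before creating anything. The names (plot, x, y, a, b) are made up and deliberately unrelated to the exercise; zero-padding in sequences assumes Bash 4 or newer.

```shell
# Zero-padded sequence: the padding of the start value is kept
padded=$(echo {08..10})
echo "$padded"        # 08 09 10

# Two consecutive expansions produce every combination
combos=$(echo plot{1..2}_{x,y}.png)
echo "$combos"        # plot1_x.png plot1_y.png plot2_x.png plot2_y.png

# Three expansions multiply: 3 x 2 x 2 = 12 words
n=$(echo {01..03}_{a,b}_{x,y} | wc -w)
echo "$n"
```

Previewing with echo first is a cheap way to check an expansion before handing it to touch.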

  6. Rename files in a batch
    Whoops! We stored the alignment files that we created in the previous step as SAM files (.sam), but this was a mistake – the files are actually the binary counterparts of SAM files: BAM files (.bam).

    Move into the dir with your misnamed BAM files, and use a for loop to rename them: change the extension from .sam to .bam.

Hints

  7. Copy files with wildcards
    Still in the dir with your alignment files (now renamed to .bam), create a new dir called subset. Then, using a single cp command, copy files that satisfy the following conditions into the subset dir:

    • The sample ID/number should be 01-19;

    • The treatment should be A, B, or C.

    Create a README.md in the dir that explains what you did.

Hints

Just like you used multiple consecutive brace expansions above, you can use two consecutive wildcard character sets ([]) here.
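
To illustrate the syntax only, here is a sketch with made-up file names (item1a.txt to item3c.txt) in a scratch directory. Each [] set matches exactly one character, and two sets in a row match two consecutive characters.

```shell
start_dir=$PWD
demo=$(mktemp -d)
cd "$demo"
touch item{1..3}{a..c}.txt    # 9 files: item1a.txt ... item3c.txt

# [1-2] matches one character in the range 1-2; [ac] matches either a or c
matched=$(echo item[1-2][ac].txt)
echo "$matched"               # item1a.txt item1c.txt item2a.txt item2c.txt

# Return to where we started and clean up
cd "$start_dir"
rm -rf "$demo"
```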

  8. Bonus: a trickier wildcard selection
    Still in the dir with your alignment files, create a new dir subset2. Then, copy all files except the ones for “sample28” into this dir.

    Do so using a single cp command, though you’ll need two separate wildcard expansion or brace expansion arguments (as in cp wildcard-selection1 wildcard-selection2 destination/).

Hints

  9. Bonus: a trickier renaming loop
    You now realize that your date format is suboptimal (MM-DD-YYYY; which gave 08-14-2020 and 09-16-2020) and that you should use the YYYY-MM-DD format. Use a for loop to rename the files.
Hints

  10. Create a README
    Include a README.md that describes what you did; again, try to get familiar with Markdown syntax by using formatting liberally.

Bonus exercises

Exercise 3

If you feel like it would be good to reorganize one of your own, real projects, you can do so using what you’ve learned this week. Make sure you create a backup copy of the entire project first!

Buffalo Chapter 3 code-along

Move back to /fs/ess/PAS1855/users/$USER and download the repository accompanying the Buffalo book using git clone https://github.com/vsbuffalo/bds-files.git. Then, move into the new dir bds-files, and code along with Buffalo Chapter 3.


Solutions

Exercise 2

(1.) Getting set up

mkdir -p ~/pracs-sp21/w02/ex2/ # or similar, whatever dir you chose; -p also creates missing parent dirs
cd !$                          # !$ is a shortcut that recalls the last argument of the previous command
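
Because !$ relies on history expansion, it works at the interactive prompt but not inside a script. If you are saving your commands to run later as a script, one option in Bash is the special parameter $_, which also holds the last argument of the previous command. A sketch, using a scratch directory as a stand-in for the real course path:

```shell
start_dir=$PWD
scratch=$(mktemp -d)           # stand-in for your home or project dir

mkdir -p "$scratch/w02/ex2"
cd "$_"                        # $_ = last argument of the previous command
where=$PWD
echo "$where"

# Return to where we started and clean up
cd "$start_dir"
rm -rf "$scratch"
```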

(2.) Create a disorganized mock project

An example:

touch sample{001..150}_{F,R}.fastq.gz
touch ref.fasta ref.fai
touch sample_info.csv sequence_barcodes.txt
touch sample{001..150}{.bam,.bam.bai,_fastqc.zip,_fastqc.html} gene-counts.tsv DE-results.txt GO-out.txt
touch fastqc.sh multiqc.sh align.sh sort_bam.sh count1.py count2.py DE.R GO.R KEGG.R
touch Fig{01..05}.png all_qc_plots.eps weird-sample.png
touch dontforget.txt README.md README_DE.md tmp5.txt
touch slurm-84789570.out slurm-84789571.out slurm-84789572.out

(3.) Organize the mock project

An example:

mkdir -p data/{raw,meta,ref}
mkdir -p results/{alignment,counts,DE,enrichment,logfiles,qc/figures}
mkdir -p scripts
mkdir -p figures/{ms,sandbox}
mkdir -p doc/misc
mv *fastq.gz data/raw/
mv ref.fa* data/ref/
mv sample_info.csv sequence_barcodes.txt data/meta/
mv *.bam *.bam.bai results/alignment/
mv *_fastqc.* results/qc/   # the underscore keeps the script fastqc.sh out of the match
mv gene-counts.tsv results/counts/
mv DE-results.txt results/DE/
mv GO-out.txt results/enrichment/
mv *.sh *.R *.py scripts/
mv README_DE.md results/DE/
mv Fig[0-9][0-9]* figures/ms
mv weird-sample.png figures/sandbox
mv all_qc_plots.eps results/qc/figures/
mv dontforget.txt tmp5.txt doc/misc/
mv slurm* results/logfiles/

(4.) Change file permissions

To ensure that no-one has write permission for the raw data, you could, for example, use:

chmod a=r data/raw/*   # set permissions for "a" (all) to "r" (read) only

# Or, to remove only the write permission and leave the others unchanged:
chmod a-w data/raw/*   # take away "w" (write) permissions for "a" (all)

(5.) Create mock alignment files

mkdir -p results/alignment
# rm results/alignment/*   # Uncomment if the dir already contains files, as in the example above
cd results/alignment

# Create the files:
touch sample{01..30}_{A..E}_{08-14-2020,09-16-2020}.sam

# Check if we have 300 files:
ls | wc -l
# 300

(6.) Rename files in a batch

# cd results/alignment  # If you weren't already there

# Use *globbing* to match the files to loop over (rather than `ls`):
for oldname in *.sam
do
   # Remove the `sam` suffix using `basename "$oldname" sam`,
   # use command substitution (`$()` syntax) to catch the output of the
   # `basename` command, and paste `bam` at the end:
   newname=$(basename "$oldname" sam)bam
   
   # Report what we have:
   # (Using `-e` with echo, we can print an extra newline with '\n',
   # to separate files by an empty line)
   echo "Old name: $oldname"
   echo -e "New name: $newname \n"
   
   # Execute the move:
   mv "$oldname" "$newname"
done
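
As a variation on the loop above, the new name can also be built with shell parameter expansion instead of basename: ${oldname%.sam} removes a trailing .sam from the value of the variable. The sketch below runs in a scratch directory with three made-up file names (s1.sam to s3.sam).

```shell
start_dir=$PWD
demo=$(mktemp -d)
cd "$demo"
touch s1.sam s2.sam s3.sam

for oldname in *.sam
do
    # Strip the .sam suffix and append .bam
    newname="${oldname%.sam}.bam"
    mv "$oldname" "$newname"
done

renamed=$(echo *)
echo "$renamed"               # s1.bam s2.bam s3.bam

# Return to where we started and clean up
cd "$start_dir"
rm -rf "$demo"
```

This avoids starting one basename process per file, which hardly matters for 300 files but is a common idiom in longer scripts.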

A couple of take-aways:

(7.) Copy files with wildcards

mkdir subset
cp sample[0-1][0-9]_[A-C]* subset/
echo "On $(date), created a dir 'subset' and copied only files for samples 01-19 \
and treatments A-C into this dir" > subset/README.md

# See if it worked:
cat subset/README.md

(8.) Bonus: a trickier wildcard selection

mkdir subset2
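# Either of the following commands will do (run only one).
# Option 1, using brace expansion:
cp sample{01..27}* sample{29..30}* subset2/

# Option 2, using wildcard character sets ([^2] matches any character except 2):
cp sample[^2]* sample2[^8]* subset2/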
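
One way to convince yourself that both selections pick the same files is to count the matches with ls before copying anything. The sketch below assumes a scratch directory with a single mock file per sample (sample01_A.bam to sample30_A.bam), rather than the full set of 300 files.

```shell
start_dir=$PWD
demo=$(mktemp -d)
cd "$demo"
touch sample{01..30}_A.bam    # one mock file per sample

# Count what each selection matches; both should skip only sample28
n_brace=$(ls sample{01..27}* sample{29..30}* | wc -l)
n_glob=$(ls sample[^2]* sample2[^8]* | wc -l)
echo "brace: $n_brace, glob: $n_glob"

# Confirm that sample28 is not among the matches
# (grep -c exits non-zero when the count is 0, hence the || true)
has_28=$(ls sample[^2]* sample2[^8]* | grep -c "sample28" || true)
echo "sample28 matches: $has_28"

# Return to where we started and clean up
cd "$start_dir"
rm -rf "$demo"
```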

(9.) Bonus: a trickier renaming loop

for oldname in *.bam
do
   # Use `cut` to extract month, day, year, and a "prefix" that contains
   # the sample number and the treatment, and save these using command substitution:
   month=$(echo "$oldname" | cut -d "_" -f 3 | cut -d "-" -f 1)
   day=$(echo "$oldname" | cut -d "_" -f 3 | cut -d "-" -f 2)
   year=$(basename "$oldname" .bam | cut -d "_" -f 3 | cut -d "-" -f 3)
   prefix=$(echo "$oldname" | cut -d "_" -f 1-2)
   
   # Paste together the new name:
   # (This will fail without quotes around prefix, because the underscore
   # is then interpreted as being part of the variable name.)
   newname="$prefix"_"$year"-"$month"-"$day".bam
   
   # Report what we have:
   echo "Old name: $oldname"
   echo -e "New name: $newname \n"
   
   # Execute the move:
   mv "$oldname" "$newname"
done

This renaming task can be done more succinctly using regular expressions and the sed command – we’ll learn about both of these topics later in the course.
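
As a preview only, here is a sketch of what the date swap could look like with sed. It assumes sed -E (extended regular expressions, supported by both GNU and BSD sed) and works on a single file name as text, without renaming anything.

```shell
oldname="sample01_A_08-14-2020.bam"

# Capture MM, DD and YYYY as groups 1-3, then write them back as YYYY-MM-DD
newname=$(echo "$oldname" | sed -E 's/([0-9]{2})-([0-9]{2})-([0-9]{4})/\3-\1-\2/')
echo "$newname"               # sample01_A_2020-08-14.bam
```

Inside a renaming loop, the resulting name would then be passed to mv just as in the solution above.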

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".