class:inverse middle center ## *Week 2: <br> Project Organization and Markdown* ---- # Part I: Project Organization <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021-01-19 --- ## Goals for this week - This presentation: - Learn some best practices for project organization, documentation, and management. <br> - [Second slide deck (today)]((https://mcic-osu.github.io/pracs-sp21/posts/week-02/slides/02-2-shfiles.html): - Get to know our text editor, *VS Code*, a bit better. - Learn how to use *Markdown* for documentation (and beyond). <br> - [Third slide deck (Thursday)]((https://mcic-osu.github.io/pracs-sp21/posts/week-02/slides/02-3-md.html): - Learn how to manage files in the Unix shell. --- ## Importance of good project organization <br> **Good project organization and documentation facilitates:** - Collaborating with others (and yourself in the future...) - Reproducibility - Automation - Version control <br> In short, it is a necessary foundation to use this course's tools and to reach some of its goals. --- ## Project organization – some underlying principles - **One dir hierarchy for one project** — in other words: - Don't keep multiple distinct projects inside one dir. - Don't keep files for one project in multiple places. <br> -- - Within the dir hierarchy: - **Separate code from data** - **Separate raw data from processed data** <br> -- - **Treat raw data as read-only** (and treat generated output as disposable). - **Use good dir and file naming**. - **Slow down and document.** --- ## One dir hierarchy for one project When you have a single directory hierarchy for each project, it is: - Easier to: find files, share your project, not throw away stuff in error, ... - Possible to use *relative paths* within a project's scripts => more portable. <br> <p align="center"> <img src=img/proj-dirs-1.svg width=65%> </p> --- ## One dir hierarchy for one project (cont.) <p align="center"> <img src=img/proj-dirs-2.svg width=100%> </p> --- ## One dir hierarchy for one project (cont.) <p align="center"> <img src=img/proj-dirs-3.svg width=100%> </p> --- ## One dir hierarchy for one project (cont.) ### Absolute versus relative path recap: - **Absolute paths** start from the root directory, break when moving a project, and don't generally work across computers. - **Relative paths** start from a *working dir*: when you always use the root of the *project* as the working dir, paths work when moving the project within and between computers. --- ## One dir hierarchy for one project (cont.) ### But how to define and separate projects? From [Wilson et al 2017 - Good Enough Practices in Scientific Computing](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510): > *As a rule of thumb, divide work into projects based on the overlap in data and* > *code files.* > > *If 2 research efforts share no data or code,* > *they will probably be easiest to manage independently.* > > *If they share more than half of their data and code, they are probably best* > *managed together.* > *If you are building tools that are used in several projects,* > *the common code should probably be in a project of its own.* --- ## Projects with shared data or code - To access files outside of the project (e.g., shared across projects), it is easiest to create **links** to these files: <p align="center"> <img src=img/proj-dirs-4.svg width=60%> </p> --- ## Projects with shared data or code (cont.) - But shared data or scripts are generally better stored in separate dirs, and then linked to by each project using them: <p align="center"> <img src=img/proj-dirs-5.svg width=60%> </p> -- .content-box-warning[ These strategies do decrease the portability of your project, and moving the shared files even within your own computer will cause links to break. ] --- ## Projects with shared data or code (cont.) A more portable method is to *keep shared (multi-project) files online* – this is especially feasible for **scripts** under version control: <p align="center"> <img src=img/proj-dirs-6.svg width=70%> </p> -- .content-box-info[ For **data**, this is also possible but often not practical due to file sizes. It's easier after data has been deposited in a public repository. ] --- ## Example project dir structures .pull-left[ <p align="left"> <img src=img/proj-ex1.png width=100%> </p> ] --- ## Example project dir structures .pull-left[ <p align="left"> <img src=img/proj-ex-comb.png width=100%> </p> ] .pull-right[ **Several things depend on preferences and project specifics:** - `data` as single top-level dir? - `results` vs `analysis` (Buffalo) - `src` (source) vs `scripts` vs `workflow` - Where to put figures – `results/plots/qc` or `results/qc/plots/`? ] -- .content-box-info[ **Other best practices:** - Use subdirectories liberally (e.g. within `analysis`, `scripts`). - Also add *READMEs* within dirs. ] --- ## File naming **Three principles for good file names** ([Jenny Bryan](https://speakerdeck.com/jennybc/how-to-name-files)): - Machine-readable - Human-readable - Playing well with default ordering --- ## File naming: Machine-readable Consistent and informative naming helps to *programmatically find and process files.* - In file names, provide **metadata** like Sample ID, date, and treatment: - `sample032_2016-05-03_low.txt` - `samples_soil_treatmentA_2019-01.txt` - Now, you can use **globbing** to select samples from a certain month or treatment (*more about this on Thursday*): ```sh $ ls *2016-05* $ ls *treatmentA* ``` -- .content-box-info[ *Suffices*, like `.txt`, `.md`, and `.csv` are generally not necessary in Unix, but are still useful to distinguish file types easily and programmatically. ] --- ## File naming: Machine-readable (cont.) - Spaces in file names lead to inconvenience or trouble: ```sh $ mkdir "raw sequences" $ mkdir sequences $ rm -rf raw sequences ``` - More generally, only use the following in file names: - Alphanumeric characters <kbd>A-Za-z0-9</kbd> - Underscores <kbd>_</kbd> - Hyphens (dashes) <kbd>-</kbd> - Periods (dots) <kbd>.</kbd> --- ## File naming: Human-readable > *"Name all files to reflect their content or function.* > *For example, use names such as bird_count_table.csv, manuscript.md,* > *or sightings_analysis.py.*" > [Wilson et al. 2017](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005510) --- ## File naming: Combining machine- and human-readable - *One* good way (opinionated recommendations): - Use **underscores <kbd>_</kbd> to delimit units** you may later want to separate on (extract with `cut`, etc.): sampleID, batch, treatment, date. - Within such units, use **dashes <kbd>-</kbd> to delimit words**: `grass-samples`. - Limit the use of **periods <kbd>.</kbd> to indicate file extensions**. - (Generally *avoid capitals*.) <br> - For example: ```sh mmus001_treatmentA_filtered-mq30-only_sorted_dedupped.bam mmus002_treatmentA_filtered-mq30-only_sorted_dedupped.bam . . . mmus086_treatmentG_filtered-mq30-only_sorted_dedupped.bam ``` --- ## File naming: Ordering - Use **leading zeros** for lexicographic sorting: `sample005`. - **Dates** should always be written as `YYYY-MM-DD`: `2020-10-11`. - Can **group similar files together** by starting with same phrase, and **number scripts** according to their execution order: ```sh DE-01_normalize.R DE-02_test.R DE-03_process-significant.R ``` --- ## Slow down and document **Use README files to document:** - Your methods - Where/when/how each data and metadata file originated - Versions of software, databases, reference genomes - *...Everything needed to rerun whole project* <br> .content-box-info[ See this week's Buffalo chapter (Ch. 2) for further details. ] --- ## Slow down and document – Use plain text files **Plain text files offer several benefits over proprietary & binary formats** (like `.docx` and `.xlsx`): - Can be accessed on any computer, including over remote connections - Are future-proof - Allow to be version-controlled <br> -- .content-box-answer[ These considerations apply not just to files for documentation, but also to data files, etc! ] --- class: inverse middle center # Questions? ---- <br> <br> <br> <br> --- background-color: #f2f5eb name: software ## Bonus: Managing software - Keep track of the versions of software that you use! - At OSC, load modules *with* a version specification -- otherwise, you may not know what you used, and default versions changes can go unnoticed. ```sh # Load default -- you can bet this will be a different one in a year or two module load R # Preferred: module load R/4.0.2 ``` - `conda` environments can be useful too, e.g. for maintaining multiple versions of the same software, or being able to run mutually incompatible software. .content-box-info[ We'll talk more about this in the OSC (Week 6) and Snakemake (Week 12) modules. ]