The goal of the final project is for you to apply some of the things you have learned during this course and to get more practice. While some aspects are required in order to get a good grade (see Graded aspects below), you have a fair amount of freedom. I recommend that you take advantage of that to make your final project as useful as possible for your own research and/or personal development.
Graded aspects for the project focus on documentation, reproducibility, and automation. Accordingly, there are no real requirements for the level of complexity or sophistication, the number of scripts, the real-world usefulness, and so on. I would in fact recommend you take care not to be too ambitious in this regard: start small and then expand if you are able to.
Some examples of possible types of projects:
You may want to work with your own data or a publicly available dataset that is similar to what you are expecting to work with (though subsetting the data will be useful if you have a large genomic dataset!).
You may decide that just focusing on the coding and on creating a reproducible workflow is what is most useful for you, and that this may be easier with a trivial and maybe not-so-relevant dataset so you don’t get bogged down in other details.
You may think of trying to automate something “boring” and/or repetitive that you did manually until now.
If you would like to work with publicly available fungal genomic data, you should consider using your TA Zach’s MycoTools. You can find some general information in his README and more detailed usage information in his usage guide. If this looks interesting to you, contact Zach to get more information and to develop a project idea.
If you need help with selecting a dataset or a project topic, don’t hesitate to contact me (Jelmer). I don’t currently have any ready-to-go project ideas with accompanying data sets, but I could make one or more, if needed.
Graded aspects of your project focus on appropriate usage of many of the tools and principles we covered during the course.
To receive a high grade, your project should:
Be well-organized: contained in a single parent directory with a clear and sensible structure of subdirectories, descriptive file and directory names, no files floating around with unclear purpose or source, and so on.
Be well-documented, with at least one README in Markdown format in the root directory of your project, and preferably with additional READMEs elsewhere as appropriate.
Be version-controlled with Git throughout, with regular, meaningful commits. You will also need to push to GitHub for the proposal, draft, and final submission checkpoints.
Contain scripts in Bash and/or Python that do data processing and/or analysis (see also the last point). Try not to do any manual work such as editing a data file in a text editor or Excel to fix column names, since this hinders reproducibility.
Having some components in other languages is fine, in case you know how to do things there that we didn’t learn in the course (for example, doing some plotting in R).
Similarly, it is fine to call external, command-line programs (e.g. some of the bioinformatics programs we have run, or any program that may be useful for your research).
If your project’s heavy lifting consists mostly or almost entirely of calling such programs, though, it becomes especially important that you are creating a reproducible and easily rerunnable pipeline (see below).
Run one or more scripts as SLURM jobs at OSC.
(A locally run project could be okay if your research does not and will not require the use of supercomputers like those at OSC – check in with me if this is the case and you would therefore prefer not to use OSC.)
Be easily re-runnable using a Snakefile or a “master” / “controller” script that glues the entire project together.
It is not enough to include a README.md that explains what you did or that provides instructions on how to rerun the project, even if in detail. Such information is very useful, but the goal here is to provide a way to rerun the project with a single command, one that I could also run.
Preferably, this is done using a single Snakemake control script (“Snakefile”), but a shell or Python script may also work.
(This is a challenging aspect, and your grade won’t plummet if you don’t succeed, but it is important to try!)
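To make the idea of a “master” / “controller” script more concrete, here is a minimal Python sketch. It is only an illustration and not a required template: the three steps are placeholders that just print a message, and the function name `run_pipeline` is made up for this example. In a real project, each step would call one of your own scripts.

```python
#!/usr/bin/env python3
"""Sketch of a controller script that reruns a project's steps in order."""
import subprocess
import sys

# Placeholder steps (assumptions for illustration only). In a real project,
# each entry would call one of your own scripts, e.g. ["bash", "scripts/trim.sh"].
STEPS = [
    [sys.executable, "-c", "print('step 1: download data')"],
    [sys.executable, "-c", "print('step 2: clean data')"],
    [sys.executable, "-c", "print('step 3: make plots')"],
]


def run_pipeline(steps):
    """Run each command in order and stop at the first failure.

    Returns the number of steps that completed successfully.
    """
    for n_done, cmd in enumerate(steps):
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(
                f"Step {n_done + 1} failed with exit code {result.returncode}",
                file=sys.stderr,
            )
            return n_done
    return len(steps)


if __name__ == "__main__":
    n_ok = run_pipeline(STEPS)
    # Exit with a non-zero code so that a failure is visible to the caller
    if n_ok < len(STEPS):
        sys.exit(1)
```

Saved under a name such as `run_all.py` (a hypothetical file name), the whole project could then be rerun with the single command `python3 run_all.py`. Note that a script like this simply reruns every step each time, whereas Snakemake also keeps track of which outputs are already up to date.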
Information about expectations for each of the steps (checkpoints) for the project will be provided in the content for individual weeks. Below is an overview which will be updated with the links:
Proposal – due March 23
Draft – due April 13
In-class presentations – April 20 and 22
Final submission – due April 30
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".