You can do exercises 1 and 2 in a shell either at OSC or on your local computer: whichever you prefer. In exercise 3, you will be transferring files to or from OSC, which should be done from a local shell.
In this exercise, you’ll download a set of files associated with a genome assembly of one organism from NCBI. As an example, I’m using the genome assembly files for Phytophthora nicotianae var. parasitica (also known simply as P. parasitica), one of several species of fungus that causes buckeye rot of tomato.
Note, if you are more interested in getting the equivalent files for another organism, you should be able to follow the same steps.
Go to https://www.ncbi.nlm.nih.gov. NCBI has a number of different databases, including “SRA” for short next-generation sequencing reads, and “PubMed” for publications. In the drop-down menu that says All Databases next to the Search box, select Assembly and type Phytophthora parasitica: we want to search for whole-genome assemblies for this species.
Click the assembly in the prominent box at the top that says Phyt_para_CJ02B3_V1.
Here is the link to the assembly, in case things look different for you: https://www.ncbi.nlm.nih.gov/assembly/GCA_000509465.1. (But again, you can also download files from a different organism or assembly.)
Now, you should be on the assembly page. There is a big blue button Download Assembly, but if you click on it, you can see that you can only download one file at a time. Since we want all 20 or so files associated with the assembly, we will use wget instead. Note that this would scale easily even if we wanted files for multiple assemblies (for instance, there are 9 different assemblies for P. parasitica, as the page reports).
Click on FTP directory for GenBank assembly in the top list on the right hand side.
Use a wget command to download all the files you see on the assembly page, using the URL in the address bar.
In addition to the option for downloading multiple files at once (and optionally, other options to optimize the download), make sure you use the following options:
--no-parent — Stop wget from moving up via the Parent Directory link at the top of the page, which would lead it to start downloading files from other assemblies…
-e robots=off — Ignore robots.txt to enable the download (as specified in NCBI’s download FAQ).
If you see errors saying that filenames are too long, you’ll have to use the --no-host-directories and --cut-dirs=6 options as further detailed in the Hints, because all the nested directories that are being created may end up creating paths that are too long.
Your command should be structured as follows:
$ wget --no-parent -e robots=off <other-options> <link-to-page>The link to the page is: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/509/465/GCA_000509465.1_Phyt_para_CJ02B3_V1/
Use the --recursive option to download more than one file at once.
By default, the files will be placed inside several levels of directories:
# ftp.ncbi.nlm.nih.gov/
# └── genomes
# └── genbank
# └── protozoa
# └── Phytophthora_parasitica
# └── latest_assembly_versions
# └── GCA_000509525.1_Phyt_para_IAC_01_95_V1To avoid this, you can use the options --no-host-directories and --cut-dirs=6 as suggested in the NCBI’s download FAQ. And if you do so, you’ll also want to specify an arbitrary target dir name with -P, e.g. -P ncbi_Pparasiticus.
If you want to play around with different options, I suggest you use the --accept option to only select one or two small files to be downloaded, e.g. --accept "*assembly_stats*".
Don’t hesitate to look at the solution if your first attempts fail. The minutiae of wget commands aren’t terribly important – the next exercise, while easier, is more so.
Move inside the directory that you’ve downloaded where all the assembly files are. You should be seeing over 20 files, including GCA_000509525.1_Phyt_para_IAC_01_95_V1_assembly_report.txt and GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna.gz.
md5checksums.txt. Use this file in a single md5sum command to verify that all the downloaded files are identical to the source, i.e. that they have the correct checksums.
Use the -c (check) option to md5sum to cross-check the checksums of the files in your directory with those in the file you specify.
Count the number of lines in the genome sequence (GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic.fna.gz) that contain CAGCAGCAG, without unzipping the file.
(Side note in case you are wondering about the fna extension: this explicitly denotes a FASTA file with nucleotide sequences, as opposed to one with amino acid sequences, .faa.)
zgrep will allow you to search in a gzipped file.-c option will count matches.
gunzip <filename>.gzip <filename>.ls or du.
You can simply use the same command you used in the first step of this exercise.
Transfer the entire directory you downloaded above to OSC (in case you have been working locally so far) or to your local computer (in case you have been working at OSC so far).
You can use rsync and/or SFTP, whichever you prefer to get some practice with.
Edit Fri, Feb 26: We did not end up discussing SFTP in class. But if you are interested in getting experience with this (at OSC, it is recommended for larger transfers since the transfer does not use a login node like scp andrsync do), have a look at this tutorial. The hint below will tell you how you can log in to the OSC SFTP server.
rsync
sftp
Log in using:
sftp sftp.osc.eduLike many commands, the put and get functions need the -r option to work recursively.
In sftp, the ~ shortcut for your home dir does not work.
wget command to download all these files at once.
The wget command with the fewest options that works for our goal is:
$ wget --recursive -e robots=off --no-parent \
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/protozoa/Phytophthora_parasitica/latest_assembly_versions/GCA_000509525.1_Phyt_para_IAC_01_95_V1/These options (--recursive -e robots=off --no-parent) were all discussed in the question and the hints.
If you wanted to place these files in a dir with a name of your choice, and no unnecessarily nest directories (see the hint above), you could use:
--no-host-directories in combination with --cut-dirs=6 will remove the 6 levels of nested directories you got earlier (ftp.ncbi.nlm.nih.gov/genomes/genbank/protozoa/Phytophthora_parasitica/latest_assembly_versions).
-P ncbi_Pparasiticus/ to name your directory.
Furthermore, since we are not interested in index.html (the actual HTML we are looking at on the web), we can exclude it using --reject "index.html".
All in all, this would result in the following command:
$ wget --recursive -e robots=off --no-parent \
--reject "index.html" \
--no-host-directories --cut-dirs=6 \
-P ncbi_Pparasiticus/ \
https://ftp.ncbi.nlm.nih.gov/genomes/genbank/protozoa/Phytophthora_parasitica/latest_assembly_versions/GCA_000509525.1_Phyt_para_IAC_01_95_V1/ This should result in the following set of files being downloaded (one is directory, in fact):
$ ls ncbi_Pparasiticus # Or whatever dir your downloads are in
#> annotation_hashes.txt
#> assembly_status.txt
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_assembly_report.txt
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_assembly_stats.txt
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_assembly_structure
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_feature_count.txt.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_feature_table.txt.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic.fna.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic_gaps.txt.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic.gbff.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic.gff.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic.gtf.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_protein.faa.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_protein.gpff.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_rm.out.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_rm.run
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_rna_from_genomic.fna.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_translated_cds.faa.gz
#> GCA_000509525.1_Phyt_para_IAC_01_95_V1_wgsmaster.gbff.gz
#> index.html.tmp
#> md5checksums.txt
#> README.txtmd5sum.
The -c option, when provided with a filename, will check the checksums of the files in your current directory against those in the file:
$ md5sum -c md5checksums.txt
#> ./annotation_hashes.txt: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_assembly_report.txt: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_assembly_stats.txt: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_assembly_structure/Primary_Assembly/component_localID2acc: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_assembly_structure/Primary_Assembly/scaffold_localID2acc: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_assembly_structure/Primary_Assembly/unplaced_scaffolds/AGP/unplaced.scaf.agp.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_assembly_structure/Primary_Assembly/unplaced_scaffolds/FASTA/unplaced.scaf.fna.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_feature_count.txt.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_feature_table.txt.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic.fna.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic_gaps.txt.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic.gbff.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic.gff.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_genomic.gtf.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_protein.faa.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_protein.gpff.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_rm.out.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_rm.run: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_rna_from_genomic.fna.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_translated_cds.faa.gz: OK
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_wgsmaster.gbff.gz: OKIf all went well, all files should say OK.
If there was a mismatch somewhere, you should also see a separate warning at the end of the output, like:
#> md5sum: WARNING: 1 computed checksum did NOT matchCAGCAGCAG.
$ zgrep -c "CAGCAGCAG" GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna.gz
#> 1426
$ ls -lh GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna.gz
#> -rw-rw-r-- 1 jelmer jelmer 8.8M Dec 21 2017 GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna.gz
$ gunzip GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna.gz
$ ls -lh GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna
#> -rw-rw-r-- 1 jelmer jelmer 38M Dec 21 2017 GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna
$ gzip GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fnaThe zipped file is less than a quarter of the size of the unzipped file.
$ md5sum -c md5checksums.txt
#> ... (showing the pertinent part of the output)
#> ./GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna.gz: FAILED
#> ...
#> md5sum: WARNING: 1 computed checksum did NOT matchNo, it no longer matched! This is merely due to the fact that the gzipped file contains metadata about when it was zipped.
However, this is another reason not to modify any original files. If you needed an unzipped file, it would be better to unzip it to a separate file using the -c option:
$ gunzip -c GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fna.gz > \
GCA_000509525.1_Phyt_para_IAC_01_95_V1_cds_from_genomic.fnarsync
Regardless of the direction of the transfer, these commands should be executed in a local shell.
From local to OSC, assuming that your files were downloaded in a dir called ncbi_Pparasiticus in your home dir, your username is me, and you made a dir week07/exercises/ for these exercises at OSC (note: rsync will not create “missing” directories!):
$ rsync -avz --progress \
~/ncbi_Pparasiticus \
me@pitzer.osc.edu:/fs/ess/PAS1855/users/me/week07/exercises/The above will result in you having an OSC dir /fs/ess/PAS1855/users/me/week07/exercises/ncbi_Pparasiticus. If you would have included a trailing slash in ~/ncbi_Pparasiticus above, you would have copied the contents of ncbi_Pparasiticus directly into the exercises dir, which is probably not what you wanted.
From OSC to local, assuming that your files were downloaded in a dir called ncbi_Pparasiticus in /fs/ess/PAS1855/users/me/week07/exercises/ your username is me, and you copy the downloaded dir straight into your home dir:
$ rsync -avz --progress \
me@pitzer.osc.edu:/fs/ess/PAS1855/users/me/week07/exercises/ncbi_Pparasiticus \
~The above will result in you having a local dir ~/ncbi_Pparasiticus. If you would have included a trailing slash in ~/ncbi_Pparasiticus above, you would have copied the contents of ncbi_Pparasiticus directly into your home dir, which is probably not what you wanted.
From local to OSC:
Assuming that your local home ($HOME / ~) directory is /home/me, and your OSC username is me:
$ sftp sftp.osc.edu
sftp> put -r /home/me/ncbi_Pparasiticus/ /fs/ess/PAS1855/users/me/week07/exercises/The above will result in you having an OSC dir /fs/ess/PAS1855/users/me/week07/exercises/ncbi_Pparasiticus.
From OSC to local
Assuming that your local home ($HOME / ~) directory is /home/me, and your OSC username is me:
$ sftp sftp.osc.edu
sftp> get -r /fs/ess/PAS1855/users/me/week07/exercises/ncbi_Pparasiticus/ /home/me/The above will result in you having a local dir ~/ncbi_Pparasiticus.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".