Additional data management notes

Optional self-study content for week 9

Author

Jelmer Poelstra

Published

October 20, 2025



This page is still under construction


1 Checking file integrity with checksums

When you receive your FASTQ files from a sequencing facility, or download files from a public repository, a small text file will often accompany these files. This file will have a name along the lines of md5.txt, md5checksums.txt, or shasums.txt.

Such a file contains so-called checksums, a sort of digital fingerprint for files, which can be used to check whether your copy of these files is completely intact. Checksums are extremely compact summaries of files, computed so that even if just one character is changed in the data, the checksum will be different.

More on checksums

Several algorithms, each with an associated shell command, can compute checksums. As in our case, you’ll most often see MD5 checksums accompany genomic data files; these can be computed and checked with the md5sum command (the newer SHA-1 checksums can be computed and checked with the very similar shasum command).

Checksums consist of hexadecimal characters only: numbers and the letters a-f.

We typically compute or check checksums for one or more files, but we can even do it for a string of text — the example below demonstrates that the slightest change in a string (or file alike) will generate a completely different checksum:

echo "bioinformatics is fun" | md5sum
010b5ebf7e207330de0e3fb0ff17a85a  -
echo "bioinformatic is fun" | md5sum
45cc2b76c02b973494954fd664fc0456  -

Let’s take a look at our checksums — the file has one row per file and two columns, the first with the checksum and the second with the corresponding file name:

head -n 4 /fs/ess/PAS0471/jelmer/assist/2023-08_hy/data/fastq/md5.txt
54224841f172e016245843d4a8dbd9fd        X10790_Cruz-MonserrateZ_Panc1_vec_V1N_1_S31_R2_001.fastq.gz
cf22012ae8c223a309cff4b6182c7d62        X10790_Cruz-MonserrateZ_Panc1_vec_V1N_1_S31_R1_001.fastq.gz
647a4a15c0d55e56dd347cf295723f22        X10797_Cruz-MonserrateZ_Miapaca2_RASD1_V1N_1_S38_R2_001.fastq.gz
ce5d444f8f9d87d325dbe9bc09ef0470        X10797_Cruz-MonserrateZ_Miapaca2_RASD1_V1N_1_S38_R1_001.fastq.gz

This file was created by the folks at the sequencing facility, and now that we have the data at OSC and are ready to analyze it, we can check whether the files are still fully intact and didn’t, for example, get incompletely transferred.
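For reference, here is how such a checksum file is created in the first place. This is a self-contained sketch using tiny made-up FASTQ files in a throwaway dir; the facility would have run the equivalent of the final md5sum command on the real files:

```shell
# Work in a throwaway dir with two tiny stand-in FASTQ files
cd "$(mktemp -d)"
printf '@read1\nACGT\n+\nIIII\n' | gzip > sampleA.fastq.gz
printf '@read2\nTTGG\n+\nIIII\n' | gzip > sampleB.fastq.gz

# Compute checksums for all FASTQ files and save them in md5.txt
# (running this from inside the dir records bare file names, which is
#  what allows 'md5sum -c md5.txt' to work from this dir later)
md5sum *.fastq.gz > md5.txt
cat md5.txt   # One row per file: a 32-character checksum, then the file name
```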

I have done this check for the original files, but it takes a little while, so for a quick exercise, we can instead do it with our subsampled FASTQ files. First, let’s copy the file md5sums.txt from the demo directory, which contains the checksums for the subsampled FASTQ files as I created them:

cp /fs/ess/PAS0471/demo/202307_rnaseq/data/fastq/md5sums.txt data/fastq/

To check whether the checksums in such a file correspond to those of the files themselves, we can run md5sum -c <md5sum-file>, and should do so while inside the dir with the files of interest1. For example:

cd data/fastq
md5sum -c md5sums.txt 
ASPC1_A178V_R1.fastq.gz: OK
ASPC1_A178V_R2.fastq.gz: OK
ASPC1_G31V_R1.fastq.gz: OK
ASPC1_G31V_R2.fastq.gz: OK
Miapaca2_A178V_R1.fastq.gz: OK
Miapaca2_A178V_R2.fastq.gz: OK
Miapaca2_G31V_R1.fastq.gz: OK
Miapaca2_G31V_R2.fastq.gz: OK

If there were any differences, the md5sum command would clearly warn you about them, as you can see in the exercise below.

Let’s compute a checksum for the README.md file and save it in a file:

# Assuming you went into data/fastq above;
# you need to be in /fs/ess/PAS0471/$USER/rnaseq-intro
cd ../..

md5sum README.md > md5sum_for_README.txt

cat md5sum_for_README.txt
d4c4a8df4870f68808553ac0f5484aa3  README.md

Now, let’s add a line to our README.md that says where the reference genome files are:

# (You'll need single quotes like below, or the shell will interpret the backticks)
echo 'Files for the GRCh38.p14 human genome are in the `data/ref` dir' >> README.md

tail -n 3 README.md
and columns specifying the read direction, sample ID, cell line, and variant.

Files for the GRCh38.p14 human genome are in the `data/ref` dir

Finally, let’s check the checksum, and watch it fail:

md5sum -c md5sum_for_README.txt
README.md: FAILED
md5sum: WARNING: 1 computed checksum did NOT match

The NCBI FTP directory for the human genome also contains a file with checksums, md5checksums.txt.

Let’s download it – we’ll use the -P option to tell wget to put it directly in the data/ref dir:

wget -P data/ref https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/001/405/GCF_000001405.40_GRCh38.p14/md5checksums.txt

It’s a little harder to check file integrity in this case, because we only have two of the several dozen files listed in this md5checksums.txt file (namely, all those in the FTP dir), and to make matters worse, we decompressed our copies, so their checksums no longer match those in the file (which were computed on the .gz files):

cd data/ref

grep "GCF_000001405.40_GRCh38.p14_genomic.gtf.gz" md5checksums.txt
f573144e507a9fd85150cf6a3c8f8471  ./GCF_000001405.40_GRCh38.p14_genomic.gtf.gz
grep "GCF_000001405.40_GRCh38.p14_genomic.fna.gz" md5checksums.txt
c30471567037b2b2389d43c908c653e1  ./GCF_000001405.40_GRCh38.p14_genomic.fna.gz
md5sum GCF*
689762f267eafe361b6ee4b21638eb51  GCF_000001405.40_GRCh38.p14_genomic.fna
a5274984906df2cc65319dfc1b307a01  GCF_000001405.40_GRCh38.p14_genomic.gtf
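In a case like this, one option is to compare checksums file by file: grep each file’s entry out of the checksum file and compare it to a freshly computed checksum. Below is a self-contained sketch of that idea using throwaway stand-in files; for the real genome files, you’d need the still-compressed .gz copies, since NCBI’s checksums were computed on those.

```shell
# Set up a throwaway dir with two small gzipped files and a checksum
# file that lists them as ./<name>, like NCBI's md5checksums.txt does
cd "$(mktemp -d)"
printf 'ACGT\n' | gzip > genome.fna.gz
printf 'gene1\n' | gzip > annot.gtf.gz
md5sum ./genome.fna.gz ./annot.gtf.gz > md5checksums.txt

# Compare each file of interest entry-by-entry
for file in genome.fna.gz annot.gtf.gz; do
    expected=$(grep "$file" md5checksums.txt | awk '{print $1}')
    observed=$(md5sum "$file" | awk '{print $1}')
    [ "$expected" = "$observed" ] && echo "$file: OK" || echo "$file: FAILED"
done
# genome.fna.gz: OK
# annot.gtf.gz: OK
```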

3 File compression

Given that the ls size output is not informative for dirs, how can you find out the total size of a dir and all its contents? Use the du (disk usage) command – for example:

  • Get the total size for a single dir:

    # -h: human-readable file sizes / -s: summarize (print one total per argument)
    du -hs fastq/
    941M    fastq/
  • Get the total size for all top-level dirs as well as the current working dir:

    # -d 1: summarize sizes at a depth of 1 (= top-level dirs below current)
    du -h -d 1
    1.0K    ./meta
    5.1M    ./ref
    941M    ./fastq
    946M    ./.git
    1.9G    .
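A small extension of the above (assuming your sort supports -h, as GNU and recent BSD versions do): piping du’s output through sort -h ranks the dirs by size, with the largest entries at the bottom. A self-contained sketch with throwaway dirs:

```shell
# Create a throwaway dir with a small and a large subdir
cd "$(mktemp -d)"
mkdir small big
head -c 1000 /dev/zero > small/file       # ~1 KB
head -c 2000000 /dev/zero > big/file      # ~2 MB

# sort -h understands the human-readable size suffixes (K/M/G)
du -h -d 1 | sort -h
```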

Let’s start with downloading two files to play around with:

wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr21.fa.gz
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg19/chromosomes/chr22.fa.gz

3.1 Decompress files

When you download data, it is often compressed with gzip (“gzipped”, .gz extension).

Many programs can work with gzipped files directly, but sometimes you do need to unzip them, which can be done with gunzip:

# gunzip unzips in place - the original zipped file disappears:
gunzip chr21.fa.gz
ls chr21*
chr21.fa
# To keep the original, output to stdout with "-c" and redirect:
gunzip -c chr22.fa.gz > chr22.fa
ls chr22*
chr22.fa  chr22.fa.gz

3.2 Compress files

Conversely, to zip files, use gzip:

gzip chr21.fa
ls chr21*
chr21.fa.gz
# As with unzipping, use -c and redirect to keep the original:
gzip -c chr22.fa > chr22_copy.fa.gz
ls chr22*
chr22_copy.fa.gz  chr22.fa  chr22.fa.gz

Often, a program will output unzipped data to standard out, which we can immediately pipe to gzip so we only need to write the compressed file to disk (beneficial because reading/writing to files is time-consuming):

trimmer in.fastq.gz | gzip > out.fastq.gz

(Note that because the input to gzip comes from standard in, the output will by default be to standard out!)
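A minimal self-contained illustration of this stdin/stdout behavior, using echo as a stand-in for a real program writing to standard out:

```shell
cd "$(mktemp -d)"
# gzip reads from stdin and, with no file argument, writes to stdout,
# so we redirect its output into a file; zcat reverses the round trip
echo "some text" | gzip > text.gz
zcat text.gz
# some text
```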

3.3 Working with compressed data directly

Several familiar shell commands have z counterparts that can work with gzipped files directly, such as zgrep and zcat.

  • zgrep is grep for gzipped files:
zgrep -i -n --color "AGATAGATATAT" chr22.fa.gz
589810:TATTGCAGGTAAGATGGGGCCACTCAGTACTTTAAAAAGATAGATATATA
596134:tctatctatatatagatagatatattgtagatatatctatctatatatat
966248:tatagatagatatataaaggggagtttattaagtattaactcacatgatc
  • zcat is cat for gzipped files:
zcat chr22.fa.gz | head
>chr22
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Concatenating files means combining them into a single file, simply printing them back-to-back. When it is okay to concatenate files2, this can be done very easily.

To add a second FASTQ file to an existing gzipped file,
we can simply use >>:

ls
in.fastq.gz in2.fastq
gzip -c in2.fastq >> in.fastq.gz

Even more conveniently, if we have two gzipped FASTQ files for the same sample but from different Illumina lanes, we can concatenate the gzipped files directly:

cat smpA_L001_R1.fastq.gz smpA_L002_R1.fastq.gz > smpA_R1.fastq.gz
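That this works is a property of the gzip format: a file made of multiple concatenated gzip streams is itself valid gzip. A quick self-contained check with throwaway files:

```shell
# Two separately gzipped "FASTQ" files, concatenated at the compressed level
cd "$(mktemp -d)"
printf 'read1\n' | gzip > part1.fastq.gz
printf 'read2\n' | gzip > part2.fastq.gz
cat part1.fastq.gz part2.fastq.gz > combined.fastq.gz

# Decompressing the combined file yields both parts, back-to-back
zcat combined.fastq.gz
# read1
# read2
```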

tar

TBA

4 More on file transfer to and from OSC

4.1 Remote transfer commands

For small transfers, an alternative to the OnDemand Files menu is using remote transfer commands like scp, rsync, or rclone. These commands can be more convenient than OnDemand, especially if you want to keep certain directories synced between OSC and your computer.

Because these transfers will go through a login node at OSC, these commands are unfortunately not recommended for large transfers3.

scp

One option is scp (secure copy), which works much like the regular cp command, including the need for the -r option for recursive transfers. The key difference is that we have to somehow refer to a path on a remote computer, and we do so by starting with the remote computer’s address, followed by :, and then the path:

# Copy from remote (OSC) to local (your computer):
scp <user>@pitzer.osc.edu:<remote-path> <local-path>

# Copy from local (your computer) to remote (OSC)
scp <local-path> <user>@pitzer.osc.edu:<remote-path>

Here are two examples of copying from OSC to your local computer:

# Copy a file from OSC to a local computer - namely, to your current working dir ('.'):
scp jelmer@pitzer.osc.edu:/fs/ess/PAS0471/jelmer/mcic-scripts/misc/fastqc.sh .

# Copy a directory from OSC to a local computer - namely, to your home dir ('~'):
scp -r jelmer@pitzer.osc.edu:/fs/ess/PAS0471/jelmer/mcic-scripts ~

And two examples of copying from your local computer to OSC:

# Copy a file from your computer to OSC --
# namely, a file from your current working dir to your home dir at OSC:
scp fastqc.sh jelmer@pitzer.osc.edu:~

# Copy a file from my local computer's Desktop to the Scratch dir for PAS0471:
scp /Users/poelstra.1/Desktop/fastqc.sh jelmer@pitzer.osc.edu:/fs/scratch/PAS0471

Some nuances for remote copying:

  • For both transfer directions (remote-to-local and local-to-remote), you issue the copying commands from your local computer.
  • The path for the remote computer (OSC) should always be absolute, but the path for your local computer can be relative or absolute.
  • Since all files are accessed at the same paths at Pitzer and at other clusters, it doesn’t matter whether you use @pitzer.osc.edu or e.g. @cardinal.osc.edu in the scp command.

If your OneDrive is mounted on or synced to your local computer (i.e., if you can see it in your computer’s file browser), you can also transfer directly between OSC and OneDrive. For example, the path to my OneDrive files on my laptop is:
/Users/poelstra.1/Library/CloudStorage/OneDrive-TheOhioStateUniversity.
So if I had a file called fastqc.sh in my top-level OneDrive dir, I could transfer it to my Home dir at OSC as follows:

scp /Users/poelstra.1/Library/CloudStorage/OneDrive-TheOhioStateUniversity/fastqc.sh jelmer@pitzer.osc.edu:~

rsync

Another option, which I recommend, is the rsync command, especially when you have directories that you repeatedly want to sync: rsync won’t copy any files that are identical in source and destination.

A useful combination of options is -avz --progress:

  • -a enables archival mode (among other things, this makes it work recursively).
  • -v increases verbosity — tells you what is being copied.
  • -z enables compressed file transfer (=> generally faster).
  • --progress shows transfer progress for individual files.

The way to refer to remote paths is the same as with scp. For example, I could copy a dir_with_results in my local Home dir to my OSC Home dir as follows:

rsync -avz --progress ~/dir_with_results jelmer@pitzer.osc.edu:~

One tricky aspect of using rsync is that the presence/absence of a trailing slash for source directories makes a difference for its behavior. The following commands work as intended — to create a backup copy of a scripts dir inside a dir called backup4:

# With trailing slash: copy the *contents* of source "scripts" into target "scripts":
rsync -avz scripts/ backup/scripts

# Without trailing slash: copy the source dir "scripts" into target dir "backup"
rsync -avz scripts backup

But these commands don’t:

# This would result in a dir 'backup/scripts/scripts':
rsync -avz scripts backup/scripts

# This would copy the files in "scripts" straight into "backup":
rsync -avz scripts/ backup

4.2 Command-line SFTP

The first of two options for larger transfers is SFTP. You can use the sftp command when you have access to a Unix shell on your computer, and this is what I’ll cover below.

Logging in

To log in to OSC’s SFTP server, issue the following command in your local computer’s terminal, substituting <user> by your OSC username:

sftp <user>@sftp.osc.edu   # E.g., 'jelmer@sftp.osc.edu'
The authenticity of host 'sftp.osc.edu (192.148.247.136)' can't be established.
ED25519 key fingerprint is SHA256:kMeb1PVZ1XVDEe2QiSumbM33w0SkvBJ4xeD18a/L0eQ.
This key is not known by any other names
Are you sure you want to continue connecting (yes/no/[fingerprint])?

If this is your first time connecting to OSC SFTP server, you’ll get a message like the one shown above: you should type yes to confirm.

Then, you may be asked for your OSC password, and after that, you should see a “welcome” message like this:

******************************************************************************

This system is for the use of authorized users only.  Individuals using
this computer system without authority, or in excess of their authority,
are subject to having all of their activities on this system monitored
and recorded by system personnel.  In the course of monitoring individuals
improperly using this system, or in the course of system maintenance,
the activities of authorized users may also be monitored.  Anyone using
this system expressly consents to such monitoring and is advised that if
such monitoring reveals possible evidence of criminal activity, system
personnel may provide the evidence of such monitoring to law enforcement
officials.

******************************************************************************
Connected to sftp.osc.edu.

Now, you will have an sftp prompt (sftp>) instead of a regular shell prompt.

Familiar commands like ls, cd, and pwd will operate on the remote computer (OSC, in this case), and there are local counterparts for them: lls, lcd, lpwd — for example:

# NOTE: I am prefacing sftp commands with the 'sftp>' prompt to make it explicit
#       these should be issued in an sftp session; but don't type that part.
sftp> pwd
Remote working directory: /users/PAS0471/jelmer
sftp> lpwd
Local working directory: /Users/poelstra.1/Desktop

Uploading files to OSC

To upload files to OSC, use sftp’s put command.

The syntax is put <local-path> <remote-path>, and unlike with scp etc., you don’t need to include the address of the remote computer (because in an sftp session, you are already connected to both computers). But as with cp and scp, you’ll need the -r flag for recursive transfers, i.e. transferring a directory and its contents.

# Upload fastqc.sh in a dir 'scripts' on your local computer to the PAS0471 Scratch dir:
sftp> put scripts/fastqc.sh /fs/scratch/PAS0471/sandbox

# Use -r to transfer directories:
sftp> put -r scripts /fs/scratch/PAS0471/sandbox

# You can use wildcards to upload multiple files:
sftp> put scripts/*sh /fs/scratch/PAS0471/sandbox
sftp is rather primitive

The ~ shortcut to your Home directory does not work in sftp! sftp is generally quite primitive: you also can’t use, for example, tab completion or recall previous commands with the up arrow.

Downloading files from OSC

To download files from OSC, use the get command, which has the syntax get <remote-path> <local-path>. This is the reverse of put in that the remote path comes first, but the same in that both use the order <source> <target>, like cp and scp.

For example:

sftp> get /fs/scratch/PAS0471/mcic-scripts/misc/fastqc.sh .

sftp> get -r /fs/scratch/PAS0471/sandbox/ .

Closing the SFTP connection

When you’re done, you can type exit or press Ctrl+D to exit the sftp prompt.

4.3 Globus

The second option for large transfers is Globus, which has a browser-based GUI, and is especially your best bet for very large transfers. Some advantages of using Globus are that:

  • It checks whether all files were transferred correctly and completely.
  • It can pause and resume automatically when you e.g. turn off your computer for a while.
  • It can be used to share files from OSC directly with collaborators, even at different institutions.

Globus does need some setup, including the installation of a piece of software that will run in the background on your computer.

Back to top

Footnotes

  1. This technically depends on how the file names are shown in the text file with the checksums: if there are just file names without directories (or ./<filename>, etc.), you’ll have to be in the dir with the files to run md5sum -c. (This in turn depends on from where the checksums were generated: if you generate them while in the dir with the focal files, which is the only sensible way to do this, that’s how they will be displayed.)↩︎

  2. This does mean you’ll lose information about the origin of the original files.↩︎

  3. This may be different at other supercomputer centers: there are no inherent transfer size limitations to these commands.↩︎

  4. For simplicity, these commands are copying between local dirs, which is also possible with rsync.↩︎