Say you try to pipe ls
output into the imaginary program process_qc
:
$ ls *.fq | process_fq
Like we've seen before, any spaces in file names would lead to trouble, since those file names would be erroneously split into separate arguments.
Say you try to pipe ls
output into the imaginary program process_qc
:
$ ls *.fq | process_fq
Like we've seen before, any spaces in file names would lead to trouble, since those file names would be erroneously split into separate arguments.
This problem wouldn't occur with direct globbing:
$ process_fq *.fq
However, a different problem could occur here: if the number of globbed files is very high, you can get an error that the "argument list is too long".
find
and xargs
find
and xargs
can deal with both of these problems.
More generally:
find
is useful for file searching (and processing) when you need more
than what ls
can do and/or are worried about spaces.
xargs
is useful for multi-file and parallel processing without loops.
Let's create some files to try these commands:
$ mkdir -p zmays-snps/{data/seqs,scripts,analysis}$ touch zmays-snps/data/seqs/zmays{A,B,C}_R{1,2}.fastq
find
basicsThe basic syntax for find
is:find [path] [expression-1] [expression-2] ...
.
find
will print all files by default, like ls
–
but unlike ls
, it is recursive by default:
$ find zmays-snps # zmays-snps is the path we are operating on#> zmays-snps/#> zmays-snps/data#> zmays-snps/data/seqs#> zmays-snps/data/seqs/zmaysC_R1.fastq#> zmays-snps/data/seqs/zmaysA_R1.fastq#> zmays-snps/data/seqs/zmaysB_R1.fastq#> zmays-snps/data/seqs/zmaysB_R2.fastq#> zmays-snps/data/seqs/zmaysC_R2.fastq#> zmays-snps/data/seqs/zmaysA_R2.fastq#> zmays-snps/scripts#> zmays-snps/analysis$ ls zmays-snps#> analysis data scripts
find
basics # Let's move into the zmays-snps dir: $ cd zmays-snps
Using the -name
option, we can also glob like we would do with ls
:
$ find data/seqs -name "zmaysB*fastq"#> data/seqs/zmaysB_R1.fastq#> data/seqs/zmaysB_R2.fastq
find
functionality beyond that of ls
We can use -type
as an expression to limit results by file type:
# "-type f" to only match regular files $ find data/seqs -name "zmaysB*fastq" -type f
find
functionality beyond that of ls
We can use -type
as an expression to limit results by file type:
# "-type f" to only match regular files $ find data/seqs -name "zmaysB*fastq" -type f
In the previous command, the two expressions were implicitly connected with
a logical and: find regular files with "zmaysB*fastq" in the name.
We could also do this explicitly using the -and
operator:
$ find data/seqs -name "zmaysB*fastq" -and -type f
Or we could connect expression with a logical or using the -or
operator:
$ find data/seqs -name "zmaysA*fastq" -or -name "zmaysC*fastq"
find
functionality beyond that of ls
We can also negate expressions – but note that we need to quote the !
so it doesn't get evaluated by the shell:
# Find regular fies that do NOT match "zmaysC*fastq"$ find data/seqs -type f "!" -name "zmaysC*fastq"
Say we have some files with temp in their names that we want to ignore:
$ touch data/seqs/zmays{A,C}_R{1,2}-temp.fastq$ find data/seqs -type f "!" -name "zmaysC*fastq" \ -and "!" -name "*-temp*"
find
functionality beyond that of ls
We can also negate expressions – but note that we need to quote the !
so it doesn't get evaluated by the shell:
# Find regular fies that do NOT match "zmaysC*fastq"$ find data/seqs -type f "!" -name "zmaysC*fastq"
Say we have some files with temp in their names that we want to ignore:
$ touch data/seqs/zmays{A,C}_R{1,2}-temp.fastq$ find data/seqs -type f "!" -name "zmaysC*fastq" \ -and "!" -name "*-temp*"
But what does a !
mean to the shell? It accesses past commands:
# To re-execute your most recent command:$ !!
find
's resultsNow, we want to remove these temporary files,
which we can do using -exec
followed by our rm
command
(using rm -i
for interactive removal):
$ find data/seqs -name "*-temp.fastq" -exec rm -i {} \;
Here, {}
is the placeholder for the file names we are passing on,
and we have to indicate the end of our command with an escaped semicolon: \;
.
-exec
can be very useful to process file queries that are more
complicated than what a simple globbing pattern allows for.
find
's resultsNow, we want to remove these temporary files,
which we can do using -exec
followed by our rm
command
(using rm -i
for interactive removal):
$ find data/seqs -name "*-temp.fastq" -exec rm -i {} \;
Here, {}
is the placeholder for the file names we are passing on,
and we have to indicate the end of our command with an escaped semicolon: \;
.
-exec
can be very useful to process file queries that are more
complicated than what a simple globbing pattern allows for.
With -exec
, you can use any command to operate on the files that you found.
But as it happens, because it is such a common operation to remove files found
with find
, there is a -delete
shortcut for this:
find zmays-snps/data/seqs -name "*-temp.fastq" -delete
xargs
xargs
is to be read/pronounced as x-args
, where "args" means "arguments".
This function passes arguments, supplied via standard in, on to another command. For example:
$ find data/seqs -name "*-temp.fastq" | xargs rm
This is equivalent to the following, where rm
receives the very same list
of files as its arguments:
$ rm data/seqs/*-temp.fastq
xargs
xargs
is to be read/pronounced as x-args
, where "args" means "arguments".
This function passes arguments, supplied via standard in, on to another command. For example:
$ find data/seqs -name "*-temp.fastq" | xargs rm
This is equivalent to the following, where rm
receives the very same list
of files as its arguments:
$ rm data/seqs/*-temp.fastq
However, now, we can bypass the maximum number of arguments to rm
,
because we can easily pass one argument at a time with -n 1
:
$ find data/seqs -name "*-temp.fastq" | xargs -n 1 rm
This command runs rm
as many times as there are "*-temp.fastq" files.
xargs
By using xargs
instead of find -exec
, we separate the process that finds
the files we want to operate on, from the actual operations.
This can be beneficial, for instance if we want to carefully inspect a list of files before we do anything:
$ touch data/seqs/zmays{A,C}_R{1,2}-temp.fastq# We save the find results in a text file$ find data/seqs -name "*-temp.fastq" > files-to-delete.txt$ cat files-to-delete.txt#> data/seqs/zmaysC_R1-temp.fastq#> data/seqs/zmaysA_R1-temp.fastq#> data/seqs/zmaysA_R2-temp.fastq#> data/seqs/zmaysC_R2-temp.fastq# Now we pass that list of files to rm:$ cat files-to-delete.txt | xargs rm
xargs
with replacement stringsWe can also specify where we want to insert the arguments that are being passed on, as is often necessary:
$ find data/seqs -name "*.fastq" | \ xargs basename -s ".fastq" | \ xargs -I{} fastq_stat --in {}.fastq --out results/{}.txt
Above, we use -I{}
to specify that the arguments will be passed on with the
placeholder {}
. As you can see, we are also able to use {}
multiple times.
xargs
with replacement stringsWe can also specify where we want to insert the arguments that are being passed on, as is often necessary:
$ find data/seqs -name "*.fastq" | \ xargs basename -s ".fastq" | \ xargs -I{} fastq_stat --in {}.fastq --out results/{}.txt
Above, we use -I{}
to specify that the arguments will be passed on with the
placeholder {}
. As you can see, we are also able to use {}
multiple times.
This functionality is similar to what you may be inclined to do in a loop instead – the above is equivalent to:
$ for $fastq_file in data/seqs/*fastq; do fastq_short=$(basename $fastq_file .fastq) fastq_stat --in $fastq_file --out results/$fastq_short.txt done
xargs
and parallelizationOne advantage of using xargs
instead of a for
loop is that we have control
over parallelization.
To parallelize, i.e. to spawn multiple simultaneous process,
we can send each process to the background in the for
loop using &
:
$ for $fastq_file in *fastq; do fastq_short=$(basename $fastq_file .fastq) fastq_stat --in $fastq_file --out results/$fastq_short.txt &done
However, this would spawn exactly as many processes as there are input files, which could be way more than what we want.
xargs
and parallelizationOne advantage of using xargs
instead of a for
loop is that we have control
over parallelization.
To parallelize, i.e. to spawn multiple simultaneous process,
we can send each process to the background in the for
loop using &
:
$ for $fastq_file in *fastq; do fastq_short=$(basename $fastq_file .fastq) fastq_stat --in $fastq_file --out results/$fastq_short.txt &done
However, this would spawn exactly as many processes as there are input files, which could be way more than what we want.
With xargs
, we can control the number of simultaneous processes that
are spawned using -P
(here set to 6):
$ find data/seqs -name "*.fastq" | \ xargs basename -s ".fastq" | \ xargs -P 6 -I{} fastq_stat --in {}.fastq --out results/{}.txt
xargs
and spaces in file namesAs always, spaces in file names can trip you, but you were promised a solution
with find
and xargs
.
But out of the box, this does in fact fail:
$ touch "samples A.txt" "samples B.txt" $ find . -name "samples [AB].txt" | xargs rm#> rm: cannot remove './samples': No such file or directory#> rm: cannot remove 'A.txt': No such file or directory#> rm: cannot remove './samples': No such file or directory#> rm: cannot remove 'B.txt': No such file or directory
The solution is to use -print0
with find
, so it prints files with a
"null byte" as the delimiter.
Then, we also need to tell xargs
about this, using -0
:
$ find . -name "samples [AB].txt" -print0 | xargs -0 rm -v#> removed './samples A.txt'#> removed './samples B.txt'
If you like the xargs
approach to running multiple processes,
you may also be interested in the parallel
command,
which is even more powerful than xargs
.
I personally don't do this very much because I tend to submit separate batch jobs
for each file, which can easily be done with simple for
loops.
Say you try to pipe ls
output into the imaginary program process_qc
:
$ ls *.fq | process_fq
Like we've seen before, any spaces in file names would lead to trouble, since those file names would be erroneously split into separate arguments.
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |