Week 6: Shell scripting


Part III:
find and xargs






Jelmer Poelstra

2021/02/15

1 / 15

Two common (?) problems

  • Say you try to pipe ls output into the imaginary program process_fq:

    $ ls *.fq | process_fq

    Like we've seen before, any spaces in file names would lead to trouble, since those file names would be erroneously split into separate arguments.

  • This problem wouldn't occur with direct globbing:

    $ process_fq *.fq

    However, a different problem could occur here: if the number of globbed files is very high, you can get an error that the "argument list is too long".
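
To get a feel for where that limit lies, a minimal sketch, assuming a POSIX-like system where getconf is available (the variable name arg_max is only illustrative):

```shell
# Ask the system for the maximum combined size (in bytes) of the arguments
# and environment that can be passed to a new process.
# The value differs between systems.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX on this system: ${arg_max} bytes"
```

A glob that expands to more bytes than this is what triggers the "argument list is too long" error.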

2 / 15

find and xargs

find and xargs can deal with both of these problems.

More generally:

  • find is useful for file searching (and processing) when you need more than what ls can do and/or are worried about spaces.

  • xargs is useful for multi-file and parallel processing without loops.



Let's create some files to try these commands:

$ mkdir -p zmays-snps/{data/seqs,scripts,analysis}
$ touch zmays-snps/data/seqs/zmays{A,B,C}_R{1,2}.fastq
3 / 15

find basics

  • The basic syntax for find is:
    find [path] [expression-1] [expression-2] ....

  • find will print all files by default, like ls – but unlike ls, it is recursive by default:

    $ find zmays-snps # zmays-snps is the path we are operating on
    #> zmays-snps
    #> zmays-snps/data
    #> zmays-snps/data/seqs
    #> zmays-snps/data/seqs/zmaysC_R1.fastq
    #> zmays-snps/data/seqs/zmaysA_R1.fastq
    #> zmays-snps/data/seqs/zmaysB_R1.fastq
    #> zmays-snps/data/seqs/zmaysB_R2.fastq
    #> zmays-snps/data/seqs/zmaysC_R2.fastq
    #> zmays-snps/data/seqs/zmaysA_R2.fastq
    #> zmays-snps/scripts
    #> zmays-snps/analysis
    $ ls zmays-snps
    #> analysis data scripts
4 / 15

find basics

# Let's move into the zmays-snps dir:
$ cd zmays-snps


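  • Using the -name option, we can match file names with a glob-style pattern, much like we would with ls (the pattern is quoted so that find, not the shell, expands it):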

    $ find data/seqs -name "zmaysB*fastq"
    #> data/seqs/zmaysB_R1.fastq
    #> data/seqs/zmaysB_R2.fastq
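
Why the quotes around the pattern matter can be sketched as follows. This is an illustration in a throwaway directory created with mktemp; the file names x.fastq and y.fastq are made up for the demo:

```shell
# Scratch directory with one FASTQ file at the top level and one in a subdir
demo_dir=$(mktemp -d)
mkdir "$demo_dir/sub"
touch "$demo_dir/x.fastq" "$demo_dir/sub/y.fastq"

# Quoted: find receives the pattern itself and matches at every level
quoted=$(cd "$demo_dir" && find . -name "*.fastq" | wc -l | tr -d ' ')

# Unquoted: the shell first expands *.fastq to x.fastq (the only match in the
# current directory), so find only searches for files named x.fastq
unquoted=$(cd "$demo_dir" && find . -name *.fastq | wc -l | tr -d ' ')

echo "quoted pattern: $quoted file(s); unquoted pattern: $unquoted file(s)"
rm -r "$demo_dir"
```

If the unquoted glob had matched several files in the current directory, find would instead have stopped with an error, so quoting is the safe habit.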
5 / 15

find functionality beyond that of ls

  • We can use -type as an expression to limit results by file type:

    # "-type f" to only match regular files
    $ find data/seqs -name "zmaysB*fastq" -type f

  • In the previous command, the two expressions were implicitly connected with a logical and: find regular files whose names match "zmaysB*fastq".
    We could also do this explicitly using the -and operator:

    $ find data/seqs -name "zmaysB*fastq" -and -type f
  • Or we could connect expressions with a logical or using the -or operator:

    $ find data/seqs -name "zmaysA*fastq" -or -name "zmaysC*fastq"
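
One thing to watch when mixing the two operators: the (implicit) and binds more tightly than -or, so a trailing -type f only applies to the expression right before it unless the -or part is grouped with escaped parentheses. A minimal sketch in a throwaway directory (the directory zmaysC_dir.fastq is a made-up name, added only to show the difference):

```shell
# Scratch directory: three regular files plus a directory that also matches
demo_dir=$(mktemp -d)
touch "$demo_dir/zmaysA_R1.fastq" "$demo_dir/zmaysB_R1.fastq" "$demo_dir/zmaysC_R1.fastq"
mkdir "$demo_dir/zmaysC_dir.fastq"

# Ungrouped: read as  nameC  OR  (nameA AND type f),
# so the matching directory is reported as well
ungrouped=$(find "$demo_dir" -name "zmaysC*fastq" -or -name "zmaysA*fastq" -type f \
    | wc -l | tr -d ' ')

# Grouped: read as  (nameA OR nameC)  AND  type f
grouped=$(find "$demo_dir" \( -name "zmaysA*fastq" -or -name "zmaysC*fastq" \) -type f \
    | wc -l | tr -d ' ')

echo "ungrouped: $ungrouped matches, grouped: $grouped matches"
rm -r "$demo_dir"
```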
6 / 15

find functionality beyond that of ls

  • We can also negate expressions – but note that it is safest to quote (or escape) the ! so it cannot be interpreted by the shell:

    # Find regular files that do NOT match "zmaysC*fastq"
    $ find data/seqs -type f "!" -name "zmaysC*fastq"
  • Say we have some files with temp in their names that we want to ignore:

    $ touch data/seqs/zmays{A,C}_R{1,2}-temp.fastq
    $ find data/seqs -type f "!" -name "zmaysC*fastq" \
    -and "!" -name "*-temp*"



But what does a ! mean to the shell? It accesses past commands:

# To re-execute your most recent command:
$ !!
7 / 15

Running commands on find's results

  • Now, we want to remove these temporary files, which we can do using -exec followed by our rm command (using rm -i for interactive removal):

    $ find data/seqs -name "*-temp.fastq" -exec rm -i {} \;

    Here, {} is the placeholder for the file names we are passing on, and we have to indicate the end of our command with an escaped semicolon: \;.

  • -exec can be very useful to process file queries that are more complicated than what a simple globbing pattern allows for.

With -exec, you can use any command to operate on the files that you found.

But as it happens, because it is such a common operation to remove files found with find, there is a -delete shortcut for this:

$ find data/seqs -name "*-temp.fastq" -delete
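
As a side note, -exec can also be ended with + instead of \;, in which case find passes many file names to a single invocation of the command rather than running it once per file. A small sketch with echo standing in for a real command, in a throwaway directory with made-up file names:

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/a-temp.fastq" "$demo_dir/b-temp.fastq" "$demo_dir/c-temp.fastq"

# Ending with \; runs echo once per file, giving one output line per file
per_file=$(find "$demo_dir" -name "*-temp.fastq" -exec echo {} \; | wc -l | tr -d ' ')

# Ending with + appends all file names to one echo call, giving a single line
batched=$(find "$demo_dir" -name "*-temp.fastq" -exec echo {} + | wc -l | tr -d ' ')

echo "output lines with semicolon: $per_file, with plus: $batched"
rm -r "$demo_dir"
```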
8 / 15

xargs

  • xargs is to be read/pronounced as x-args, where "args" means "arguments".

    This command reads items from standard input and passes them on as arguments to another command. For example:

    $ find data/seqs -name "*-temp.fastq" | xargs rm
  • This is equivalent to the following, where rm receives the very same list of files as its arguments:

    $ rm data/seqs/*-temp.fastq
  • However, now we can avoid the "argument list is too long" error: xargs splits the incoming list into batches that fit within the system limit, and with -n 1 we can even pass one argument at a time:

    $ find data/seqs -name "*-temp.fastq" | xargs -n 1 rm

    This command runs rm as many times as there are "*-temp.fastq" files.
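
The effect of -n can be seen safely by letting xargs call echo instead of rm. A minimal sketch, with the letters a, b and c as stand-in arguments:

```shell
# Default: xargs packs all arguments into one echo call, so one output line
one_call=$(printf '%s\n' a b c | xargs echo | wc -l | tr -d ' ')

# -n 1: one argument per call, so echo runs three times (three output lines)
three_calls=$(printf '%s\n' a b c | xargs -n 1 echo | wc -l | tr -d ' ')

echo "default: $one_call line(s); with -n 1: $three_calls line(s)"
```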

9 / 15

xargs

By using xargs instead of find -exec, we separate the process that finds the files we want to operate on, from the actual operations.

This can be beneficial, for instance if we want to carefully inspect a list of files before we do anything:

$ touch data/seqs/zmays{A,C}_R{1,2}-temp.fastq
# We save the find results in a text file
$ find data/seqs -name "*-temp.fastq" > files-to-delete.txt
$ cat files-to-delete.txt
#> data/seqs/zmaysC_R1-temp.fastq
#> data/seqs/zmaysA_R1-temp.fastq
#> data/seqs/zmaysA_R2-temp.fastq
#> data/seqs/zmaysC_R2-temp.fastq
# Now we pass that list of files to rm:
$ cat files-to-delete.txt | xargs rm
10 / 15

Using xargs with replacement strings

  • We can also specify where we want to insert the arguments that are being passed on, as is often necessary:

    # basename strips the directory as well, so we add data/seqs/ back for --in
    $ find data/seqs -name "*.fastq" | \
    xargs basename -s ".fastq" | \
    xargs -I{} fastq_stat --in data/seqs/{}.fastq --out results/{}.txt

    Above, we use -I{} to specify that the arguments will be passed on with the placeholder {}. As you can see, we are also able to use {} multiple times.

  • This functionality is similar to what you may be inclined to do in a loop instead – the above is equivalent to:

    $ for fastq_file in data/seqs/*.fastq; do
        fastq_short=$(basename "$fastq_file" .fastq)
        fastq_stat --in "$fastq_file" --out results/"$fastq_short".txt
    done
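
Since fastq_stat is imaginary, one way to try out the replacement string is to put echo in front of the command, so that xargs prints what it would run. A sketch in a throwaway directory, assuming a basename that supports -s (as the GNU and BSD versions do):

```shell
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/data/seqs"
touch "$demo_dir/data/seqs/zmaysA_R1.fastq" "$demo_dir/data/seqs/zmaysB_R1.fastq"

# echo stands in for the imaginary fastq_stat: the commands are only printed.
# sort makes the order of the printed commands predictable.
commands=$(
    find "$demo_dir/data/seqs" -name "*.fastq" |
    xargs basename -s ".fastq" |
    xargs -I{} echo fastq_stat --in data/seqs/{}.fastq --out results/{}.txt |
    sort
)
echo "$commands"
rm -r "$demo_dir"
```

Each input line produces one printed command, with {} filled in at both positions.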
11 / 15

xargs and parallelization

One advantage of using xargs instead of a for loop is that we have control over parallelization.

  • To parallelize, i.e. to spawn multiple simultaneous processes, we can send each process to the background in the for loop using &:

    $ for fastq_file in data/seqs/*.fastq; do
        fastq_short=$(basename "$fastq_file" .fastq)
        fastq_stat --in "$fastq_file" --out results/"$fastq_short".txt &
    done

    However, this would spawn exactly as many processes as there are input files, which could be way more than what we want.

  • With xargs, we can control the number of simultaneous processes that are spawned using -P (here set to 6):

    # basename strips the directory as well, so we add data/seqs/ back for --in
    $ find data/seqs -name "*.fastq" | \
    xargs basename -s ".fastq" | \
    xargs -P 6 -I{} fastq_stat --in data/seqs/{}.fastq --out results/{}.txt
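
The effect of -P can be checked with a toy job. In this sketch, sleep 1 stands in for a slow program and the job numbers 1 to 4 are made up; timing is measured in whole seconds, so it is only approximate:

```shell
# Four 1-second jobs. With -P 4 they run at the same time, so the total
# wall time should stay well below the 4 seconds a serial run would need.
start=$(date +%s)
finished=$(
    printf '%s\n' 1 2 3 4 |
    xargs -P 4 -I{} sh -c 'sleep 1; echo "job $1 done"' _ {} |
    wc -l | tr -d ' '
)
end=$(date +%s)
elapsed=$((end - start))
echo "$finished jobs finished in about $elapsed s"
```

Lowering the number after -P caps how many jobs run at once, which is the control that the background-loop approach lacks.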
12 / 15

xargs and spaces in file names

As always, spaces in file names can trip you up, but you were promised a solution with find and xargs.

  • But out of the box, this does in fact fail:

    $ touch "samples A.txt" "samples B.txt"
    $ find . -name "samples [AB].txt" | xargs rm
    #> rm: cannot remove './samples': No such file or directory
    #> rm: cannot remove 'A.txt': No such file or directory
    #> rm: cannot remove './samples': No such file or directory
    #> rm: cannot remove 'B.txt': No such file or directory
  • The solution is to use -print0 with find, so it prints files with a "null byte" as the delimiter. Then, we also need to tell xargs about this, using -0:

    $ find . -name "samples [AB].txt" -print0 | xargs -0 rm -v
    #> removed './samples A.txt'
    #> removed './samples B.txt'
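
The difference can also be seen without deleting anything, by letting xargs call echo once per argument and counting the arguments. A sketch in a throwaway directory (this assumes the temporary directory path itself contains no spaces):

```shell
demo_dir=$(mktemp -d)
touch "$demo_dir/samples A.txt" "$demo_dir/samples B.txt"

# Default delimiters: each name is split at the space, giving 4 arguments
split_args=$(find "$demo_dir" -name "samples [AB].txt" \
    | xargs -n 1 echo | wc -l | tr -d ' ')

# Null-delimited: each file name stays in one piece, giving 2 arguments
intact_args=$(find "$demo_dir" -name "samples [AB].txt" -print0 \
    | xargs -0 -n 1 echo | wc -l | tr -d ' ')

echo "without -print0/-0: $split_args arguments; with: $intact_args arguments"
rm -r "$demo_dir"
```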
13 / 15

More parallelization

If you like the xargs approach to running multiple processes, you may also be interested in the parallel command (GNU Parallel), which is even more powerful than xargs.

I personally don't do this very much because I tend to submit separate batch jobs for each file, which can easily be done with simple for loops.

14 / 15

Questions?






15 / 15
