class:inverse middle center ## *Week 6: Shell scripting* ---- # Part III: <br> find and xargs <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/02/15 --- ## Two common (?) problems - Say you try to pipe `ls` output into the imaginary program `process_qc`: ```sh $ ls *.fq | process_fq ``` Like we've seen before, any spaces in file names would lead to trouble, since those file names would be erroneously split into separate arguments. -- - This problem wouldn't occur with direct globbing: ```sh $ process_fq *.fq ``` However, a different problem could occur here: if the number of globbed files is very high, you can get an error that the "*argument list is too long*". --- ## `find` and `xargs` **`find` and `xargs` can deal with both of these problems.** More generally: - `find` is useful for file searching (and processing) when you need more than what `ls` can do and/or are worried about spaces. - `xargs` is useful for multi-file and parallel processing without loops. <br> <br> **Let's create some files to try these commands:** ```sh $ mkdir -p zmays-snps/{data/seqs,scripts,analysis} $ touch zmays-snps/data/seqs/zmays{A,B,C}_R{1,2}.fastq ``` --- ## `find` basics - The basic syntax for `find` is: `find [path] [expression-1] [expression-2] ...`. - `find` will print all files by default, like `ls` – but unlike `ls`, it is recursive by default: ```sh $ find zmays-snps # zmays-snps is the path we are operating on #> zmays-snps/ #> zmays-snps/data #> zmays-snps/data/seqs #> zmays-snps/data/seqs/zmaysC_R1.fastq #> zmays-snps/data/seqs/zmaysA_R1.fastq #> zmays-snps/data/seqs/zmaysB_R1.fastq #> zmays-snps/data/seqs/zmaysB_R2.fastq #> zmays-snps/data/seqs/zmaysC_R2.fastq #> zmays-snps/data/seqs/zmaysA_R2.fastq #> zmays-snps/scripts #> zmays-snps/analysis $ ls zmays-snps #> analysis data scripts ``` --- ## `find` basics ```sh # Let's move into the zmays-snps dir: $ cd zmays-snps ``` <br> - Using the `-name` option, we can also *glob* like we would do with `ls`: ```sh $ find data/seqs -name "zmaysB*fastq" #> data/seqs/zmaysB_R1.fastq #> data/seqs/zmaysB_R2.fastq ``` --- ## `find` functionality beyond that of `ls` - We can use `-type` as an expression to limit results by file type: ```sh # "-type f" to only match regular files $ find data/seqs -name "zmaysB*fastq" -type f ``` -- <br> - In the previous command, the two expressions were implicitly connected with a logical *and*: find regular files with "_zmaysB*fastq_" in the name. We could also do this explicitly using the `-and` operator: ```sh $ find data/seqs -name "zmaysB*fastq" -and -type f ``` - Or we could connect expression with a logical or using the `-or` operator: ```sh $ find data/seqs -name "zmaysA*fastq" -or -name "zmaysC*fastq" ``` --- ## `find` functionality beyond that of `ls` - We can also negate expressions – but note that we need to quote the `!` so it doesn't get evaluated by the shell: ```sh # Find regular fies that do NOT match "zmaysC*fastq" $ find data/seqs -type f "!" -name "zmaysC*fastq" ``` - Say we have some files with *temp* in their names that we want to ignore: ```sh $ touch data/seqs/zmays{A,C}_R{1,2}-temp.fastq $ find data/seqs -type f "!" -name "zmaysC*fastq" \ -and "!" -name "*-temp*" ``` <br> -- .content-box-info[ But what does a `!` mean to the shell? It accesses past commands: ```sh # To re-execute your most recent command: $ !! ``` ] --- ## Running commands on `find`'s results - Now, we want to remove these temporary files, which we can do using `-exec` followed by our `rm` command (using `rm -i` for interactive removal): ```sh $ find data/seqs -name "*-temp.fastq" -exec rm -i {} \; ``` Here, `{}` is the placeholder for the file names we are passing on, and we have to indicate the end of our command with an escaped semicolon: `\;`. - `-exec` can be very useful to *process file queries* that are more complicated than what a simple globbing pattern allows for. -- .content-box-info[ With `-exec`, you can use any command to operate on the files that you found. But as it happens, because it is such a common operation to remove files found with `find`, there is a `-delete` shortcut for this: ```sh find zmays-snps/data/seqs -name "*-temp.fastq" -delete ``` ] --- ## `xargs` - `xargs` is to be read/pronounced as `x-args`, where "*args*" means "*arguments*". This function passes arguments, supplied via standard in, on to another command. For example: ```sh $ find data/seqs -name "*-temp.fastq" | xargs rm ``` - This is equivalent to the following, where `rm` receives the very same list of files as its arguments: ```sh $ rm data/seqs/*-temp.fastq ``` -- - However, now, we can bypass the *maximum number of arguments* to `rm`, because we can easily pass one argument at a time with `-n 1`: ```sh $ find data/seqs -name "*-temp.fastq" | xargs -n 1 rm ``` This command runs `rm` as many times as there are _"*-temp.fastq"_ files. --- ## `xargs` By using `xargs` instead of `find -exec`, we separate the process that finds the files we want to operate on, from the actual operations. This can be beneficial, for instance if we want to carefully inspect a list of files before we do anything: ```sh $ touch data/seqs/zmays{A,C}_R{1,2}-temp.fastq # We save the find results in a text file $ find data/seqs -name "*-temp.fastq" > files-to-delete.txt $ cat files-to-delete.txt #> data/seqs/zmaysC_R1-temp.fastq #> data/seqs/zmaysA_R1-temp.fastq #> data/seqs/zmaysA_R2-temp.fastq #> data/seqs/zmaysC_R2-temp.fastq # Now we pass that list of files to rm: $ cat files-to-delete.txt | xargs rm ``` --- ## Using `xargs` with replacement strings - We can also specify where we want to insert the arguments that are being passed on, as is often necessary: ```sh $ find data/seqs -name "*.fastq" | \ xargs basename -s ".fastq" | \ xargs -I{} fastq_stat --in {}.fastq --out results/{}.txt ``` Above, we use `-I{}` to specify that the arguments will be passed on with the placeholder `{}`. As you can see, we are also able to use `{}` multiple times. -- - This functionality is similar to what you may be inclined to do in a loop instead – the above is equivalent to: ```sh $ for $fastq_file in data/seqs/*fastq; do fastq_short=$(basename $fastq_file .fastq) fastq_stat --in $fastq_file --out results/$fastq_short.txt done ``` --- ## `xargs` and parallelization One advantage of using `xargs` instead of a `for` loop is that we have control over parallelization. - To parallelize, i.e. to spawn multiple simultaneous process, we can send each process to the background in the `for` loop using `&`: ```sh $ for $fastq_file in *fastq; do fastq_short=$(basename $fastq_file .fastq) fastq_stat --in $fastq_file --out results/$fastq_short.txt & done ``` However, this would spawn exactly as many processes as there are input files, which could be way more than what we want. -- - With `xargs`, we can control the number of simultaneous processes that are spawned using `-P` (here set to 6): ```sh $ find data/seqs -name "*.fastq" | \ xargs basename -s ".fastq" | \ xargs -P 6 -I{} fastq_stat --in {}.fastq --out results/{}.txt ``` --- ## `xargs` and spaces in file names As always, spaces in file names can trip you, but you were promised a solution with `find` and `xargs`. - But out of the box, this does in fact fail: ```sh $ touch "samples A.txt" "samples B.txt" $ find . -name "samples [AB].txt" | xargs rm #> rm: cannot remove './samples': No such file or directory #> rm: cannot remove 'A.txt': No such file or directory #> rm: cannot remove './samples': No such file or directory #> rm: cannot remove 'B.txt': No such file or directory ``` - The solution is to use `-print0` with `find`, so it prints files with a "null byte" as the delimiter. Then, we also need to tell `xargs` about this, using `-0`: ```sh $ find . -name "samples [AB].txt" -print0 | xargs -0 rm -v #> removed './samples A.txt' #> removed './samples B.txt' ``` --- ## More parallelization .content-box-info[ If you like the `xargs` approach to running multiple processes, you may also be interested in the `parallel` command, which is even more powerful than `xargs`. I personally don't do this very much because I tend to submit separate batch jobs for each file, which can easily be done with simple `for` loops. ] --- class: inverse middle center # Questions? ---- <br> <br> <br> <br>