class:inverse middle center # *Week 11 - Python: Regular expressions* ---- <br> <br> <br> <br> <br> ### Jelmer Poelstra ### 2021/03/30 --- ## Setup ```python import os # IN PYTHON username = os.environ.get('USER') os.chdir('/fs/ess/PAS1855/users/' + username + '/CSB/regex/sandbox') ``` <br> ### Snakemake installation for next week ```sh module load python/3.6-conda5.2 # IN THE SHELL source activate ipy-env which pip3 # Confirm "pip3" is located inside your ipy-env: #> ~/.conda/envs/ipy-env/bin/pip3 pip3 install snakemake ``` --- ## Overview of this week - Today: Regular expressions - No meeting on Thursday! --- ## What are regular expressions? > *Often, it is not feasible to search for all possible occurrences exactly as* > *they appear in the text, but you can describe the pattern you’re looking for* > *in your own words (e.g., find all words starting with 3 uppercase letters,* > *followed by 4 digits).* > ***The question is how to explain such a pattern to a computer.* > *The answer is to use regular expressions.*** > — CSB 5.1 In other words, regular expressions are *strings that include special symbols to succinctly describe patterns.* --- ## Regular expressions in the shell In this course, we have already worked with regular expressions a fair bit, using the tools `grep` and `sed` – for example: ```sh grep -v "^#" my.gff ``` -- Exclude comment/header lines: **`^#` matches a pound sign as the first character on a line** (and the `-v` option inverts `grep`'s behavior: return non-matching lines). -- ```sh echo "word1 word2" | sed -E 's/(\w+) (\w+)/\2 \1/' ``` Invert two words: - `\w+` matches one or more word characters (alphanumeric and `_`) but not the space that separates them. - We *capture* the two words with `()` and *refer back* to each using `\1` and `\2`. -- .content-box-green[ Using regular expressions in Python is much easier because we don't have to switch between basic/extended/Perl-like sets! ] --- ## Why use regular expressions? - **To collect information:** Match degenerate primer sequences, transcription factor binding sites, find accession and gene numbers, extract references from a manuscript. - **As a more sophisticated replace:** Replace multiple variations of a string at the same time. - **To navigate and parse text files:** Parse (semi-)structured output that doesn't adhere to formats for which you can use a specialized tool directly. --- ## The `re` module in Python - You may recall that we used the `find()` method earlier to search within strings, but **`find()` only matches literal strings**. - To use regular expressions, we will need to import the **`re` module**: ```python import re ``` --- ## Our first match - The main `re` function we will be using to learn about regular expression in Python is `re.search()`, which will return *the first match* it finds.<sup>[1]</sup> We start using it by searching for a *literal string*, after all: ```python # We define a string that we will query: my_string = "a given string" # We search my_string for the regex pattern "given": m = re.search("given", my_string) ``` .footnote[ <sup>[1]</sup> We'll see other functions that return all matches later. ] --- ## Match objects - `re.search()` function returns a *match object*: ```python m #> <re.Match object; span=(2, 7), match='given'> ``` - On the match object, we can use the **`group()`** method to extract the substring that was found to match our pattern: ```python m.group() #> 'given' # We matched a literal string so no surprises here ``` -- - We can also get the start and end positions (indices) of the match: ```python m.start() #> 2 m.end() #> 7 ``` --- ## Failing to match - If no matches are found, `re.search()` returns `None`: ```python m = re.search("9", my_string) print(m) #> None ``` <br> -- - Since `None` is "**falsy**" (it will be interpreted as `False`), we can create a straightforward test to see if a match was made<sup> [1]</sup>: ```python if re.search("9", my_string): # Additional processing if a match was found ``` .footnote[ <sup>[1]</sup> This is a little easier than what we did last week with the `str.find()` method, which returns `-1` if not match is found) ] --- ## Backslash plague! Python uses backslashes `\` to give special meaning to some characters: ```python print("my long \n line") #> my long #> line ``` If we need to **include a literal backslash** in a string, we have to escape the backslash, which we do with another backslash: ```python print("my long \\n line") #> my long \n line ``` -- This can get annoying, especially with regular expressions, **which use `\` for similar purposes as Python does**: so to match a literal backslash, we need to create a regular expression string that contains two backslashes `\\`. However, we just saw that Python turns `\\` into `\`, so in fact, we need **four** backslashes to match a single one! *This is called the "backslash plague".* ```python re.search('\\\\', "my long \\n line") #> <re.Match object; span=(8, 9), match='\\'> ``` --- ## "Raw string" notation To escape this plague, Python provides raw string notation, which **turns off its native backslash interpretation**: ```python print(r"my long \n line") #> my long \n line re.search(r"\\n", r"my long \n line") # <re.Match object; span=(8, 10), match='\\n'> ``` So, the syntax for raw strings is to simply put an `r` before the string. <br> .content-box-info[ **From here on, we'll consistently use the raw string notation in regular expressions searches.** ] --- class: center middle inverse # Components of regular expressions ---- <br> <br> <br> --- ## Literal characters Literal characters can be included in regular expressions, and as we have seen, a search pattern in `re` can consist of *only* literal characters, but this is not very powerful. --- ## Metacharacters | Symbol | Negation | Matches: |---------------|-----------|-------- | `\n` | | A newline. | `\t` | | A tab. | `\s` | `\S` | Any / anything but white space: space, tab, newline, carriage return. | `\w` | `\W` | Any / anything but a word character: alphanumeric and underscore. | `\d` | `\D` | Any / anything but a digit. | `.` | | Any character. --- ## Match whitespace - For example, to match a single whitespace character, we can use `\s`: ```python # my_string = "a given string" re.search(r"\s", my_string) #> <re.Match object; span=(1, 2), match=' '> ``` - Or we can match a whitespace character followed by five "word" characters (alphanumeric and underscore): ```python re.search(r"\s\w\w\w\w\w", my_string) #> <re.Match object; span=(1, 7), match=' given'> ``` .content-box-info[ Note again that `re.search()` only returns the first match. We'll see two functions later that will return all matches. ] --- ## Character classes You hopefully recall **character classes** from the shell, where we used constructs like: - `[0-9]` to represent any single digit - `[ABZ]` to represent either an A, a B, or a Z (and not "*ABZ*"!). Character classes function the same in Python (though note that they are referred to as "*sets*" in the CSB book). --- ## Character classes (cont.) - For example, we can search for a word that starts with a *lower- **or** uppercase `s`*, followed by two word (`\w`) characters: ```python my_string = "sunflowers are described on page 89" re.search(r"[sS]\w\w", my_string) #> <re.Match object; span=(0, 3), match='sun'> ``` --- ## Character classes (cont.) - Or we can search for a number consisting of two digits between 7 and 9: ```python # my_string = "sunflowers are described on page 89" re.search(r"[7-9][7-9]", my_string) #> <re.Match object; span=(33, 35), match='89'> ``` -- .content-box-q[ What would happen if `my_string` contained `898` instead of `89`? ] -- .content-box-answer[ The match would be exactly the same. ] -- .content-box-q[ What if we did not want to match such a number with three digits? ] -- .content-box-answer[ ```python re.search(r"[7-9][7-9]\D", my_string) # Or: re.search(r"[7-9][7-9][^0-9]", my_string) ``` ] --- ## Negating a character class We can negate a character class by using a caret `^` as the first symbol within the square brackets. For example, `[^s-z]` matches anything but a letter from s-z in the alphabet: ```python # my_string = "sunflowers are described on page 89" re.search(r"[^s-z]\w\w\w\w\w\w", my_string) #> <re.Match object; span=(2, 9), match='nflower'> ``` --- ## Quantifiers Instead of typing `\w` five times, like above, we can use *quantifiers* to match multiple consecutive characters more succinctly and flexibly: <br> | Quantifier | Matches | |-------------|---------| | **`*`** | Preceding character *any number of times* (including 0). | **`?`** | Preceding character *at most* once. | **`+`** | Preceding character *at least* once. | **`{n}`** | Preceding character *exactly `n` times*. | **`{n,}`** | Preceding character *at least `n` times*. | **`{n,m}`** | Preceding character *at least `n` and at most `m` times*. --- ## Using quantifiers - From some text, we could attempt to extract a DNA sequence: at least three capital A/C/G/T using `[ACGT]{3,}`: ```python re.search(r"[ACGT]{3,}", "the motif ATTCGT").group() 'ATTCGT' ``` --- ## Alternations A symbol we have also already seen is the `|` for *alternation*, functioning as a logical or: ```python re.search(r"cat|mouse", "I found my cat!") #> <re.Match object; span=(11, 14), match='cat'> re.search(r"cat|mouse", "I found my mouse!") #> re.search(r"cat|mouse", "I found my mouse!") ``` --- ## Greedy vs. non-greedy matching The quantifiers `?`, `*` and `+` are said to be "**greedy**". -- This means that when faced with multiple possible endpoints of a match, they will use the last possible endpoint. In other words, they will **match as much text as possible**. <br> For example, when matching any number of characters followed by a white space below – instead of matching `"once "`, we get: ```python my_string = "once upon a time" re.search(r".*\s", my_string).group() #> 'once upon a ' ``` -- Appending a question mark to the quantifier (i.e., `??`, `*?`, or `+?`) makes the quantifier **"reluctant" or "non-greedy"** – it will match as little as possible: ```python re.search(r".*?\s", "once upon a time").group() #> 'once ' ``` --- ## Anchors Often, we want to restrict matching to the beginning or the end of a string (or equivalently, of a *line* in a file): Anchor | Matches -------|-------- `^` | Beginning of the string/line (Note re-usage: **cf. `[^0-9]`**) `$` | End of the string/line <br> For example: ```python my_string = "ATATA" re.search(r"^TATA", my_string) #> re.search(r"TATA$", my_string) #> <re.Match object; span=(1, 5), match='TATA'> ``` --- ## <i class="fa fa-user-edit"></i> Intermezzo 5.1 Describe the following regular expressions in plain English. And what does it actually match in the `re.search()` call? ```python re.search(r"\d" , "it takes 2 to tango").group() ``` -- Matches one digit: **`2`**. -- <br> ```python re.search(r"\s\w*\s", "once upon a time").group() ``` -- Matches any sequence of word characters (zero or more), flanked by a white space on both sides: **`" upon "`**. <br> -- ```python re.search(r"\s\w*$", "once upon a time").group() ``` -- Matches the last word in the target string (preceded by a white space): **`" time"`**. --- ## What if I need to match a literal `*`? Sometimes, you need to **literally match a special character**, like a `*`. That is, you want to take away ("escape") its special meaning. How can we do this? -- Just like escaping backslashes, we can escape any other character's special meaning **with a backslash**: ```python re.search(r'*', 'my*string') #> error: nothing to repeat at position 0 re.search(r'\*', 'my*string') #> <re.Match object; span=(2, 3), match='*'> ``` .content-box-info[ We also saw this in the shell where one way to match a file name with a space, other than quoting, is to escape the space: ```sh $ ls my\ bad\ file #> 'my bad file' ``` ] --- ## Other functions of the *re* module Function | Behavior -------------|--------- `re.findall(reg,target)` | As `re.search`, but return a **list** of all the matches. `re.finditer(reg,target)` | As `re.findall`, but return an **iterator**. `re.compile(reg)` | Compile a regular expression. <br> This stores the pattern for repeated use, improving the speed. `re.split(reg,target)` | Split the text by the occurrence of the pattern described by the regular expression. `re.sub(reg,repl,target)` | Substitute each non-overlapping occurrence of the match with the text in `repl`. --- ## Example: Querying a FASTA file Say that we want to find methylation sites in nucleotide sequences from *Escherichia coli* that are stored in the file `CSB/regex/data/Ecoli.fasta`. There are several methylases in *E. coli*. The *EcoKI* methylase, for example, modifies the sequences `AACNNNNNNGTGC` *and* `GCACNNNNNNGTT`. -- Let's take a look at the file: ```python fafile = "../data/Ecoli.fasta" !head -5 {fafile} #> >gi|556503834|ref|NC_000913.3|:1978338-2028069 Escherichia coli str. K-12 substr. #> MG1655, complete genome #> AATATGTCCTTACAAATAGAAATGGGTCTTTACACTTATCTAAGATTTTTCCTAAATCGACGCAACTGTA #> CTCGTCACTACACGCACATACAACGGAGGGGGGCTGCGATTTTCAATAATGCGTGATGCAGATCACACAA #> AACACTCAATTACTTAACATAAATGGGTAATGACTCCAACTTATTGATAGTGTTTTATGTTCAGATAATG #> CCCGATGACTTTGTCATGCAGCTCCACCGATTTTGAGAACGACAGCGACTTCCGTCCCAGCCGTGCCAGG !grep ">" {fafile} #> >gi|556503834|ref|NC_000913.3|:1978338-2028069 Escherichia coli str. K-12 substr. MG1655, complete genome #> >gi|556503834|ref|NC_000913.3|:4035299-4037302 Escherichia coli str. K-12 substr. MG1655, complete genome ``` --- ## Example: Querying a FASTA file (cont.) Let's read the FASTA file, and take the sequence for the first of the two records in the file: ```python from Bio import SeqIO genes = list(SeqIO.parse(fafile, "fasta")) seq1 = str(genes[0].seq) ``` ```python len(seq1) #> 49732 seq1[:40] #> 'AATATGTCCTTACAAATAGAAATGGGTCTTTACACTTATC' ``` <br> <br> .content-box-info[ Note, we used with `SeqIO` instead of `pyfaidx` (as in CSB) here to simplify the sequence parsing and not introduce a new module. ] --- ## Example: Querying a FASTA file (cont.) Since our regular expression is quite complicated, we will first save it using the `re.compile()` function: ```python EcoKI = re.compile(r"AAC[ATCG]{6}GTGC|GCAC[ATCG]{6}GTT") # Matches: # AACNNNNNNGTGC # GCACNNNNNNGTT ``` --- ## Example: Querying a FASTA file (cont.) Also, we will need a different `re` function to get *all* matches. Let's first try **`re.findall()`**: ```python re.findall(EcoKI, seq1) #> ['AACAGCATCGTGC', 'AACTGGCGGGTGC', 'GCACCACCGCGTT', #> 'GCACAACAAGGTT', 'GCACCGCTGGGTT', 'AACCTGCCGGTGC'] ``` That simply returned a list of matches, but we want the *positions* too! -- Instead, we can use **`re.finditer()`**, which will get us match objects like we got for `re.search()` earlier, that we can iterate over: ```python for match in re.finditer(EcoKI, seq1): print(match.start() + 1, match.group()) #> 18452 AACAGCATCGTGC #> 18750 AACTGGCGGGTGC #> 25767 GCACCACCGCGTT #> 35183 GCACAACAAGGTT #> 40745 GCACCGCTGGGTT #> 42032 AACCTGCCGGTGC ``` --- ## Groups and backreferences in regular expressions Using parentheses, we can create **groups** in regular expressions. We have already seen the use of such groups with backreferences in `sed`, and another feature of groups is that **quantifiers will operate on the entire group**: ```python re.search(r"GT{2}", "ATGGTGTCCGTGTT").group() #> 'GTT' re.search(r"(GT){2}", "ATGGTGTCCGTGTT").group() #> 'GTGT' ``` --- ## Example: retrieving the sequence between primers Say that we want to identify the variable region that is flanked by the bacterial ribosomal RNA primers 799F and 904R. We can build a regular expression with the two primers, each of which we capture with `()`, and the unknown sequence which we also capture: ```python primer1 = 'AAC[AC]GGATTAGATACCC[GT]G' primer2 = '[CT]T[AG]AAACTCAAATGAATTGACGGGG' regpat = '(' + primer1 + ')' + '([ATCG]+)' + '(' + primer2 + ')' regpat #> '(AAC[AC]GGATTAGATACCC[GT]G)([ATCG]+)([CT]T[AG]AAACTCAAATGAATTGACGGGG)' ``` -- We will search for this pattern in the second sequence of `Ecoli.fasta`: ```python # from Bio import SeqIO # genes = list(SeqIO.parse("../data/Ecoli.fasta", "fasta")) seq2 = str(genes[1].seq) m = re.search(regpat, seq2) ``` --- ## Example: retrieving the sequence between primers Thanks to our groupings, we can now refer back to these separately: - The entire match with `group(0)`: ```python len(m.group(0)) #> 148 ``` - Our forward and reverse primers with `group(1)` and `group(3)` ```python m.group(1) #> 'AACAGGATTAGATACCCTG' m.group(3) #> 'TTAAAACTCAAATGAATTGACGGGG' ``` - And the sequence of interest with `group(2)`: ```python m.group(2) #> 'GTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGG' len(m.group(2)) #> 104 ``` --- class: center middle inverse # Questions? ----- <br> <br> <br> <br> --- ## Example: finding names of MHC alleles in BLAST results The file `Marra2014_BLAST_data.txt` has some BLAST results in tabular format. We are looking for MHC alleles and we happen to know that their names are unique in containing an asterisk `*`. We start by moving to the right directory and peeking at the file: ```python import os os.chdir('/fs/ess/PAS1855/users/<user>/CSB/regex/sandbox') !head ../data/Marra2014_BLAST_data.txt #> Seq. Name Seq. Description Seq. Length #Hits min. eValue mean Similarity #> contig00001 ---NA--- 112 0 #> contig00002 pol2_mouse ame: full=retrovirus-related pol polyprotein line-1 ame: full=long interspersed element-1 short=l1 includes: ame: full=reverse transcriptase includes: ame: full=endonuclease 932 3 6.53E-113 79.67% #> contig00003 lin1_nycco ame: full=line-1 reverse transcriptase homolog 2070 5 0 59.60% #> contig00004 lin1_nycco ame: full=line-1 reverse transcriptase homolog 3074 5 6.92E-68 64.80% #> contig00005 ---NA--- 190 0 ``` --- ## Example: finding names of MHC alleles in BLAST results Let's first find, print and count lines with an `*`: ```python with open("../data/Marra2014_BLAST_data.txt") as f: counter = 0 for line in f: m = re.search(r"\*", line) # Search for '*' in each line if m: # If a match was found: counter += 1 # Increment counter print(line) # Print the line #> contig01987 2b14_human ame: full=hla class ii histocompatibility drb1-4 beta chain ame: full=mhc class ii antigen drb1*4 short=dr-4 short=dr4 flags: precursor 112 5 5.07E-10 78.60% #> contig05816 1c08_human ame: full=hla class i histocompatibility cw-8 alpha chain ame: full=mhc class i antigen cw*8 flags: precursor 825 5 1.06E-97 67.80% #> contig05821 1b46_human ame: full=hla class i histocompatibility b-46 alpha chain ame: full=bw-46 ame: full=mhc class i antigen b*46 flags: precursor 606 5 9.86E-09 67.60% #> contig22120 2b11_human ame: full=hla class ii histocompatibility drb1-1 beta chain ame: full=mhc class ii antigen drb1*1 short=dr-1 short=dr1 flags: precursor 137 5 5.05E-11 76.80% #> contig23154 mmrn1_human ame: full=multimerin-1 ame: full=emilin-4 ame: full=elastin microfibril interface located protein 4 short=elastin microfibril interfacer 4 ame: full=endothelial cell multimerin contains: ame: full=platelet glycoprotein ia* contains: ame: full=155 kda platelet multimerin short=p-155 short=p155 flags: precursor 4524 5 0 62.00% #> contig27667 ud13_human ame: full=udp-glucuronosyltransferase 1-3 short=udpgt 1-3 short=ugt1*3 short=ugt1-03 short= ame: full=udp-glucuronosyltransferase 1-c short=ugt-1c short=ugt1c ame: full=udp-glucuronosyltransferase 1a3 flags: precursor 1818 5 0 84.40% ``` <br> ```python print("The pattern was matched in", counter, "lines") #> The pattern was matched in 6 lines ``` --- ## Example: finding names of MHC alleles in BLAST results ```sh #> [...] full=mhc class ii antigen drb1*1 short=dr-1 [...] #> [...] full=mhc class i antigen cw*8 flags: precursor 606 [...] ``` From looking at those lines, if we want to extract just the names of the MHC alleles, we should start matching at `mhc` and stop matching after the *word* that contains the asterisk `*`. So, our regex could be: `mhc[\s\w*]+\*\w*`: Component | Meaning -------------|--------- `mhc` | Match the literal characters `mhc`. `[\s\w*]+` | Match a white space or zero or more word characters, one or more times. `\*` | Match a (literal) asterisk. `\w*` | Match zero or more word characters. --- ## Example: finding names of MHC alleles in BLAST results Let's try it: ```python with open("../data/Marra2014_BLAST_data.txt") as f: for line in f: m = re.search(r"mhc[\s\w*]+\*\w*", line) if m: print(m.group()) #> mhc class ii antigen drb1*4 #> mhc class i antigen cw*8 #> mhc class i antigen b*46 #> mhc class ii antigen drb1*1 ``` --- ## <i class="fa fa-user-edit"></i> More intermezzo 5.1 ```python re.search(r"\s\w{1,3}\s", "once upon a time").group() ``` -- Matches a sequence of one to three word characters, flanked by two white spaces: `" a "`. ```python re.search(r"\w*\s\d.*\d", "take 2 grams of H2O").group() ``` -- Matches zero or more word characters (`\w*`), followed by a white space (`\s`), followed by a digit (`\d`), zero or more characters (`.*`), and ending with a digit (`\d`): `take 2 grams of H2`. --- ## <i class="fa fa-user-edit"></i> Intermezzo 5.2 Let’s practice translating from plain English to regular expressions. The NCBI GenBank contains information on nucleotide sequences, protein sequences, and whole genome sequences (WGS). The following table describes the construction of sequence identifiers in plain English. Construct the appropriate regular expression to match either protein, WGS or nucleotide IDs: 1. **Protein**: 3 letters + 5 numerals 2. **WGS**: 4 letters + 2 numerals for WGS assembly version + 6–8 numerals 3. **Nucleotide**: 1 letter + 5 numerals OR 2 letters + 6 numerals --- ## <i class="fa fa-user-edit"></i> Intermezzo 5.2: Solutions 1. ```python r"[A-Za-z]{3}\d{5}" ``` <br> 2. ```python r"[A-Za-z]{4}\d{8,10}" ``` <br> 3. ```python r"([A-Z]{1}\d{5}|[A-Z]{2}\d{6})" ```