Regular Expressions RegEx RegExes Regexp

reg(ular expressions?|ex(p|es)?)

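Search for ‘reg’ followed by one of two alternatives: ‘ular expression’ or ‘ex’. Each question mark makes the item before it optional: the first allows an ‘s’ at the end of ‘expression’, the second allows ‘p’ or ‘es’ at the end of ‘ex’. The pattern is written in lowercase, so it only matches the sample above with case-insensitive matching switched on.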
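A minimal sketch of the pattern above, assuming Python and its standard re module (the notes do not name a language, so that choice is an assumption). It runs the pattern over the sample line with case-insensitive matching.

```python
import re

sample = "Regular Expressions RegEx RegExes Regexp"
pattern = re.compile(r"reg(ular expressions?|ex(p|es)?)", re.IGNORECASE)

# finditer + group(0) returns the full text of each match;
# findall would return the contents of the capturing groups instead.
matches = [m.group(0) for m in pattern.finditer(sample)]
print(matches)  # ['Regular Expressions', 'RegEx', 'RegExes', 'Regexp']
```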

gray grey griy gr0y graey graAy gra0y gry

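/gr[a-zA-Z0-9]?y/ - match any word that starts with ‘gr’, ends with ‘y’ and has at most one letter (lowercase or uppercase) or digit in between. The question mark makes the character class optional, so ‘gry’ still matches, while ‘graey’, ‘graAy’ and ‘gra0y’ do not because they have two characters in between. A space inside a character class is a literal space, so the ranges are written without spaces.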
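As a quick check, here is a sketch using Python's re module (an assumed choice of language) that runs the optional character class over the sample line. The ranges are written without spaces, because a space inside a class would match a literal space.

```python
import re

sample = "gray grey griy gr0y graey graAy gra0y gry"

# At most one letter or digit is allowed between 'gr' and 'y'.
optional_char = re.compile(r"gr[a-zA-Z0-9]?y")
matches = optional_char.findall(sample)
print(matches)  # ['gray', 'grey', 'griy', 'gr0y', 'gry']
```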

gr*y gr+y gr

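gr[^a-zA-Z0-9]?y - uses a caret ‘^’ straight after the opening bracket to negate the character class. In this instance ‘gr*y’ and ‘gr+y’ are highlighted, as the character between ‘gr’ and ‘y’ is neither a letter nor a digit. ‘gry’ from the earlier sample also matches, because the optional class may match nothing at all.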

gr[*+]y - most metacharacters do not need to be escaped within character classes, but a backslash must be escaped with a, hmm, backslash.

gr[\\*+]y - matches a backslash, star or plus between ‘gr’ and ‘y’. If one needs a literal caret ^ in a character class, either place it anywhere other than its usual position (straight after the opening bracket) or escape it.
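The three character-class rules above (negation, literal metacharacters, escaping the backslash) can be sketched as follows. This assumes Python's re module; the extra test strings ‘gry gray’, ‘gr\y’ and ‘gr^y’ are made up for illustration.

```python
import re

sample = "gr*y gr+y gr"

# Negated class: at most one character that is NOT a letter or digit.
negated = re.compile(r"gr[^a-zA-Z0-9]?y")
negated_hits = negated.findall(sample)       # ['gr*y', 'gr+y']
gry_hits = negated.findall("gry gray")       # ['gry']

# Star and plus are literal inside a class; the backslash still needs escaping.
symbol_hits = re.findall(r"gr[*+]y", sample)             # ['gr*y', 'gr+y']
backslash_hits = re.findall(r"gr[\\*+]y", r"gr\y gr*y")  # ['gr\\y', 'gr*y']

# A caret that is not first in the class is an ordinary character.
caret_hits = re.findall(r"gr[*^]y", "gr^y gr*y")         # ['gr^y', 'gr*y']
```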

833337 | 837 | 222 | AWordwithan_underscore

[0-9]+ - matches every run of one or more digits

([0-9])\1+ - matches a digit that is repeated at least once, such as ‘3333’ and ‘222’

\w+ - matches word characters, which include digits and underscores in Javascript. What is considered a word character is dependent on the engine/language used

\D - is equivalent to [^\d] (any non-digit)

\W - is equivalent to [^\w] (any non-word character)

\S - guess! (It is [^\s], any non-whitespace character.)

.+ - the dot matches any single character except line breaks. Add a plus sign and it matches everything up to the end of the line, so a paragraph without line breaks is matched whole!
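A short sketch of these shorthand classes, assuming Python's re module (where \w also covers digits and underscores). The two-line string used for the dot is a made-up example.

```python
import re

sample = "833337 | 837 | 222 | AWordwithan_underscore"

digit_runs = re.findall(r"[0-9]+", sample)   # ['833337', '837', '222']

# The backreference \1 must repeat the digit captured by the group.
repeated = [m.group(0) for m in re.finditer(r"([0-9])\1+", sample)]  # ['3333', '222']

words = re.findall(r"\w+", sample)  # ['833337', '837', '222', 'AWordwithan_underscore']

# The dot stops at line breaks, so each line is matched separately.
lines = re.findall(r".+", "line one\nline two")  # ['line one', 'line two']
```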

02/32/03 | 021203 | 02512503 | 02-12-03

\d\d.\d\d.\d\d - using a dot to match a date string with a variable separator. It may seem OK to use a dot as a catch-all for the separator, but as you can see it matches ANY character between the numbers, so ‘02512503’ is highlighted too.

[01]\d[-/.][0-3]\d[-/.]\d\d - to match a date you would also have to take into consideration dates that cannot exist, e.g. 50/12/03. Here we specify that the first character can only be 0 or 1, followed by another digit (which still lets a month of ‘19’ through), and that the separator must be a hyphen, slash or dot. Unfortunately this pattern falls down at the next stage. The day allows 0-3 as its first digit, meaning a day of ‘39’ could be valid. Let’s return to this later.
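To see both date patterns at work, here is a sketch assuming Python's re module. The separator class is written as [-/.] without spaces, and the result shows the weakness described above: a day of 32 is still accepted.

```python
import re

sample = "02/32/03 | 021203 | 02512503 | 02-12-03"

# Dot as separator: any character is accepted, including the digit 5.
loose = re.compile(r"\d\d.\d\d.\d\d")
loose_hits = loose.findall(sample)        # ['02/32/03', '02512503', '02-12-03']

# Explicit separators and restricted leading digits.
stricter = re.compile(r"[01]\d[-/.][0-3]\d[-/.]\d\d")
stricter_hits = stricter.findall(sample)  # ['02/32/03', '02-12-03']

# A month of 50 is rejected, but the impossible day 32 above still gets through.
rejects_month_50 = stricter.search("50/12/03") is None  # True
```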

abc | 89790983

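^a - a caret in this position is referred to as an anchor. It pins the match to the start of the string, so this only matches a string that starts with an ‘a’.

c$ - like the caret above, the dollar sign is an anchor. It pins the match to the end of the string, so this matches a string that ends with a ‘c’.

^\s+$ - matches a string made up entirely of whitespace. To match leading and trailing whitespace, use ^\s+|\s+$ instead.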
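A small sketch of the anchors, assuming Python's re module. It also includes ^\s+|\s+$, one common way to strip leading and trailing whitespace. The padded strings are made-up examples.

```python
import re

starts_with_a = re.search(r"^a", "abc") is not None          # True
ends_with_c = re.search(r"c$", "abc") is not None            # True
digits_start_a = re.search(r"^a", "89790983") is not None    # False

# ^\s+$ needs the WHOLE string to be whitespace.
only_whitespace = re.search(r"^\s+$", "   \t") is not None   # True
padded_word = re.search(r"^\s+$", "  abc  ") is not None     # False

# Leading OR trailing whitespace, removed with a substitution.
trimmed = re.sub(r"^\s+|\s+$", "", "  abc  ")                # 'abc'
```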

This island is beautiful

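\bis\b - \b denotes a word boundary, an anchor that functions like a caret or dollar sign but at the edges of a word. Using it allows one to perform a whole-word search: only the standalone ‘is’ matches, not the ‘is’ in ‘This’ or ‘island’.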
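The difference a word boundary makes can be sketched like this, assuming Python's re module:

```python
import re

sample = "This island is beautiful"

# Without boundaries, 'is' is also found inside 'This' and 'island'.
plain_count = len(re.findall(r"is", sample))                         # 3

# With boundaries, only the standalone word matches (at index 12).
whole_word_starts = [m.start() for m in re.finditer(r"\bis\b", sample)]  # [12]
```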

Cat Dog Mouse Fish

cat|dog - alternation, match either cat or dog using a pipe

cat|dog|mouse|fish - add more alternatives to the mix

\b(cat|dog)\b - search for a whole word of either cat or dog between word boundaries. One must use parentheses around the alternatives, otherwise the search will look for \bcat OR dog\b.

SetValue Set Get GetValue

\b(get|set)(value)?\b - search for function names that start with either ‘get’ or ‘set’ and are optionally followed by ‘value’. The whole expression is surrounded by word boundaries.
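Here is a sketch of why the parentheses matter, assuming Python's re module with case-insensitive matching. The words ‘catalog’ and ‘bulldog’ and the function-name string are illustrative additions, not part of the original samples.

```python
import re

text = "Cat Dog Mouse Fish catalog bulldog"  # last two words are hypothetical extras

def full_matches(pattern, string):
    """Return the full text of every match, ignoring capturing groups."""
    return [m.group(0) for m in re.finditer(pattern, string, re.IGNORECASE)]

# Without parentheses the alternation splits the whole pattern: \bcat OR dog\b.
ungrouped = full_matches(r"\bcat|dog\b", text)   # ['Cat', 'Dog', 'cat', 'dog']

# With parentheses both boundaries apply to both alternatives.
grouped = full_matches(r"\b(cat|dog)\b", text)   # ['Cat', 'Dog']

# 'get' or 'set', optionally followed by 'value'.
names = full_matches(r"\b(get|set)(value)?\b", "SetValue Set Get GetValue")
# ['SetValue', 'Set', 'Get', 'GetValue']
```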

colour color creamcolour Nov November

colou?r - not sure if the text received is from a Briton or an American? No problem. The question mark makes the preceding ‘u’ optional.

nov(ember)? - the month may(sic) be written in shorthand or in full. Put the optional part in a group and follow it with a question mark to grab both.

colou{0,1}r - an alternative way to write an optional expression using curly braces
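A sketch of the optional forms, assuming Python's re module. The month pattern is run case-insensitively because the sample capitalises ‘Nov’.

```python
import re

sample = "colour color creamcolour Nov November"

# The 'u' is optional; the match inside 'creamcolour' counts too.
spellings = re.findall(r"colou?r", sample)        # ['colour', 'color', 'colour']

# {0,1} means exactly the same as ?.
with_braces = re.findall(r"colou{0,1}r", sample)  # ['colour', 'color', 'colour']

# An optional group: 'nov' with or without 'ember'.
months = [m.group(0) for m in re.finditer(r"nov(ember)?", sample, re.IGNORECASE)]
# ['Nov', 'November']
```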

<[a-zA-Z][a-zA-Z0-9]*> - match html tags using the star sign. The star repeats the second character class zero or more times. This is important because a single-letter tag such as ‘<p>’ is matched entirely by the first character class, leaving nothing for the second. The angled brackets have no special meaning so do not need to be escaped.

</?[a-zA-Z][a-zA-Z0-9]*> - add an optional forward slash (with a question mark) to get closing tags too!
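The two tag patterns can be tried out with a sketch like the one below, assuming Python's re module. The HTML snippet is a made-up sample, and the character classes are written without spaces.

```python
import re

html = "<p>Some <b>bold</b> text</p><h1>Title</h1>"  # hypothetical sample

# Opening tags only: one letter, then zero or more letters or digits.
opening = re.findall(r"<[a-zA-Z][a-zA-Z0-9]*>", html)
# ['<p>', '<b>', '<h1>']

# The optional slash lets closing tags match as well.
all_tags = re.findall(r"</?[a-zA-Z][a-zA-Z0-9]*>", html)
# ['<p>', '<b>', '</b>', '</p>', '<h1>', '</h1>']
```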

1000 | 9999 | 100 | 99999

\b[1-9][0-9]{3}\b - to limit the number of repetitions one can use curly braces. The format is {min,max}, or {n} for an exact count. The example here searches for a digit between 1 and 9 followed by exactly 3 additional digits between 0 and 9. This allows us to find any number between 1000 and 9999. The word boundaries ensure whole numbers.

\b[1-9][0-9]{2,4}\b - this expression will match any number between 100 and 99999. There must be a minimum of 2 additional digits and a maximum of 4, giving the range.
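Both quantifier forms can be checked against the sample line with a sketch like this, assuming Python's re module:

```python
import re

sample = "1000 | 9999 | 100 | 99999"

# Exactly three digits after the first: 1000 to 9999.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", sample)    # ['1000', '9999']

# Two to four digits after the first: 100 to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", sample)
# ['1000', '9999', '100', '99999']
```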

This is a <EM>first</EM> test

<.+> - if we want to grab html tags using a dot and a plus sign, the regex engine will carry on matching every character after the first angled bracket, including the closing bracket and the rest of the string. It then proceeds to ‘backtrack’ until it finds the last closing bracket, giving us a value of ‘<EM>first</EM>’. Not what we want.

<.+?> - adding a question mark after the plus swaps ‘greediness’ for ‘laziness’. A lazy quantifier tells the engine to match as little as possible. In this case it matches the first character after the opening angled bracket, then checks for the closing angled bracket, taking one more character each time that check fails. This gives us ‘<EM>’.

<[^>]+> - instead of using an expression that backtracks, which can hurt performance, one can use a negated character class. This expression matches an opening bracket, then every character that is not a closing angled bracket, then the closing bracket itself.
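Greedy, lazy and negated-class matching can be compared side by side. This is a sketch assuming Python's re module:

```python
import re

sample = "This is a <EM>first</EM> test"

# Greedy: runs to the end of the string, then backtracks to the LAST '>'.
greedy = re.findall(r"<.+>", sample)       # ['<EM>first</EM>']

# Lazy: stops at the FIRST '>' it can reach.
lazy = re.findall(r"<.+?>", sample)        # ['<EM>', '</EM>']

# Negated class: cannot run past a '>', so it gives the same tags.
negated = re.findall(r"<[^>]+>", sample)   # ['<EM>', '</EM>']
```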

color=red color=blue color=green axaxax bxbxbx cxcxcx

color=(?:red|blue|green)? - capturing groups, denoted by parentheses, can be converted into non-capturing groups by adding a question mark and a colon to the start of the group. The question mark after the closing parenthesis is separate: it makes the whole group optional.

([a-z])x\1x\1x - using a backslash and a number allows one to reference capturing groups. This can be used to reuse whatever a group matched later in the same expression. The numbers follow the order of the opening parentheses from left to right, so the first capturing group from the left will be \1 and so forth. In this example we capture any single letter between a and z and then require that same letter twice more with the \1 reference.
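To illustrate capturing versus non-capturing groups and the \1 backreference, here is a sketch assuming Python's re module. In that module, findall returns the group contents when a pattern has a capturing group, which makes the difference visible. The string ‘axbxcx’ is a made-up counterexample.

```python
import re

sample = "color=red color=blue color=green axaxax bxbxbx cxcxcx"

# Non-capturing group: findall returns the whole match.
non_capturing = re.findall(r"color=(?:red|blue|green)?", sample)
# ['color=red', 'color=blue', 'color=green']

# Capturing group: findall returns only what the group captured.
capturing = re.findall(r"color=(red|blue|green)?", sample)
# ['red', 'blue', 'green']

# \1 must repeat the exact letter captured by the first group.
triples = [m.group(0) for m in re.finditer(r"([a-z])x\1x\1x", sample)]
# ['axaxax', 'bxbxbx', 'cxcxcx']
mixed = re.search(r"([a-z])x\1x\1x", "axbxcx")  # None: the letters differ
```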

Testing <B class="king"><I class="ming">bold italic</I></B>

<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1> - this scary looking pattern grabs an opening tag, its matching closing tag and all the content in between. We’re not quite ready to fully understand this but stick with it. The [^>]* advances through the opening tag and so allows for tag attributes; being starred, it doesn’t need to match anything. The \1 makes the closing tag repeat the name captured from the opening tag. When we’re ready we can write this stuff from memory.
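A sketch of the tag-pair pattern, assuming Python's re module, with the sample written using straight quotes and the character class written without a space. The lazy .*? skips over ‘</I>’ because \1 insists on the captured name ‘B’.

```python
import re

sample = 'Testing <B class="king"><I class="ming">bold italic</I></B>'

pair = re.compile(r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>")
match = pair.search(sample)

tag_name = match.group(1)   # 'B'
whole = match.group(0)      # everything from '<B class="king">' to '</B>'
print(tag_name, whole)
```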

The The

\b(\w+)\s+\1\b - using capturing groups we can remove repeated words. Here we search for a word, followed by at least one whitespace character and then the same word again, all between word boundaries. Using a ‘find and replace’ in a text editor allows one to simply replace the duplicates with \1, the capturing group! (Some editors write the reference as $1.)
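The same find-and-replace can be sketched in code, assuming Python's re module, where re.sub accepts \1 in the replacement string. The second sentence is a made-up example.

```python
import re

doubled = re.compile(r"\b(\w+)\s+\1\b")

# Replace each 'word word' pair with a single copy of the captured word.
fixed_sample = doubled.sub(r"\1", "The The")                      # 'The'
fixed_sentence = doubled.sub(r"\1", "Paris in the the spring")    # 'Paris in the spring'

# The trailing \b stops 'the there' from being treated as a repeat.
untouched = doubled.sub(r"\1", "the there")                       # 'the there'
```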

PRACTICE!