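reg(ular expressions?|ex(p|es)?)
Search for ‘reg’ followed by either ‘ular expression’ or ‘ex’. The question mark after ‘expressions’ makes the trailing ‘s’ optional; the question mark after (p|es) makes the ‘p’ or ‘es’ at the end of ‘ex’ optional. This matches regular expression, regular expressions, regex, regexp and regexes.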
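As a quick check of the pattern above, here is a minimal sketch using Python's re module (the choice of Python and the list of candidate strings are assumptions for illustration, not part of the original notes):

```python
import re

# Pattern from the notes: 'reg' followed by one of two alternatives.
pattern = re.compile(r"reg(ular expressions?|ex(p|es)?)")

# Hypothetical candidate strings; the last one should be rejected.
candidates = ["regular expression", "regular expressions",
              "regex", "regexp", "regexes", "reg"]

# fullmatch() requires the whole string to match the pattern.
full_matches = [s for s in candidates if pattern.fullmatch(s)]
print(full_matches)
```

A bare ‘reg’ is rejected because neither alternative can match an empty string.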
gray grey griy gr0y graey graAy gra0y gry
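/gr[a-zA-Z0-9]?y/ - match any word that starts with ‘gr’, ends with ‘y’ and has at most one lowercase letter, uppercase letter or digit in between. The question mark makes the character class optional, so ‘gry’ still matches when no character is present. Note that a space typed inside a character class is matched literally, so the class is written without spaces.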
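The optional character class can be tried against the sample line like this. It is a sketch in Python's re module (an assumed engine; the notes do not name one), with the class written without spaces, since a space inside a class is a literal space:

```python
import re

sample = "gray grey griy gr0y graey graAy gra0y gry"

# At most one letter or digit is allowed between 'gr' and 'y'.
matches = re.findall(r"gr[a-zA-Z0-9]?y", sample)
print(matches)

# Spaces inside the class are literal, so this variant also accepts 'gr y'.
with_spaces = re.findall(r"gr[a-z A-Z 0-9]?y", "gr y")
without_spaces = re.findall(r"gr[a-zA-Z0-9]?y", "gr y")
print(with_spaces, without_spaces)
```

‘graey’, ‘graAy’ and ‘gra0y’ are skipped because they have two characters between ‘gr’ and ‘y’.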
gr*y gr+y gr
gr[^a-zA-Z0-9]?y - uses a caret ‘^’ to negate the character class. In this instance ‘gry’, ‘gr*y’ and ‘gr+y’ are highlighted as they feature no letter or digit between ‘gr’ and ‘y’ (or indeed nothing at all).
gr[*+]y - metacharacters such as the star and plus do not need to be escaped within character classes but backslashes must be escaped with a, hmm, backslash.
gr[\\*+]y - matches a literal backslash, star or plus between ‘gr’ and ‘y’. If one needs to use a character that is special inside a character class, like a caret ^, then either place it anywhere other than its usual position (straight after the opening bracket) or escape it.
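The negation and escaping rules above can be sketched as follows, again assuming Python's re module. The sample string is made up for this example (it adds ‘gr\y’ and ‘gr^y’ to exercise the backslash and caret cases):

```python
import re

# Hypothetical sample; a raw string keeps the backslash in 'gr\y' literal.
sample = r"gray gry gr*y gr+y gr\y gr^y"

# Negated class: anything except a letter or digit, or nothing at all.
negated = re.findall(r"gr[^a-zA-Z0-9]?y", sample)

# Star and plus are literal inside a class, no escaping needed.
star_plus = re.findall(r"gr[*+]y", sample)

# A literal backslash inside a class must itself be escaped.
with_backslash = re.findall(r"gr[\\*+]y", sample)

# A caret is literal when it is not first in the class, or when escaped.
caret_moved = re.findall(r"gr[*^]y", sample)
caret_escaped = re.findall(r"gr[\^*]y", sample)

print(negated, star_plus, with_backslash, caret_moved, caret_escaped)
```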
833337 | 837 | 222 | AWordwithan_underscore
[0-9]+ - matches every run of one or more digits: 833337, 837 and 222
([0-9])\1+ - matches a digit repeated two or more times in a row, such as ‘3333’ and ‘222’
\w+ - matches word characters, including digits and underscores in Javascript, but what is considered a word character is dependent on the engine/language used
\D - is equivalent to [^\d] (negated digit)
\W - is equivalent to [^\w] (negated word character)
\S - guess!
.+ - the dot character matches any single character except line breaks. When one adds a plus sign one can match a whole paragraph (up to the next line break)!
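A minimal sketch of these shorthand classes against the sample line, assuming Python's re module (where \w also covers non-ASCII letters by default, which is one example of the engine differences mentioned above). The two-line string is a made-up example:

```python
import re

sample = "833337 | 837 | 222 | AWordwithan_underscore"

# Runs of digits.
digit_runs = re.findall(r"[0-9]+", sample)

# A digit followed by one or more copies of itself; group(0) is the whole run.
repeated = [m.group(0) for m in re.finditer(r"([0-9])\1+", sample)]

# Word characters: letters, digits and underscore.
words = re.findall(r"\w+", sample)

# The dot stops at line breaks, so each line is matched separately.
lines = re.findall(r".+", "line one\nline two")

print(digit_runs, repeated, words, lines)
```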
02/32/03 | 021203 | 02512503 | 02-12-03
\d\d.\d\d.\d\d - using a dot to match a date string with a variable separator. It may seem OK to use a dot as a catch-all for the separator but as you can see this will match ANY character between the numbers, so ‘02512503’ matches too.
[01]\d[-/.][0-3]\d[-/.]\d\d - to match a date you would also have to take into consideration dates that cannot exist, e.g. 50/12/03. Here we specify that the first character can only be 0 or 1, followed by another digit, and that the separator must be a hyphen, slash or dot. Unfortunately this pattern falls down at the next stage. The day allows a first digit between 0 and 3, meaning a day of ‘39’ could be valid. Let’s return to this later.
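To see both the loose and the stricter date patterns at work, here is a sketch assuming Python's re module. The stricter pattern is written with the separator class as [-/.], without spaces:

```python
import re

sample = "02/32/03 | 021203 | 02512503 | 02-12-03"

# The dot accepts any separator, including the digit 5 in '02512503'.
loose = re.findall(r"\d\d.\d\d.\d\d", sample)

# Restricting the separators and the leading digits is tighter,
# but a day of 32 still slips through, as the notes point out.
strict = re.findall(r"[01]\d[-/.][0-3]\d[-/.]\d\d", sample)

# An impossible month such as 50 is rejected.
impossible = re.search(r"[01]\d[-/.][0-3]\d[-/.]\d\d", "50/12/03")

print(loose, strict, impossible)
```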
abc | 89790983
^a - a caret in this instance is referred to as an anchor. It pins the match to the start of the string, so this only matches something that starts with an ‘a’.
c$ - like the caret above, this is an anchor, but it checks for a string that ends with a ‘c’.
^\s+$ - match a string that consists entirely of whitespace. To match leading and trailing whitespace instead, use ^\s+|\s+$
This island is beautiful
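\bis\b - \b denotes a word boundary, a zero-width anchor that functions like a caret or dollar sign but at the edges of words. Using this allows one to perform a whole word search: only the standalone ‘is’ matches, not the ‘is’ inside ‘This’ or ‘island’.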
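A short sketch of the whole word search on the sample sentence, assuming Python's re module:

```python
import re

sample = "This island is beautiful"

# Without boundaries, 'is' is also found inside 'This' and 'island'.
plain_count = len(re.findall(r"is", sample))

# With boundaries, only the standalone word matches.
whole_word_positions = [m.start() for m in re.finditer(r"\bis\b", sample)]

print(plain_count, whole_word_positions)
```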
Cat Dog Mouse Fish
cat|dog - alternation, match either cat or dog using a pipe (the sample text is capitalised, so switch on case-insensitive matching)
cat|dog|mouse|fish - add more alternatives to the mix
\b(cat|dog)\b - search for the whole word cat or dog between word boundaries. One must use parentheses around the search terms, otherwise the search will look for \bcat OR dog\b
etValue Set Get GetValue
\b(get|set)(value)?\b - search for function names that start with either ‘get’ or ‘set’ and are optionally followed by ‘value’. These terms are surrounded by word boundaries. The sample text is capitalised, so the search needs to be case-insensitive.
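The effect of the parentheses on alternation can be sketched as follows, assuming Python's re module with its IGNORECASE flag. The strings ‘hotdog catalog’ and the function-name sample are made up for this example:

```python
import re

animals = "Cat Dog Mouse Fish"
either = re.findall(r"cat|dog", animals, re.IGNORECASE)

# Without parentheses the boundaries bind to one alternative each:
# '\bcat' OR 'dog\b', so parts of longer words still match.
ungrouped = re.findall(r"\bcat|dog\b", "hotdog catalog")
grouped = re.findall(r"\b(cat|dog)\b", "hotdog catalog")

# Hypothetical function-name sample for the get/set pattern.
names = "SetValue Set Get GetValue GetValues"
accessors = [m.group(0) for m in
             re.finditer(r"\b(get|set)(value)?\b", names, re.IGNORECASE)]

print(either, ungrouped, grouped, accessors)
```

‘GetValues’ is rejected because the trailing ‘s’ means there is no word boundary after ‘Value’ or after ‘Get’.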
colour color creamcolour Nov November
colou?r - not sure if the text received is from a Briton or an American? No problem. The question mark makes the preceding ‘u’ optional.
nov(ember)? - the month may(sic) be written in shorthand or fully. Wrap the optional part in a group and follow it with a question mark to grab both.
colou{0,1}r - an alternative way to write optional expressions using curly braces
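The three optional forms can be checked against the sample line with a sketch like this, assuming Python's re module (IGNORECASE is used because the sample writes ‘Nov’ with a capital):

```python
import re

sample = "colour color creamcolour Nov November"

# '?' and '{0,1}' are two spellings of the same quantifier.
with_question_mark = re.findall(r"colou?r", sample)
with_braces = re.findall(r"colou{0,1}r", sample)

# An optional group; group(0) returns the whole match, not just the group.
months = [m.group(0) for m in
          re.finditer(r"nov(ember)?", sample, re.IGNORECASE)]

print(with_question_mark, with_braces, months)
```

The third ‘colour’ comes from inside ‘creamcolour’, since the pattern has no word boundaries.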
<section> <html> <body> <p><a> </section>
<[a-zA-Z][a-zA-Z0-9]*> - match html tags using the star sign. The star sign repeats the second character class any number of times, or not at all. This is important as the ‘<p>’ tag will be matched by the first character class but has nothing left for the second. The angled brackets have no special meaning so do not need to be escaped.
</?[a-zA-Z][a-zA-Z0-9]*> - add an optional forward slash (with question mark) to get closing tags too!
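Both tag patterns can be run over the sample line with this sketch, assuming Python's re module and character classes written without spaces:

```python
import re

sample = "<section> <html> <body> <p><a> </section>"

# One letter, then zero or more letters or digits: opening tags only.
opening = re.findall(r"<[a-zA-Z][a-zA-Z0-9]*>", sample)

# An optional forward slash after '<' picks up closing tags as well.
all_tags = re.findall(r"</?[a-zA-Z][a-zA-Z0-9]*>", sample)

print(opening)
print(all_tags)
```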
1000 | 9999 | 100 | 99999
\b[1-9][0-9]{3}\b - to limit the number of repetitions one can use curly braces. The format is {min,max}, or {n} for an exact count. The example here searches for a digit between 1 and 9 followed by exactly 3 additional digits between 0 and 9. This allows us to find any number between 1000 and 9999. The word boundaries ensure whole numbers.
\b[1-9][0-9]{2,4}\b - this expression will match any number between 100 and 99999. There must be a minimum of 2 additional digits and a maximum of 4, giving the range.
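A sketch of the two repetition ranges on the sample line, assuming Python's re module:

```python
import re

sample = "1000 | 9999 | 100 | 99999"

# Exactly three digits after the leading 1-9: four-digit numbers only.
four_digits = re.findall(r"\b[1-9][0-9]{3}\b", sample)

# Two to four digits after the leading 1-9: 100 up to 99999.
three_to_five = re.findall(r"\b[1-9][0-9]{2,4}\b", sample)

print(four_digits)
print(three_to_five)
```

Without the word boundaries the first pattern would also pick ‘9999’ out of ‘99999’.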
This is a <EM>first</EM> test
<.+> - if we want to grab html tags using a dot and plus sign, the engine will continue to match all characters after the first angled bracket, including the closing bracket and the rest of the string. It then proceeds to ‘backtrack’ until it finds the last closing bracket, giving us a value of ‘<EM>first</EM>’. Not what we want.
<.+?> - adding a question mark replaces the ‘greediness’ with ‘laziness’. Using a lazy search pattern tells the engine to match as little as possible. In this case it will match the first character after the opening angled bracket. It then proceeds to check for the closing angled bracket, taking one more character each time that check fails.
<[^>]+> - instead of using an expression that backtracks, which is bad for performance, one can use a negated character class. This expression will match all characters that are not closing angled brackets, then the closing bracket itself.
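The difference between the greedy, lazy and negated-class versions can be seen with this sketch, assuming Python's re module:

```python
import re

sample = "This is a <EM>first</EM> test"

# Greedy: runs to the end, then backtracks to the LAST '>'.
greedy = re.findall(r"<.+>", sample)

# Lazy: stops at the first '>' it can reach.
lazy = re.findall(r"<.+?>", sample)

# Negated class: can never run past a '>' in the first place.
negated = re.findall(r"<[^>]+>", sample)

print(greedy)
print(lazy)
print(negated)
```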
color=red color=blue color=green axaxax bxbxbx cxcxcx
color=(?:red|blue|green)? - capturing groups, denoted by parentheses, can be converted into non-capturing groups by adding a question mark and colon to the start of the group. The question mark after the group is separate: it makes the whole group optional.
([a-z])x\1x\1x - using a backslash and a number allows one to reference capturing groups. This can be used to reuse what a group matched later within the same expression. The numbers refer to the order of the opening parentheses from left to right, so the first capturing group from the left will be \1 and so forth. In this example we are capturing any lowercase letter and then requiring that same letter twice more with the \1 reference.
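Here is a sketch of non-capturing groups and backreferences on the sample line, assuming Python's re module. The string ‘axbxcx’ is a made-up counterexample:

```python
import re

sample = "color=red color=blue color=green axaxax bxbxbx cxcxcx"

# With no capturing group, findall() returns the whole matches.
colours = re.findall(r"color=(?:red|blue|green)?", sample)

# (?:...) groups without capturing; (...) stores what it matched.
non_capturing = re.search(r"color=(?:red|blue|green)", sample).groups()
capturing = re.search(r"color=(red|blue|green)", sample).groups()

# \1 must repeat exactly what the first group captured.
triples = [m.group(0) for m in re.finditer(r"([a-z])x\1x\1x", sample)]
mixed = re.search(r"([a-z])x\1x\1x", "axbxcx")

print(colours, non_capturing, capturing, triples, mixed)
```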
Testing <B class="king"><I class="ming">bold italic</I></B>
<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1> - this scary looking pattern grabs opening and closing tags with all content in between. We’re not quite ready to fully understand this but stick with it. Each star may match nothing at all. The [^>]* advances through the string up to the closing bracket and so allows the tag to carry attributes. The \1 makes the closing tag name match the opening one. When we’re ready we can write this stuff from memory.
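To take some of the fear out of it, the pattern can be run on the sample line. This is a sketch assuming Python's re module, with straight quotes in the sample and the class written as [A-Z0-9]:

```python
import re

sample = 'Testing <B class="king"><I class="ming">bold italic</I></B>'

pattern = r"<([A-Z][A-Z0-9]*)\b[^>]*>.*?</\1>"
match = re.search(pattern, sample)

# group(1) is the tag name captured at the start and reused by \1.
tag_name = match.group(1)

# group(0) is the whole element, from '<B ...>' to the matching '</B>'.
element = match.group(0)

print(tag_name)
print(element)
```

The lazy .*? first reaches ‘</I>’, but \1 holds ‘B’, so the engine keeps going until it finds ‘</B>’.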
The The
\b(\w+)\s+\1\b - using capturing groups we can remove repeated words. Here we search for a word, at least one whitespace character, and then the same word again, all between word boundaries. Using ‘find and replace’ in a text editor allows one to simply replace each duplicate pair with \1, the capturing group!
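The same find and replace can be sketched in code, assuming Python's re module, whose sub() function accepts \1 in the replacement text. The longer sentence is a made-up example built around the ‘The The’ sample:

```python
import re

# Hypothetical sentence containing two doubled words.
text = "The The cat sat on on the mat"

# Replace each 'word word' pair with a single copy of the captured word.
deduplicated = re.sub(r"\b(\w+)\s+\1\b", r"\1", text)

print(deduplicated)
```

The match is case-sensitive, so ‘on the’ is left alone and ‘The the’ would not count as a duplicate unless case-insensitive matching is switched on.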
PRACTICE!