An Introduction to GREP

What is GREP?

GREP is a command-line search tool. It searches through files and/or standard input. That’s it, UNIX-style!

How did GREP get its name?

In the ed editor, to search for a regular expression and print out the results one had to type /$SEARCHTERM/p. To make this search global, that is, to apply it to every line, one added the g prefix like so: g/$SEARCHTERM/p. This became such a regular occurrence that someone created a smaller program that only performed these global regular expression searches. The name GREP merely describes the search pattern: g/re/p (global/regular expression/print).

GREP flags

A description of GREP’s flags.

The Wildcard Character

GREP uses regular expressions to complete searches, but let’s start with the wildcard: a period (.) matches any single character.

>cat file

big
bad bug
bag
bigger
boogy

>grep b.g file
big
bad bug
bag
bigger

‘boogy’ doesn’t match in this case because the wildcard matches precisely one character, and ‘boogy’ has two characters between the ‘b’ and the ‘g’.
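As a minimal sketch of the wildcard behaviour described above, the following recreates the sample file in a temporary directory and counts the matches. The directory and the variable names are illustrative only, not part of the original example.

```shell
# Recreate the sample file in a throwaway directory (hypothetical location).
tmpdir=$(mktemp -d)
printf '%s\n' big 'bad bug' bag bigger boogy > "$tmpdir/file"

# -c prints the number of matching lines instead of the lines themselves.
count=$(grep -c 'b.g' "$tmpdir/file")
echo "lines matching b.g: $count"

# -v inverts the match, showing the lines the pattern misses.
missed=$(grep -v 'b.g' "$tmpdir/file")
echo "lines not matching b.g: $missed"

rm -rf "$tmpdir"
```

Four of the five lines match; only boogy is left over.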

The Repetition Character

We use the asterisk (*) to match repetitions of a character. Here is a description of how it works:

The expression consisting of a character followed by a star matches any number (possibly zero) of repetitions of that character. In particular, the expression “.*” matches any string, and hence acts as a “wildcard”.

Observe :

>cat file
big
bad bug
bag
bigger
boogy

>grep "b.*g" file
big
bad bug
bag
bigger
boogy

>grep "b.*g." file
bigger
boogy

>grep "ggg*" file
bigger

The repetition character does not behave as a wildcard in GREP: it matches zero or more repetitions of the character that precedes it. The pattern “g*” matches the strings “”, “g”, “gg”, etc. Likewise, the pattern “gg*” matches “g”, “gg”, “ggg”, so “ggg*” matches “gg”, “ggg”, “gggg” and so on.
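The difference between “g*”, “gg*” and “ggg*” can be checked with a short sketch like the one below. The input words are made up for illustration.

```shell
# 'g*' can match the empty string, so it matches every line, even one with no g.
no_g=$(printf '%s\n' xyz | grep -c 'g*')

# 'gg*' needs at least one g; 'ggg*' needs at least two in a row.
one_or_more=$(printf '%s\n' xyz big bigger | grep -c 'gg*')
two_or_more=$(printf '%s\n' xyz big bigger | grep -c 'ggg*')

echo "g*: $no_g  gg*: $one_or_more  ggg*: $two_or_more"
```

The line xyz matches “g*” even though it contains no g at all, which is why a bare starred character is rarely useful on its own.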

Taking it Further - Regular Expressions

Wildcards are a start but the idea can be taken further. For example, suppose we want an expression that matches Frederic Smith or Fred Smith. In other words, the letters eric are optional.

First, we introduce the concept of an “escaped” character.

An escaped character is a character preceded by a backslash. The preceding backslash does one of the following: (a) removes an implied special meaning from a character, or (b) adds special meaning to a “non-special” character.

Examples

To search for a line containing the text hello.gif, the correct command is grep 'hello\.gif' file, since grep 'hello.gif' file will also match lines containing “hello-gif”, “hello1gif”, “helloagif”, etc.
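A small sketch of the difference, using made-up input lines. The -F flag shown at the end is an alternative not mentioned above: it tells GREP to treat the whole pattern as a fixed string, so no escaping is needed.

```shell
lines='hello.gif
hello-gif
hello1gif'

# Unescaped dot: matches any character in that position.
unescaped=$(printf '%s\n' "$lines" | grep -c 'hello.gif')

# Escaped dot: matches only a literal period.
escaped=$(printf '%s\n' "$lines" | grep -c 'hello\.gif')

# -F treats the pattern as a fixed string rather than a regular expression.
fixed=$(printf '%s\n' "$lines" | grep -F -c 'hello.gif')

echo "unescaped: $unescaped  escaped: $escaped  fixed: $fixed"
```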

Now we move on to grouping expressions, in order to find a way of making an expression to match Fred or Frederic. First we start with the ? operator.

An expression of a character followed by an escaped question mark matches one or zero instances of that character.

bugg\?y matches “bugy” and “buggy” but not “bugggy”.

We now move on to “grouping” expressions. In our example, we want to make the whole string “eric” following “Fred” optional; we don’t just want one optional character.

An expression surrounded by “escaped” parentheses is treated as a single character.

Fred\(eric\)\? Smith matches “Fred Smith” or “Frederic Smith”. \(abc\)* matches “abc”, “abcabcabc” etc. It’s worth pointing out at this moment that we need to enclose the search term in quotes so that the shell doesn’t misinterpret white space or stars. Without quotes, the previous example would search for the pattern Fred(eric)? in the file “Smith”.
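The following sketch runs the grouped pattern against a few made-up names and also shows what the shell hands to a command when the quotes are left off. It assumes GNU grep, where \? is available in basic regular expressions.

```shell
names='Fred Smith
Frederic Smith
Freddy Smith
Frederic Jones'

# Quoted: grep receives the pattern exactly as written.
matched=$(printf '%s\n' "$names" | grep 'Fred\(eric\)\? Smith')
echo "$matched"

# Unquoted: the shell removes the backslashes and splits on the space,
# so a command would receive two separate arguments.
args=$(printf '[%s]' Fred\(eric\)\? Smith)
echo "$args"
```

The second echo prints the two arguments in brackets, which makes it visible that “Smith” would be taken as a file name.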

More on Regular Expressions

Matching a List of Characters

To match one of a selection of characters use []. [Hh]ello matches lines containing “hello” or “Hello”. Ranges of characters are also permitted.

[0-3] is the same as [0123]
[a-k] is the same as [abcdefghijk]
[A-C] is the same as [ABC]
[A-Ca-k] is the same as [ABCabcdefghijk]
[[:alpha:]] is the same as [a-zA-Z]
[[:upper:]] is the same as [A-Z]
[[:lower:]] is the same as [a-z]
[[:digit:]] is the same as [0-9]
[[:alnum:]] is the same as [0-9a-zA-Z]
[[:space:]] matches any white space including tabs

The named classes are preferable to the explicit ranges, as they do not depend on the character ordering of the current locale. Also note that [] can be negated by placing a caret (^) as the first character.

grep "([^()]*)a" file returns any line containing a pair of parentheses that are innermost and are followed by the letter “a”. It would match these lines:

(hello)a
(asdfasdfasdf asdf ffasdfsdf)a

But not :

x=(y+2(x+1))a
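Here is a brief sketch of both ideas from this section, a named character class and a negated list, run against made-up input. In basic GREP, unescaped parentheses are ordinary characters, which is why the pattern below needs no backslashes.

```shell
sample='(hello)a
(asdfasdfasdf asdf ffasdfsdf)a
x=(y+2(x+1))a'

# [^()]* allows anything except parentheses between the outer pair.
inner=$(printf '%s\n' "$sample" | grep -c '([^()]*)a')

# A named class combined with line anchors: lines made up only of digits.
digits_only=$(printf '%s\n' abc 123 'a1 b2' | grep '^[[:digit:]][[:digit:]]*$')

echo "innermost pairs followed by a: $inner"
echo "digit-only lines: $digits_only"
```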

Matching a Specific Number of Repetitions of a Pattern

In order to limit the number of repetitions of a pattern we use escaped curly braces \{\}. To search for a 7-digit phone number you could try this: grep "[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}" file. This will match any three digits that are followed by an optional space or hyphen and then a further four digits.
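The phone number pattern can be tried out as in the sketch below, with made-up numbers and GNU grep assumed for \?. Because the pattern is not anchored, it also matches inside longer runs of digits; the -x flag, which requires the whole line to match, is one way to tighten it.

```shell
pattern='[[:digit:]]\{3\}[ -]\?[[:digit:]]\{4\}'
numbers='555-1234
555 1234
5551234
55-1234'

# Three of the four lines contain 3 digits, an optional separator, then 4 digits.
phones=$(printf '%s\n' "$numbers" | grep -c "$pattern")

# A ten-digit string still matches, because the pattern may match a substring...
loose=$(printf '%s\n' 5551234999 | grep -c "$pattern")

# ...but not when -x requires the whole line to match.
# (|| true keeps the exit status at zero when grep finds nothing.)
strict=$(printf '%s\n' 5551234999 | grep -xc "$pattern" || true)

echo "phones: $phones  loose: $loose  strict: $strict"
```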

Nailing it Down to Start of the Line and End of the Line

So here’s what we want: we need a line of text with the word ‘hello’ preceded by some whitespace and nothing after it. Let’s look at a simple example :

>cat file 
    hello
hello world
    hhello
>grep hello file
    hello
hello world
    hhello

What went wrong? GREP simply returned every line with ‘hello’ in it. We need to be more specific to get what we want. The $ character matches the end of the line. The ^ character matches the beginning of the line.

Let’s change the GREP command above to one which will work. grep "^[[:space:]]*hello[[:space:]]*$" file returns one line from the previous example, but it would also return a line containing ‘hello’ with no whitespace at the start: because of the ‘*’, the leading whitespace is optional rather than required. grep "^From.*mscharmi" /var/spool/mail/elflord is another example; it searches the mail folder for headers from a specific person. Surely one can see how this could be useful?
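The sketch below contrasts optional and required leading whitespace. The second pattern, which repeats [[:space:]] once before the starred copy, is one possible way to insist on at least one whitespace character; it is offered as an illustration rather than as the original author’s method.

```shell
optional='^[[:space:]]*hello[[:space:]]*$'
required='^[[:space:]][[:space:]]*hello[[:space:]]*$'
sample='    hello
hello world
    hhello'

# Both patterns pick out only the indented "hello" line from the sample.
opt_count=$(printf '%s\n' "$sample" | grep -c "$optional")
req_count=$(printf '%s\n' "$sample" | grep -c "$required")

# A bare "hello" with no indentation: accepted by the first, rejected by the second.
bare_opt=$(printf '%s\n' hello | grep -c "$optional")
bare_req=$(printf '%s\n' hello | grep -c "$required" || true)

echo "sample: $opt_count/$req_count  bare hello: $bare_opt/$bare_req"
```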

This or That: Matching One of Two Strings

An expression consisting of two expressions separated by the escaped or operator \| matches lines containing either of those two expressions.

N.B. This must be enclosed within single or double quotes. grep "cat\|dog" file matches lines containing ‘cat’ or ‘dog’. grep "I am a \(cat\|dog\)" file matches lines containing the string “I am a cat” or “I am a dog”.
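A short sketch with made-up lines, assuming GNU grep for \|. Note that the ungrouped pattern matches ‘cat’ anywhere in a line, including inside a longer word.

```shell
animals='I am a cat
I am a dog
I am a bird
concatenate'

# Matches any line containing "cat" or "dog", even as part of another word.
either=$(printf '%s\n' "$animals" | grep -c 'cat\|dog')

# Grouping limits the alternation to the end of the fixed phrase.
grouped=$(printf '%s\n' "$animals" | grep -c 'I am a \(cat\|dog\)')

echo "either: $either  grouped: $grouped"
```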

Backpedalling and Backreferences

How would one search for a certain substring that appears in more than one place? An example is the heading tag in HTML. A search for all heading tags, H1-6, could be written as <H[1-6]>.*</H[1-6]>, but this doesn’t fully work, as we might end up matching incorrectly paired headers, such as an <H1> closed by </H2>. To match correctly paired tags we need to use backreferences.

The expression \n, where n is a number, matches the contents of the nth set of parentheses in the expression.

<H\([1-6]\).*</H\1> matches what we were trying to match before. The escaped ‘1’ after the second ‘H’ refers to the first group of the pattern. Groups are defined by escaped parentheses, and in this case we have captured the number that follows the ‘H’ in the opening tag. We then reference that group in the closing tag.
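To see the difference between the two patterns, here is a sketch using three made-up heading lines, one of which is deliberately mismatched.

```shell
html='<H1>Title</H1>
<H2>Broken</H3>
<H3>Section</H3>'

# Without a backreference, the mismatched <H2>...</H3> line also matches.
loose=$(printf '%s\n' "$html" | grep -c '<H[1-6]>.*</H[1-6]>')

# \1 must repeat whatever digit the first group captured.
paired=$(printf '%s\n' "$html" | grep '<H\([1-6]\).*</H\1>')

echo "loose matches: $loose"
echo "$paired"
```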

Some Crucial Details

Special Characters

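Certain characters have a special meaning to GREP and must be escaped with a backslash to be matched literally. It is also worth pointing out at this time that EGREP is a similar tool that uses extended regular expressions; these are no more powerful than GNU GREP’s basic ones, but they have a longer list of metacharacters that need escaping. In basic GREP the following characters need to be escaped:

\ . * [ ] ^ $

A plain ? is literal in basic GREP; it becomes special only when written as \? or when the -E flag is used.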

Quotes

Single quotes are the safest to use as they protect the regular expression from the shell. For example, grep "!" file will often produce an error, as the shell thinks that “!” refers to the shell history command. On the other hand, if one wants to use shell variables in the search then it is necessary to use double quotes, as in grep "$HOME" file. Should you try grep '$HOME' file instead, you will search file for the string ‘$HOME’ rather than the variable’s value.
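The effect of the two quoting styles can be sketched as follows. GREETING is a hypothetical variable introduced only for this illustration, chosen instead of $HOME so that the result does not depend on the machine.

```shell
GREETING=hello   # hypothetical variable, for illustration only
text='hello world
the $GREETING variable'

# Double quotes: the shell substitutes the value, so grep searches for "hello".
expanded=$(printf '%s\n' "$text" | grep "$GREETING")

# Single quotes: grep receives the characters $GREETING unchanged.
literal=$(printf '%s\n' "$text" | grep '$GREETING')

echo "double quotes found: $expanded"
echo "single quotes found: $literal"
```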

Extended Regular Expression Syntax

We previously mentioned the existence of egrep, which allows extended regular expressions. Funnily enough, egrep actually has less functionality, as it is designed for compatibility with traditional egrep. A better way to run an extended GREP is to use the ‘-E’ flag.

grep                        grep -E                    used in egrep?
a\+                         a+                         yes
a\?                         a?                         yes
expression1\|expression2    expression1|expression2    yes
\(expression1\)             (expression1)              yes
\{m,n\}                     {m,n}                      no
\{,n\}                      {,n}                       no
\{m,\}                      {m,}                       no
\{m\}                       {m}                        no
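As a quick check of the correspondence between the two syntaxes, the sketch below runs equivalent basic and extended patterns over made-up input. GNU grep is assumed for \| in the basic form.

```shell
words='bugy
buggy
bugggy'

# Exactly two g's: escaped braces in basic syntax, bare braces with -E.
basic=$(printf '%s\n' "$words" | grep 'bug\{2\}y')
extended=$(printf '%s\n' "$words" | grep -E 'bug{2}y')

# Alternation: \| in basic syntax, | with -E.
basic_alt=$(printf '%s\n' cat dog cow | grep -c 'cat\|dog')
extended_alt=$(printf '%s\n' cat dog cow | grep -E -c 'cat|dog')

echo "basic: $basic  extended: $extended"
echo "alternation: $basic_alt / $extended_alt"
```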

The Simple Example

GREP is usually first used to search through the contents of one’s files. To find the file that contains the password to another computer you could run grep password *.

The output will list every matching line, prefixed with the name of the file it was found in, e.g.

notes : password for the system "bigvax" is "guest", remember to
notes : delete this message, as it is a bad idea to keep passwords
message : Do you know the password for bigvax? I forgot what

The above example found two files that contained the term and one of those files contained it twice.

Search for Uppercase and Lowercase Words

The previous search is case-sensitive, so it matches password but not Password or PASSWORD. To make the search case-insensitive use the -i flag, like so: grep -i password *.
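A minimal sketch of the effect of -i, using made-up lines:

```shell
notes='password: guest
Password hint
PASSWORD reset
nothing here'

# Case-sensitive by default: only the all-lowercase line matches.
exact=$(printf '%s\n' "$notes" | grep -c password)

# -i ignores case, so all three spellings match.
any_case=$(printf '%s\n' "$notes" | grep -i -c password)

echo "exact: $exact  ignoring case: $any_case"
```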

Using GREP as a filter

GREP can also be used on standard input as a filter. File names won’t be output, as GREP doesn’t know the name of what it is reading in this instance. Example: cat document.txt | grep -i "$SEARCHTERM". The cat is redundant here, since GREP could read document.txt directly, but the example shows the mechanism: it pipes the content of document.txt into GREP, which makes a case-insensitive search for the search term.

Forcing GREP to Print a Filename

GREP does not print the filename if only a single file is specified. For example: grep password message would output Do you know the password for bigvax? I forgot what.

The output was not prefixed with the filename as it had been in the earlier example. To make sure a filename is printed one must provide at least two files to search. How do you do this if you only want to search one specific file? You can supply a second file that is always there and always empty: grep password message /dev/null

This command is convenient when writing shell scripts where you do not know how many files you will be told to search. A simple example of such a script that prints the filenames with the results would be:

#!/bin/sh
# "$@" passes each argument through intact; /dev/null forces filename prefixes.
grep -i "$@" /dev/null
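The /dev/null trick can be observed with the sketch below, which writes a one-line file called message into a temporary directory. The directory and the file content are stand-ins for the earlier example.

```shell
tmpdir=$(mktemp -d)
printf '%s\n' 'Do you know the password for bigvax?' > "$tmpdir/message"

# One file argument: the matching line is printed on its own.
plain=$(cd "$tmpdir" && grep password message)

# Two file arguments: each match is prefixed with the file it came from.
named=$(cd "$tmpdir" && grep password message /dev/null)

echo "$plain"
echo "$named"
rm -rf "$tmpdir"
```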

Showing Lines that don’t Contain a Pattern

A simple use of GREP is to remove lines that contain a pattern. To remove all lines that contain the word “junk”, use the -v option: grep -v junk.

This is typically used as a filter : grep -i password * | grep -v junk. Another example is to eliminate excess lines. Suppose one wants to search for the word “every,” but does not want “everyone,” “everybody,” or “everywhere.” The following would suffice : grep every * | grep -v one | grep -v body | grep -v where.

!! | grep -v ignoreThisWord : this command is handy as you can repeat the last command and remove lines that contain certain words.

find . -print | grep -v '\.old$' | grep -v '[%~]$' : This command lists files but excludes backups, that is, any name ending in .old, % or ~.
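The chained filters can be sketched on made-up input as below. The second half shows why the period before old is worth escaping: left as a wildcard, it would also throw away an unrelated name such as threshold. Both notes.old and threshold are hypothetical file names.

```shell
text='every day
everyone knows
everybody came
everywhere else
some junk'

# Keep lines with "every", then drop the unwanted longer words one by one.
kept=$(printf '%s\n' "$text" | grep every | grep -v one | grep -v body | grep -v where)
echo "kept: $kept"

# Unescaped dot: "hold" at the end of threshold matches '.old$' and is dropped too.
loose=$(printf '%s\n' notes.old threshold | grep -v '.old$' || true)

# Escaped dot: only names that really end in .old are dropped.
strict=$(printf '%s\n' notes.old threshold | grep -v '\.old$')
echo "loose keeps: [$loose]  strict keeps: [$strict]"
```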

Searching for a Hyphen

Looking for certain terms can be difficult. How would one search for ‘-i’? We already know that ‘-i’ is an option one can supply to GREP. When we run grep -i file, GREP treats ‘-i’ as the option and ‘file’ as the search term, and, with no file named, reads standard input. This means that nothing will happen until one presses ctrl-d.
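Two common ways around this, not spelled out above, are the -e flag, which marks the next argument as the pattern, and a bare --, which marks the end of the options. A minimal sketch with made-up input:

```shell
text='use the -i flag
no option here'

# -e says "the next argument is the pattern", even if it starts with a hyphen.
with_e=$(printf '%s\n' "$text" | grep -e '-i')

# -- ends option parsing, so everything after it is pattern and file names.
with_dashes=$(printf '%s\n' "$text" | grep -- '-i')

echo "$with_e"
echo "$with_dashes"
```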

An Introduction to GREP by …
UNIX and Linux : GREP by Elflord
15 Practical Grep Command Examples in Linux/Unix