Let’s Learn SED!

A Description

What is SED? The best place to look is the man page.

The sed utility reads the specified files, or the standard input if no files are specified, modifying the input as specified by a list of commands. The input is then written to the standard output. A single command may be specified as the first argument to sed. Multiple commands may be specified by using the -e or -f options. All commands are applied to the input in the order they are specified regardless of their origin.

We supply SED with text-based input from either stdin or from files, issue a single command, or several using the -e or -f flags, and send the result to stdout. Seems straightforward. SED uses regular expressions to perform more complex transformations, utilising the same basic regular expression syntax as GREP. It’s worth noting that SED also has a -E flag that switches to extended regular expression syntax.
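To make the -E remark a little more concrete, here is a small sketch that performs the same swap with basic and with extended syntax. The sample text 'aaa bbb' is made up for illustration. Under basic syntax, groups need backslashed parentheses; with -E they are written bare.

```shell
# Basic syntax: groups are written \( \)
basic=$(printf 'aaa bbb\n' | sed 's/\(a*\) \(b*\)/\2 \1/')
# Extended syntax (-E): groups are written ( )
extended=$(printf 'aaa bbb\n' | sed -E 's/(a*) (b*)/\2 \1/')
echo "$basic"
echo "$extended"
```

Both commands should print the two words in swapped order.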

In case you just wanted to know what the name meant; SED is short for Stream EDitor.

Let’s Delve

The SED command we will use most is substitute, written in the following way : s/regex/replacement_text/{flags}. The ‘s’ at the start of the command is short for substitute, or search and replace. Let’s start with a simple example :

>cat file 
I have three dogs and two cats
>sed -e 's/dog/cat/g' -e 's/cat/elephant/g' file
I have three elephants and two elephants
>

Hmm. In this example we use two SED expressions that are performed in order from left to right. First we substitute all instances of ‘dog’ with ‘cat’ and then replace all instances of ‘cat’ with ‘elephant’, which is why the dogs end up as elephants too. The result is printed to stdout. Note that the ‘g’ flag stands for global, meaning that every match on each line is changed rather than only the first match on the line. Super! We’ve made a jolly good start.
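Since the order of the expressions matters, it can help to run the same two substitutions both ways round. This is only a sketch; the variable names are made up for illustration.

```shell
input='I have three dogs and two cats'
# dog -> cat first, then every cat (old and new) -> elephant
forward=$(printf '%s\n' "$input" | sed -e 's/dog/cat/g' -e 's/cat/elephant/g')
# cat -> elephant first, then dog -> cat
reversed=$(printf '%s\n' "$input" | sed -e 's/cat/elephant/g' -e 's/dog/cat/g')
echo "$forward"
echo "$reversed"
```

With the expressions reversed, the dogs become cats and stay cats, because the cat substitution has already run by then.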

Let’s Segue : Sed Regular Expressions

SED utilises the same regular expressions as GREP so we should be good. Let’s have a gander at them in handy list form :

Deeper : Substitute and Delete Commands

Firstly, the way one usually uses SED is as follows :

>sed -e 'command 1' -e 'command 2' -e 'command 3' file
>{shell command} | sed -e 'command 1' -e 'command 2'
>sed -f sedscript.sed file
>{shell command} | sed -f sedscript.sed

SED can operate on files or stdin and commands can be issued on the command line or from a file. Please take heed :

if the commands are read from a file, trailing whitespace can be fatal; in particular, it can cause scripts to fail for no apparent reason. I recommend editing sed scripts with an editor such as vim, which can show end-of-line characters so that you can see trailing whitespace at the end of a line.
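As a rough sketch of the -f form, the two substitutions from the first example can be written to a script file and applied from there. The temporary file created with mktemp stands in for sedscript.sed and is purely illustrative.

```shell
# Write one sed command per line, with no trailing whitespace
script=$(mktemp)
printf '%s\n' 's/dog/cat/g' 's/cat/elephant/g' > "$script"
# Feed input on stdin and read the commands from the script file
from_file=$(printf 'I have three dogs and two cats\n' | sed -f "$script")
rm -f "$script"
echo "$from_file"
```

The commands in the file run in order, exactly as the -e expressions did on the command line.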

The Substitute Command

The format for the substitute command is as follows : [address1[, address2]]s/pattern/replacement/[flags].

The flags can be any of the following :

* n | replace nth instance of pattern with replacement
* g | replace all instances of pattern with replacement
* p | write pattern space to stdout if a successful substitution takes place
* w file | write the pattern space to file if a successful substitution takes place

One can issue a command without any flags, which simply changes the first match on each line of the supplied input. The addresses can be either a regular expression, enclosed by forward slashes, or a line number. With a single address, the substitution takes place only on lines matching that address; with two addresses, separated by a comma, the substitution is applied to the range of lines from the first address to the second, inclusive.
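The flags are easiest to compare side by side. The following sketch uses a made-up input line; the -n option, which suppresses sed's automatic printing, is not covered above and is used here only so that the effect of the p flag is visible.

```shell
line='one one one'
# No flag: only the first match on the line changes
first=$(printf '%s\n' "$line" | sed 's/one/two/')
# Numeric flag: only the 2nd match changes
second=$(printf '%s\n' "$line" | sed 's/one/two/2')
# g flag: every match changes
every=$(printf '%s\n' "$line" | sed 's/one/two/g')
# p flag with -n: only lines where a substitution happened are printed
changed_only=$(printf 'aaa\nbbb\n' | sed -n 's/a/A/p')
printf '%s\n' "$first" "$second" "$every" "$changed_only"
```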

The Delete Command

The syntax is thus : [address1[, address2]]d. This command deletes the content of the pattern space, so the addressed lines never reach stdout.
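A common use of d with a regular expression address is stripping comments and blank lines. This is a minimal sketch on made-up input, not something taken from the man page.

```shell
config=$(printf '%s\n' '# a comment' 'name=value' '' 'other=thing')
# Delete lines starting with '#', then delete empty lines
stripped=$(printf '%s\n' "$config" | sed -e '/^#/d' -e '/^$/d')
printf '%s\n' "$stripped"
```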

>cat file
http://www.foo.com/mypage.html
>sed -e 's@http://www.foo.com@http://www.bar.net@' file

http://www.bar.net/mypage.html

This command uses ‘@’ as the delimiter instead of forward slashes. Sed accepts any character other than a backslash or newline as the delimiter; @, %, ,, ; and : are common alternatives.
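The payoff of an alternative delimiter is clearest when the pattern itself is full of slashes. In this sketch the path is made up, and '#' is chosen as the delimiter simply because it does not occur in the text.

```shell
path='/usr/local/bin/tool'
# With / as delimiter every slash in the pattern must be escaped
with_escapes=$(printf '%s\n' "$path" | sed 's/\/usr\/local/\/opt/')
# With # as delimiter the slashes can be written as-is
with_hash=$(printf '%s\n' "$path" | sed 's#/usr/local#/opt#')
echo "$with_escapes"
echo "$with_hash"
```

Both forms produce the same result; the second is just easier to read.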

>cat file
the black cat was chased by the brown dog
>sed -e 's/black/white/g' file
the white cat was chased by the brown dog
>cat file
the black cat was chased by the brown dog.
the black cat was not chased by the brown dog.
>sed -e '/not/s/black/white/g' file
the black cat was chased by the brown dog.
the white cat was not chased by the brown dog.

In this example, the substitution is only applied to lines matching the regular expression ‘not’. This is the address we talked about previously.

>cat file 
line 1 (one)
line 2 (two)
line 3 (three)

# Example 4a
>sed -e '1,2d' file
line 3 (three)
# Example 4b
>sed -e '3d' file
line 1 (one)
line 2 (two)
# Example 4c
>sed -e '1,2s/line/LINE/' file
LINE 1 (one)
LINE 2 (two)
line 3 (three)
# Example 4d
>sed -e '/^line.*one/s/line/LINE/' -e '/line/d' file
LINE 1 (one)
# Example 4e
>sed -e '/^line.*(.\{3\})/s/line/LINE/' -e '/line/d' file
LINE 1 (one)
LINE 2 (two)

Explanation :

1. (a) We remove anything within the range 1,2 leaving only the third line.
2. (b) We remove the third line explicitly.
3. (c) We substitute the word ‘line’ for the uppercase ‘LINE’ for any line in the range 1,2.
4. (d) We search for any line that begins with ‘line’, followed by any characters of undetermined length until the word ‘one’. We substitute the word ‘line’ for ‘LINE’ for any match, then delete any line that still features the lowercase ‘line’.
5. (e) We search for any line starting with ‘line’ with an undetermined number of characters in between until an opening parenthesis. Within the parentheses, there must be precisely three characters followed by a closing parenthesis. The word ‘line’ is subbed for ‘LINE’ in any matches, and, as in (d), any line still featuring the lowercase ‘line’ is then deleted.

>cat file

hello
this text is wiped out
Wiped out
hello (also wiped out)
WiPed out TOO!
goodbye
(1) This text is not deleted
(2) neither is this ... ( goodbye )
(3) neither is this
hello but this is 
and so is this 
and unless we find another g**dbye
every line to the end of the file gets deleted

>sed -e '/hello/,/goodbye/d' file

(1) This text is not deleted
(2) neither is this ... ( goodbye )
(3) neither is this

Let’s look at addressing. When we are dealing with two regular expressions as the address it’s a tad more complicated than the rather simple idea of line numbers. In this case the range starts at a line containing ‘hello’ and ends at the next line containing ‘goodbye’. Those two lines and all lines between are deleted. So far so good. Why then are the lines after the numbered list deleted? Sed finds the word ‘hello’ again in ‘hello but this is’, which opens a new range, but does not find another ‘goodbye’, so all lines to the end of the file are deleted.
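One way to check what a range will swallow before deleting anything is to print the range instead. The sketch below uses a shortened, made-up version of the file; the -n option (suppress automatic printing) and the p command (print) are not introduced above and are used here only for the preview.

```shell
sample=$(printf '%s\n' 'hello' 'inside' 'goodbye' 'kept' 'hello again' 'tail')
# Preview: print only the lines the range address selects
preview=$(printf '%s\n' "$sample" | sed -n '/hello/,/goodbye/p')
# The actual delete: everything the preview showed disappears
remaining=$(printf '%s\n' "$sample" | sed '/hello/,/goodbye/d')
printf '%s\n' "$preview"
printf '%s\n' "$remaining"
```

The second ‘hello’ opens a range that is never closed, so the preview runs to the last line and only ‘kept’ survives the delete.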

Backreferences in Sed & More advanced SED

The Quit Command

The quit, or q, command simply stops execution. This leads us on to subroutines…
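As a quick sketch of q on its own (the input lines are made up): with a line number address it stops after that line, and with a regular expression address it stops at the first matching line. In both cases the current line is still printed before sed exits.

```shell
# Stop after the second line
first_two=$(printf '%s\n' a b c d | sed 2q)
# Stop at the first line matching 'beta'
up_to_match=$(printf '%s\n' alpha beta gamma | sed '/beta/q')
printf '%s\n' "$first_two" "$up_to_match"
```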

Subroutines

In sed, curly braces, {} are used to group commands. They are used as follows: address1[, address2]{ commands }
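Before the larger script, a minimal sketch of grouping may help: two substitutions share one address range, so both apply to lines 2 and 3 only. The input is made up for illustration.

```shell
grouped=$(printf '%s\n' 'cat dog' 'cat dog' 'cat dog' | sed '2,3{
s/cat/CAT/
s/dog/DOG/
}')
printf '%s\n' "$grouped"
```

Without the braces, each command would need its own copy of the 2,3 address.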

Let’s explore these concepts with an example. The first thing to note is the need to utilise a shell script. We use double quotes (previously it has been single quotes) in order to use shell variables. Let’s take a look :

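#!/bin/sh
# Alternation with \| is a GNU sed extension to basic regular expressions.
X='word1\|word2\|word3\|word4\|word5'
# Delete non-matching lines; on the first matching line keep only the match, then quit.
sed -e "
/$X/!d
/$X/{
    s/\($X\).*/\1/
    s/.*\($X\)/\1/
    q
    }" "$1"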

An important note: it is tempting to think of this :

s/\($X\).*/\1/
s/.*\($X\)/\1/

as redundant, and to try and shorten it with this : s/.*\($X\).*/\1/. This is unlikely to work. Why? Suppose we have a line ‘word1 word2 word3’. We have no way of knowing which of ‘word1’, ‘word2’ or ‘word3’ $X is going to match, so when we quote it (\1), we don’t know what we are quoting. What makes sure there are no such problems in the correct implementation is this :

> the * operator is greedy. That is, when there is ambiguity as to what (expression)* can match, it tries to match as much as possible.

So in the example, s/\($X\).*/\1/, .* tries to swallow as much of the line as possible. In particular, should the line look like this, ‘word1 word2 word3’, then we can be sure that .* matches ‘ word2 word3’ and hence $X matches ‘word1’.
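The effect of greediness can be seen by running both forms on the sample line. In this sketch the pattern word[0-9] is a stand-in for $X, chosen so that the example does not depend on the GNU-only \| alternation.

```shell
line='word1 word2 word3'
# One-step form: the leading .* swallows as much as it can,
# so the group ends up capturing the LAST word
one_step=$(printf '%s\n' "$line" | sed 's/.*\(word[0-9]\).*/\1/')
# Two-step form: the match starts at the leftmost word and the
# trailing .* swallows the rest, so the FIRST word is kept
two_step=$(printf '%s\n' "$line" | sed -e 's/\(word[0-9]\).*/\1/' -e 's/.*\(word[0-9]\)/\1/')
echo "$one_step"
echo "$two_step"
```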

Pattern Matching Across More than 1 Line

Suppose we want to replace all instances of ‘Microsoft Windows 95’ with ‘Linux’. Our first attempt is this, s/Microsoft Windows 95/Linux/g. Unfortunately the script fails if the phrase is split across lines in our file, for example with ‘Microsoft’ on one line and ‘Windows 95’ on the next, since neither line matches the pattern ‘Microsoft Windows 95’.

The first thing we can try is the multiline next or N command.

> The next command N appends the next line to the pattern space.

So our second attempt is this :

N
N
s/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g

One can see the use of the escape sequences \t and \n. These refer to tabs and newlines respectively. What happens when we feed this script the following file :

Foo
Microsoft
Windows
95

Why does this break?

1. First, it reads the line ‘Foo’ into the pattern space.
2. It sees the N command and appends line 2 to the pattern space. The pattern space now looks like : Foo\nMicrosoft
3. Executing the second N command, it reads line 3 into the pattern space. At this stage, the pattern space looks like this : Foo\nMicrosoft\nWindows
4. Now the script runs the substitute command on Foo\nMicrosoft\nWindows. This doesn’t match the search pattern, so no substitution is performed.
5. Since the end of the script is reached, the contents of the pattern space are written to STDOUT, and the script starts again from its first command.
6. The last line of the file, ‘95’, is read into pattern space.

> This is the main error in the script : once the end of the script is reached, the first line that *has not been read into the pattern space already* is read. It is NOT true that the Nth iteration of the script reads from the Nth line of the file.

The N command that follows fails because there is no next line to append, and the script exits without writing ‘95’ to STDOUT. (GNU sed is an exception : it prints the pattern space before exiting in this situation.)

So there are two things to be learned from this :

* Each line of the file is read in exactly once. After you read a line into pattern space, you cannot read it again.
* It’s good practice to use $!N in place of N to avoid errors, since the N command doesn’t make sense on the last line of a file.

A better version is as follows :

/Microsoft[ \t]*$/{
    N
    }
/Microsoft[ \t\n]*Windows[ \t]*$/{
    N
    }
s/Microsoft[ \t\n]*Windows[ \t\n]*95/Linux/g

This only reads extra lines into the pattern space when necessary.
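To see the difference in practice, the naive and the improved scripts can be run against the four-line file from above. One deliberate change in this sketch: the [ \t\n] classes are swapped for the POSIX class [[:space:]], which also matches spaces, tabs and embedded newlines but does not rely on \t and \n being understood inside brackets. The variable names are made up for illustration.

```shell
# Naive version: two unconditional N commands
naive=$(printf '%s\n' Foo Microsoft Windows 95 |
  sed -e 'N' -e 'N' -e 's/Microsoft[[:space:]]*Windows[[:space:]]*95/Linux/g')

# Improved version: only pull in the next line when the phrase is
# visibly cut off at the end of the pattern space
improved='/Microsoft[[:space:]]*$/{
N
}
/Microsoft[[:space:]]*Windows[[:space:]]*$/{
N
}
s/Microsoft[[:space:]]*Windows[[:space:]]*95/Linux/g'

split=$(printf '%s\n' Foo Microsoft Windows 95 | sed "$improved")
same_line=$(printf '%s\n' 'Microsoft Windows 95 rocks' | sed "$improved")
printf '%s\n' "$split" "$same_line"
```

The naive output never contains ‘Linux’, while the improved script handles both the split phrase and the single-line phrase.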

Removing Text Between Matching Pairs of Delimiters

Suppose we want to eliminate all text enclosed by a matching pair of delimiters. This is a problem that comes up frequently, for example when removing HTML tags from HTML documents. We will use < and > in this example. So the task then is to eliminate anything between matching pairs of these brackets. Our first attempt is shown as follows : s/<[^>]*>//g.

But this might break: the angle brackets might span more than one line, or they may be nested angle brackets. Actually, the latter is unlikely to happen if the html is correct. Let’s attempt another version :

:top
/<.*>/{
s/<[^<>]*>//g
t top
}
/</{
    N
    b top
    }

A fine point: why didn’t we replace the third line of the script with s/<[^>]*>//g and remove the t command that follows? Well, consider this sample file :

<<hello>
hello>

The desired output would be empty, since everything is enclosed in angled brackets. However, with that change the output would still contain hello>, since the whole first line matches the expression <[^>]*> and is removed on its own, leaving the second line untouched. So the point is that we have set up the script to recursively remove the contents of the innermost matching pair of delimiters.
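The script can be tried on the nested sample and on a tag that spans two lines. This is a sketch: the script is held in a shell variable named strip_tags (a made-up name), and the second input is invented for illustration.

```shell
strip_tags=':top
/<.*>/{
s/<[^<>]*>//g
t top
}
/</{
N
b top
}'

# Nested brackets split over two lines: everything should vanish
nested=$(printf '%s\n' '<<hello>' 'hello>' | sed "$strip_tags")
# A tag that opens on one line and closes on the next
multiline=$(printf '%s\n' '<b>bold</b> and <a' 'href="x">link</a>' | sed "$strip_tags")
printf '[%s]\n' "$nested" "$multiline"
```

In the nested case the inner pair is removed first, the leftover < triggers N, and the outer pair is then removed on the next pass.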

Replacing in-place

SED usually makes its changes/substitutions and outputs to stdout. Perhaps one would like to change the source material in-place? We have a flag for that: -i. Let’s see what the man pages tell us :

> Edit files in-place, saving backups with the specified extension

Very interesting. How about we try it on for size :

>ls
greetings.txt
>cat greetings.txt
hello
hi there
>sed -i .bak 's/hello/bonjour/' greetings.txt
>ls
greetings.txt
greetings.txt.bak
>cat greetings.txt
bonjour
hi there
>cat greetings.txt.bak
hello
hi there

Try it now! You see how it changed while the backup serves its purpose. Yay!

Let’s Get l33t

There’s some more in the man entry about the -i flag :

> If a zero-length extension is given, no backup will be saved. It is not recommended to give a zero-length extension when in-place editing files, as you risk corruption or partial content in situations where disk space is exhausted, etc.

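Hmm. So if we do sed -i '' 's/hello/bonjour/' greetings.txt we can write over the original file. Cool. Note that passing the extension as a separate argument is the BSD/macOS form that the man page quoted here describes; GNU sed expects the suffix attached to the flag (-i.bak), and a bare -i for no backup.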
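For a form that should behave the same on both BSD and GNU sed, the backup suffix can be attached directly to the flag. The sketch below works in a throwaway directory created with mktemp so that nothing real is overwritten; the variable names are made up.

```shell
workdir=$(mktemp -d)
printf 'hello\nhi there\n' > "$workdir/greetings.txt"

# Suffix attached to -i: edit in place and keep greetings.txt.bak
sed -i.bak 's/hello/bonjour/' "$workdir/greetings.txt"

edited=$(cat "$workdir/greetings.txt")
backup=$(cat "$workdir/greetings.txt.bak")
rm -rf "$workdir"

printf '%s\n' "$edited"
printf '%s\n' "$backup"
```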
