Sep 29, 2010, Greg Grothaus | Link
In grad school, I once saw a prof I was working with grab a text file and in seconds manipulate it into little pieces so deftly it blew my mind. I immediately decided it was time for me to learn awk, which he had so clearly mastered.
To this day, 90% of the programmers I talk to have never used awk. Knowing 10% of awk’s already small syntax, which you can pick up in just a few minutes, will dramatically increase your ability to quickly manipulate data in text files. Below I’ll teach you the most useful stuff - not the “fundamentals”, but the 5 minutes worth of practical stuff that will get you most of what I think is interesting in this little language.
Awk is a fun little programming language. It is designed for processing input strings. A (different) prof once asked my networking class to implement code that would take a spec for an RPC service and generate stubs for the client and the server. This professor made the mistake of telling us we could implement this in any language. I decided to write the generator in Awk, mostly as an excuse to learn more Awk. Surprisingly to me, the code ended up much shorter and much simpler than it would have been in any other language I’ve ever used (Python, C++, Java, …). There is enough to learn about Awk to fill half a book, and I’ve read that book, but you’re unlikely to be writing a full-fledged spec parser in Awk. Instead, you might want to do things like find all of your log lines that come from ip addresses whose components sum up to 666, for kicks and grins. Read on!
For our examples, assume we have a little file (logs.txt) that looks like the one below. If it wraps in your browser, this is just 2 lines of logs each starting with an ip address.
07.46.199.184 [28/Sep/2010:04:08:20] "GET /robots.txt HTTP/1.1" 200 0 "msnbot"
123.125.71.19 [28/Sep/2010:04:20:11] "GET / HTTP/1.1" 304 - "Baiduspider"
These are just two log records generated by Apache, slightly simplified, showing Bing and Baidu bots wandering around on my site.
Awk works like anything else (e.g. grep) on the command line. It reads from stdin and writes to stdout. It’s easy to pipe stuff in and out of it. The command line syntax you care about is just the command awk followed by a string that contains your program.
awk '{print $0}'
Most Awk programs will start with a “{” and end with a “}”. Everything in between gets run once on each line of input. Most awk programs will print something. The program above prints the entire line it just read (print appends a newline). $0 is the entire line. So this program is an identity operation - it copies the input to the output without changing it.
Awk parses the line into fields for you automatically, using any whitespace (space, tab) as a delimiter, merging consecutive delimiters. Those fields are available to you as the variables $1, $2, $3, etc.
echo 'this is a test' | awk '{print $3}' # prints 'a'
awk '{print $1}' logs.txt
# Output:
# 07.46.199.184
# 123.125.71.19
Easy so far, and already useful. Sometimes, though, I need to print from the end of the string instead. The special variable NF contains the number of fields in the current line. I can print the last field by printing $NF, or I can manipulate that value to identify a field based on its position from the end. I can also print multiple values simultaneously in the same print statement.
echo 'this is a test' | awk '{print $NF}' # prints "test"
awk '{print $1, $(NF-2) }' logs.txt
# Output:
# 07.46.199.184 200
# 123.125.71.19 304
More progress - you can see how, in moments, you could strip this log file down to just the fields you are interested in. Another useful variable is NR, which is the number of the row currently being processed. While demonstrating NR, let me also show you how to format a little bit of output using print. Commas between arguments in a print statement put spaces between them; leave out the comma and no space is inserted.
awk '{print NR ") " $1 " -> " $(NF-2)}' logs.txt
# Output:
# 1) 07.46.199.184 -> 200
# 2) 123.125.71.19 -> 304
Powerful, but nothing hard yet, I hope. By the way, there is also a printf function that works much the way you’d expect if you prefer more formatting options. Now, not all files have fields that are separated with whitespace. Let’s look at the date field:
awk '{print $2}' logs.txt
# Output:
# [28/Sep/2010:04:08:20]
# [28/Sep/2010:04:20:11]
The date field is separated by “/” and “:” characters. I could do the following within one awk program, but I want to teach you simple things that you can string together using more familiar unix piping, because a small syntax is quicker to pick up. What I’m going to do is pipe the output of the above command through another awk program that splits on the colon. To do this, my second program needs two {} components. I won’t go into exactly what these mean, just show you how to use them for splitting on a different delimiter.
awk '{print $2}' logs.txt | awk 'BEGIN{FS=":"}{print $1}'
# Output:
# [28/Sep/2010
# [28/Sep/2010
I just specified a different FS (field separator) of “:” and printed the first field. No more times, just dates! The simplest way to get rid of that leading [ character is with sed, which you are likely already familiar with:
awk '{print $2}' logs.txt | awk 'BEGIN{FS=":"}{print $1}' | sed 's/\[//'
# Output:
# 28/Sep/2010
# 28/Sep/2010
I can further split this on the “/” character using the exact same trick, but I think you get the point. Next, let’s learn a tiny bit of logic. If I want to return only the 200 status lines, I could use grep, but I might match an ip address that contains 200, or a date from the year 2000. I could first grab the status field with Awk and then grep, but then I’d lose the rest of the line’s context. Awk supports basic if statements. Let’s see how I might use one:
awk '{if ($(NF-2) == "200") {print $0}}' logs.txt
# Output:
# 07.46.199.184 [28/Sep/2010:04:08:20] "GET /robots.txt HTTP/1.1" 200 0 "msnbot"
There we go - only the lines (in this case just one) with a 200 status. The if syntax should be very familiar and require no explanation. Let me finish up by showing you one simple example of awk code that maintains state across multiple lines. Let’s say I want to sum up all of the status fields in this file. I can’t think of a reason I’d want to do this for statuses in a log file, but it makes a lot of sense in other cases, like summing up the total bytes returned across all of the logs in a day. To do this, I just create a variable, which automatically persists across multiple lines:
awk '{a+=$(NF-2); print "Total so far:", a}' logs.txt
# Output:
# Total so far: 200
# Total so far: 504
Nothing to it. Obviously, in most cases I’m not interested in the cumulative values but only the final value. I can of course just use tail -n1, but I can also print stuff after processing the final line using an END clause:
awk '{a+=$(NF-2)}END{print "Total:", a}' logs.txt
# Output:
# Total: 504
If you want to read more about awk, there are several good books and plenty of online references. You can learn just about everything there is to know about awk in a day, with some time to spare. Getting used to it is a bit more of a challenge, as it really is a slightly different way to code - you are essentially writing only the inner part of a for loop. Come to think of it, this is a lot like how MapReduce feels, which is also initially disorienting.
I hope some of that was useful. If you found it to be so, let me know; I enjoy the feedback if nothing else.
Daniel Robbins | Updated October 11, 2013 - Published December 1, 2000 | Link
In this series of articles, I’m going to turn you into a proficient awk coder. I’ll admit, awk doesn’t have a very pretty or particularly “hip” name, and the GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with the language may hear “awk” and think of a mess of code so backwards and antiquated that it’s capable of driving even the most knowledgeable UNIX guru to the brink of insanity (causing him to repeatedly yelp “kill -9!” as he runs for the coffee machine).
Sure, awk doesn’t have a great name. But it is a great language. Awk is geared toward text processing and report generation, yet offers many well-designed features that allow for serious programming. And, unlike some languages, awk’s syntax is familiar, borrowing some of the best parts of languages like C, python, and bash (although, technically, awk was created before both python and bash). Awk is one of those languages that, once learned, will become a key part of your strategic coding arsenal.
Let’s go ahead and start playing around with awk to see how it works. At the command line, enter the following command:
awk '{ print }' /etc/passwd
You should see the contents of your /etc/passwd file appear before
your eyes. Now, for an explanation of what awk did. When we called awk,
we specified /etc/passwd as our input file. When we executed awk, it
evaluated the print command for each line in /etc/passwd, in order. All
output is sent to stdout, and we get a result identical to catting
/etc/passwd. Now, for an explanation of the { print } code
block. In awk, curly braces are used to group blocks of code together,
similar to C. Inside our block of code, we have a single print command.
In awk, when a print command appears by itself, the full contents of the
current line are printed.
Here is another awk example that does exactly the same thing:
awk '{ print $0 }' /etc/passwd
In awk, the $0 variable represents the entire current
line, so print and print $0 do exactly the
same thing. If you’d like, you can create an awk program that will
output data totally unrelated to the input data. Here’s an example:
awk '{ print "" }' /etc/passwd
Whenever you pass the "" string to the print command, it prints a blank line. If you test this script, you’ll find that awk outputs one blank line for every line in your /etc/passwd file. Again, this is because awk executes your script for every line in the input file. Here’s another example:
awk '{ print "hiya" }' /etc/passwd
Running this script will fill your screen with hiya’s. 🙂
Awk is really good at handling text that has been broken into multiple logical fields, and allows you to effortlessly reference each individual field from inside your awk script. The following script will print out a list of all user accounts on your system:
awk -F":" '{ print $1 }' /etc/passwd
Above, when we called awk, we used the -F option to specify ":" as the
field separator. When awk processes the print $1 command,
it will print out the first field that appears on each line in the input
file. Here’s another example:
awk -F":" '{ print $1 $3 }' /etc/passwd
Here’s an excerpt of the output from this script:
halt7
operator11
root0
shutdown6
sync5
bin1
....etc.
As you can see, awk prints out the first and third fields of the
/etc/passwd file, which happen to be the username and uid fields
respectively. Now, while the script did work, it’s not perfect — there
aren’t any spaces between the two output fields! If you’re used to
programming in bash or python, you may have expected the
print $1 $3 command to insert a space between the two
fields. However, when two strings appear next to each other in an awk
program, awk concatenates them without adding an intermediate space. The
following command will insert a space between both fields:
awk -F":" '{ print $1 " " $3 }' /etc/passwd
When you call print this way, it’ll concatenate $1,
" ", and $3, creating readable output. Of
course, we can also insert some text labels if needed:
awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd
This will cause the output to be:
username: halt uid:7
username: operator uid:11
username: root uid:0
username: shutdown uid:6
username: sync uid:5
username: bin uid:1
....etc.
Passing your scripts to awk as a command line argument can be very handy for small one-liners, but when it comes to complex, multi-line programs, you’ll definitely want to compose your script in an external file. Awk can then be told to source this script file by passing it the -f option:
awk -f myscript.awk myfile.in
Putting your scripts in their own text files also allows you to take advantage of additional awk features. For example, this multi-line script does the same thing as one of our earlier one-liners, printing out the first field of each line in /etc/passwd:
BEGIN {
FS=":"
}
{ print $1 }
The difference between these two methods has to do with how we set the field separator. In this script, the field separator is specified within the code itself (by setting the FS variable), while our previous example set FS by passing the -F":" option to awk on the command line. It’s generally best to set the field separator inside the script itself, simply because it means you have one less command line argument to remember to type. We’ll cover the FS variable in more detail later in this article.
Normally, awk executes each block of your script’s code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because the BEGIN block is evaluated before awk starts processing the input file, it’s an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you’ll reference later in the program.
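To make that concrete, here is a small sketch of my own (the sample input is invented, not from the article) using a BEGIN block to print a heading before any input is processed:

```shell
# BEGIN runs exactly once, before the first input line is read -
# a natural place for a heading or for initializing variables.
printf 'alice 1\nbob 2\n' | awk 'BEGIN { print "name id" } { print $1, $2 }'
# prints:
# name id
# alice 1
# bob 2
```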
Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.
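And a matching sketch for END (again my own example, not the article's): with no per-line block at all, awk still reads every line, so an END block alone reproduces wc -l:

```shell
# END runs once after the last line; NR then holds the total record count.
printf 'one\ntwo\nthree\n' | awk 'END { print NR, "lines" }'
# prints: 3 lines
```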
Awk allows the use of regular expressions to selectively execute an
individual block of code, depending on whether or not the regular
expression matches the current line. Here’s an example script that
outputs only those lines that contain the character sequence
foo:
/foo/ { print }
Of course, you can use more complicated regular expressions. Here’s a script that will print only lines that contain a floating point number:
/[0-9]+\.[0-9]*/ { print }
There are many other ways to selectively execute a block of code. We
can place any kind of boolean expression before a code block to control
when a particular block is executed. Awk will execute a code block only
if the preceding boolean expression evaluates to true. The following
example script will output the third field of all lines that have a
first field equal to fred. If the first field of the
current line is not equal to fred, awk will continue
processing the file and will not execute the print
statement for the current line:
$1 == "fred" { print $3 }
Awk offers a full selection of comparison operators, including the
usual “==”, “<”, “>”, “<=”, “>=”, and “!=”. In addition, awk
provides the “~” and “!~” operators, which mean “matches” and “does not
match”. They’re used by specifying a variable on the left side of the
operator, and a regular expression on the right side. Here’s an example
that will print only the third field on the line if the fifth field on
the same line contains the character sequence root:
$5 ~ /root/ { print $3 }
Awk also offers very nice C-like if statements. If you’d like, you
could rewrite the previous script using an if
statement:
{
if ( $5 ~ /root/ ) {
print $3
}
}
Both scripts function identically. In the first example, the boolean
expression is placed outside the block, while in the second example, the
block is executed for every input line, and we selectively perform the
print command by using an if statement. Both methods are
available, and you can choose the one that best meshes with the other
parts of your script.
Here’s a more complicated example of an awk if
statement. As you can see, even with complex, nested conditionals,
if statements look identical to their C counterparts:
{
if ( $1 == "foo" ) {
if ( $2 == "foo" ) {
print "uno"
} else {
print "one"
}
} else if ($1 == "bar" ) {
print "two"
} else {
print "three"
}
}
Using if statements, we can also transform this code:
! /matchme/ { print $1 $3 $4 }
to this:
{
if ( $0 !~ /matchme/ ) {
print $1 $3 $4
}
}
Both scripts will output only those lines that don’t contain
a matchme character sequence. Again, you can choose the
method that works best for your code. They both do the same thing.
Awk also allows the use of boolean operators “||” (for “logical or”) and “&&”(for “logical and”) to allow the creation of more complex boolean expressions:
( $1 == "foo" ) && ( $2 == "bar" ) { print }
This example will print only those lines where field one equals
foo and field two equals bar.
So far, we’ve either printed strings, the entire line, or specific fields. However, awk also allows us to perform both integer and floating point math. Using mathematical expressions, it’s very easy to write a script that counts the number of blank lines in a file. Here’s one that does just that:
BEGIN { x=0 }
/^$/ { x=x+1 }
END { print "I found " x " blank lines. :)" }
In the BEGIN block, we initialize our integer variable x
to zero. Then, each time awk encounters a blank line, awk will execute
the x=x+1 statement, incrementing x. After all
the lines have been processed, the END block will execute, and awk will
print out a final summary, specifying the number of blank lines it
found.
One of the neat things about awk variables is that they are “simple and stringy.” I consider awk variables “stringy” because all awk variables are stored internally as strings. At the same time, awk variables are “simple” because you can perform mathematical operations on a variable, and as long as it contains a valid numeric string, awk automatically takes care of the string-to-number conversion steps. To see what I mean, check out this example:
x="1.01"
##We just set x to contain the *string* "1.01"
x=x+1
##We just added one to a *string*
print x
##Incidentally, these are comments :)
Awk will output: 2.01
Interesting! Although we assigned the string value 1.01 to the
variable x, we were still able to add one to it. We wouldn’t be able to
do this in bash or python. First of all, bash doesn’t support floating
point arithmetic. And, while bash has “stringy” variables, they aren’t
“simple”; to perform any mathematical operations, bash requires that we
enclose our math in an ugly $(( )) construct. If we were
using python, we would have to explicitly convert our 1.01
string to a floating point value before performing any arithmetic on it.
While this isn’t difficult, it’s still an additional step. With awk,
it’s all automatic, and that makes our code nice and clean. If we wanted
to square and add one to the first field in each input line, we would
use this script: { print ($1^2)+1 }
If you do a little experimenting, you’ll find that if a particular variable doesn’t contain a valid number, awk will treat that variable as a numerical zero when it evaluates your mathematical expression.
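A quick sketch of that coercion rule (the input line is my own invention):

```shell
# "hello" is not a valid number, so awk treats it as 0 in arithmetic.
echo 'hello 5' | awk '{ print $1 + $2 }'
# prints: 5
```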
Another nice thing about awk is its full complement of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk allows us to use the previously demonstrated exponent operator “^”, the modulo (remainder) operator “%”, and a bunch of other handy assignment operators borrowed from C.
These include pre- and post-increment/decrement (i++, --foo), add/sub/mult/div assign operators (a+=3, b*=2, c/=2.2, d-=6.2). But that’s not all - we also get handy modulo/exponent assign ops as well (a^=2, b%=4).
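To see a couple of the less common ones in action, here is a throwaway BEGIN block (my sketch, not the article's):

```shell
# a %= 4 is shorthand for a = a % 4; b ^= 3 for b = b ^ 3.
awk 'BEGIN { a = 7; a %= 4; b = 2; b ^= 3; print a, b }'
# prints: 3 8
```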
Awk has its own complement of special variables. Some of them allow you to fine-tune how awk functions, while others can be read to glean valuable information about the input. We’ve already touched on one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk expects to find between fields. When we were using /etc/passwd as input, FS was set to “:”. While this did the trick, FS allows us even more flexibility.
The FS value is not limited to a single character; it can also be set
to a regular expression, specifying a character pattern of any length.
If you’re processing fields separated by one or more tabs, you’ll want
to set FS like so: FS="\t+".
Above, we use the special “+” regular expression character, which means “one or more of the previous character”.
If your fields are separated by whitespace (one or more spaces or
tabs), you may be tempted to set FS to the following regular expression:
FS="[[:space:]]+".
While this assignment will do the trick, it’s not necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean “one or more spaces or tabs.” In this particular example, the default FS setting was exactly what you wanted in the first place!
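You can verify the default behavior with a quick sketch (the sample input is mine):

```shell
# With FS at its default, leading blanks are ignored and any run of
# spaces or tabs counts as a single separator.
printf '   alpha\t\tbeta  gamma\n' | awk '{ print NF, $2 }'
# prints: 3 beta
```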
Complex regular expressions are no problem. Even if your fields are
separated by the word “foo,” followed by three digits, the following
regular expression will allow your data to be parsed properly:
FS="foo[0-9][0-9][0-9]".
The next two variables we’re going to cover are not normally intended to be written to, but are normally read and used to gain useful information about the input. The first is the NF variable, also called the “number of fields” variable. Awk will automatically set this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:
NF == 3 { print "this particular record has three fields: " $0 }
Of course, you can also use the NF variable in conditional statements, as follows:
{
if ( NF > 2 ) {
print $1 " " $2 ":" $3
}
}
The record number (NR) is another handy variable. It will always
contain the number of the current record (awk counts the first record as
record number 1). Up until now, we’ve been dealing with input files that
contain one record per line. For these situations, NR will also tell you
the current line number. However, when we start to process multi-line
records later in the series, this will no longer be the case, so be
careful! NR can be used like the NF variable to print only certain lines
of the input:
(NR < 10) || (NR > 100) { print "We are on record number 1-9 or 101+" }
Another example:
{
#skip header
if ( NR > 10 ) {
print "ok, now for the real information!"
}
}
Awk provides additional variables that can be used for a variety of purposes. We’ll cover more of these variables in later articles. We’ve come to the end of our initial exploration of awk. As the series continues, I’ll demonstrate more advanced awk functionality, and we’ll end the series with a real-world awk application. In the meantime, if you’re eager to learn more, check out the resources listed below.
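As a parting sketch of my own that ties several of these pieces together (BEGIN, FS, a pattern, and END) on some inline /etc/passwd-style data:

```shell
# Count records whose last field (the login shell) is /bin/bash.
printf 'root:x:0:0::/root:/bin/bash\ndaemon:x:1:1::/usr/sbin:/usr/sbin/nologin\n' |
  awk 'BEGIN { FS=":" } $NF == "/bin/bash" { n++ } END { print n, "bash user(s)" }'
# prints: 1 bash user(s)
```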
AWK is a standard tool on every POSIX-compliant UNIX system. It’s like a stripped-down Perl, perfect for text-processing tasks and other scripting needs. It has a C-like syntax but without semicolons, manual memory management or static typing. It excels at text processing. You can call it from a shell script or you can use it as a stand-alone scripting language.
Why use AWK instead of Perl? Mostly because AWK is part of UNIX. You can always count on it, whereas Perl’s future is in question. AWK is also easier to read than Perl. For simple text-processing scripts, particularly ones that read files line by line and split on delimiters, AWK is probably the right tool for the job.
AWK programs consist of a collection of patterns and
actions. The most important pattern is called BEGIN.
Actions go into brace blocks. BEGIN will run at the
beginning of the program. It’s where you put all the preliminary set-up
code before you process any text files. If you have no text files, then
think of BEGIN as the main entry point. Variables are
global. Just set them or use them, no need to declare…
#!/usr/bin/awk -f
# Comments are like this
BEGIN {
count = 0
# Operators just like in C and friends
a = count + 1
b = count - 1
c = count * 1
d = count / 1 # division (always floating point in awk)
e = count % 1 # modulus
f = count ^ 1 # exponentiation
a += 1
b -= 1
c *= 1
d /= 1
e %= 1
f ^= 1
# Incrementing and decrementing by one
a++
b--
# As a prefix operator, it returns the incremented value
++a
--b
# Notice, also, no punctuation such as semicolons to terminate statements
# Control Statements
if ( count == 0 )
print "Starting with count of 0"
else
print "Huh?"
# Or you could use the ternary operator
print ( count == 0 ) ? "Starting with count of 0" : "Huh?"
# Blocks consisting of multiple lines use braces
while ( a < 10 ){
print "String concatenation is done" " with a series" " of" " space-separated strings"
print a
a++
}
for ( i = 0; i < 10; i++ )
print "Good ol' for loop"
# As for comparisons, they're the standards:
a < b # Less than
a <= b # Less than or equal
a != b # Not equal
a > b # Greater than
a >= b # Greater than or equal to
# Logical operators as well
a && b # AND
a || b # OR
# In addition, there's the super useful regular expression match
if ("foo" ~ "^fo+$")
print "Fooey!"
if ("boo" !~ "^fo+$")
print "Boo!"
# Arrays
arr[0] = "foo"
arr[1] = "bar"
# Unfortunately, there is no array literal syntax.
# You just have to assign each element like that.
# You also have associative arrays
assoc["foo"] = "bar"
assoc["bar"] = "baz"
# And multi-dimensional arrays, with some limitations I won't mention here
multidim[0,0] = "foo"
multidim[0,1] = "bar"
multidim[1,0] = "baz"
multidim[1,1] = "boo"
# You can use the "in" operator to traverse the keys of an array
for (key in assoc)
print assoc[key]
# The command line is in a special array called ARGV
for (argnum in ARGV)
print ARGV[argnum]
# You can remove elements of an array
# This is particularly useful to prevent AWK from assuming the arguments are files for it to process
delete ARGV[1]
# The number of command line arguments is in a variable called ARGC
print ARGC
AWK has several built-in functions. They fall into three categories. I’ll demonstrate each of them in their own functions defined later.
return_value = arithmetic_functions(a, b, c)
string_functions()
io_functions()
}
Probably the most annoying part of AWK is that there are no local variables. Everything is global. For short scripts this is fine, even useful, but for longer scripts it can be a problem. There is a work-around (ahem, hack): function arguments are local to the function, and AWK allows you to define more function arguments than it needs. So just stick local variables in the function declaration, as in the functions below. As a convention, stick in some extra whitespace to distinguish actual function parameters from local variables. In the next example, a, b and c are actual parameters while d is merely a local variable.
# Here's how you define a function
function arithmetic_functions(a, b, c, d){
# Now to demonstrate the arithmetic functions
# Most AWK implementations have some standard trig functions
localvar = sin(a)
localvar = cos(a)
localvar = atan2(a, b) # arc tangent of a / b
# And logarithmic stuff
localvar = exp(a)
localvar = log(a)
# Square root
localvar = sqrt(a)
# Truncate floating point to integer
localvar = int(5.34) # localvar => 5
# Random numbers
srand() # Supply a seed as an argument. By default, it uses the time of day
localvar = rand() # Random number between 0 and 1
# Here's how to return a value
return localvar
}
AWK, being a string-processing language, has several string-related functions, many of which rely heavily on regular expressions.
function string_functions( localvar, arr){
# Search and replace, first instance (sub) or all instances (gsub)
# Both return number of matches replaced
localvar = "fooooobar"
sub("fo+", "Meet me at the ", localvar) # localvar => "Meet me at the bar"
gsub("e+", ".", localvar) # "m..t m. at th. bar"
# Search for a string that matches a regular expression
# index() does the same thing, but doesn't allow a regular expression
match(localvar, "t") # => 4, since the 't' is the fourth character
# Split on a delimiter
split("foo-bar-baz", arr, "-") # arr => ["foo", "bar", "baz"]
# Other useful stuff
sprintf("%s %d %d %d", "Testing", 1, 2, 3) # => "Testing 1 2 3"
substr("foobar", 2, 3) # => "oob"
substr("foobar", 4) # => "bar"
length("foo") # => 3
tolower("FOO") # => "foo"
toupper("foo") # => "FOO"
}
AWK doesn’t have file handles, per se. It will automatically open a file handle when you use something that needs one. The string you used for this can then be treated as a file handle for purposes of I/O. This makes it feel sort of like shell scripting:
function io_functions( localvar){
# You've already seen print
print "Hello world"
# There's also printf
printf("%s %d %d %d\n", "Testing", 1, 2, 3)
print "foobar" > "/tmp/foobar.txt"
# Now the string "/tmp/foobar.txt" is a file handle. You can close it:
close("/tmp/foobar.txt")
# Here's how you run something in the shell
system("echo foobar") # => prints foobar
# Reads a line from standard input and stores in localvar
getline localvar
# Reads a line from a pipe
"echo foobar" | getline localvar # localvar => "foobar"
close("echo foobar")
# Reads a line from a file and stores in localvar
getline localvar <"/tmp/foobar.txt"
close("/tmp/foobar.txt")
}
As I said at the beginning, AWK programs consist of
a collection of patterns and actions. You’ve already seen the
all-important BEGIN pattern. Other patterns are used only if
you’re processing lines from files or standard input.
When you pass arguments to AWK, they are treated as
file names to process. It will process them all, in order. Think of it
like an implicit for loop, iterating over the lines in these files.
These patterns and actions are like switch statements inside the loop.
/^fo+bar$/ { this action will execute for every line that
matches the regular expression and will be skipped for any line that
fails to match it. Let’s just print the line: print.
Whoa, no argument! That’s because print has a default argument:
$0. This is the current line being processed.
It is created automatically for you. You can probably guess there are
other $ variables. Every line is implicitly split before
every action is called, much like the shell does. And, like the shell,
each field can be accessed with a dollar sign.
print $2, $4 will print the second and fourth fields.
AWK automatically defines many other variables to help
you inspect and process each line. The most important one is
NF. print NF prints the number of fields on
this line whilst print $NF prints the last field.
} remember to close the action! Recall that we started
with /^fo+bar$/ {.
Every pattern is actually a boolean test. The regular expression in
the last pattern is also a boolean test, but part of it was hidden. If I
don’t give it a string to test, it assumes $0, the line
that it’s currently processing. Thus, the complete version of it is
this:
$0 ~ /^fo+bar$/ {
print "Equivalent to the last pattern"
}
a > 0 {
# This will execute once for each line, as long as a is positive
}
You get the idea. Processing text files, reading in a line at a time and doing something with it, particularly splitting on a delimiter, is so common in UNIX that AWK is a scripting language that does all of it for you, without you needing to ask. All you have to do is write the patterns and actions based on what you expect of the input, and what you want to do with it.
Here’s a quick example of a simple script, the sort of thing AWK is perfect for. It will read a name from standard input and then will print the average age of everyone with that first name. Let’s say you supply as an argument the name of this data file:
Bob Jones 32
Jane Doe 22
Steve Stevens 83
Bob Smith 29
Bob Barker 72
Here’s the script:
BEGIN {
# First, ask the user for the name
print "What name would you like the average age for?"
# Get a line from standard input, not from files on the command line
getline name < "/dev/stdin"
}
# Now, match every line whose first field is the given name
$1 == name {
# Inside here, we have access to a number of useful variables, already
# pre-loaded for us:
# $0 is the entire line
# $3 is the third field, the age, which is what we're interested in here
# NF is the number of fields, which should be 3
# NR is the number of records (lines) seen so far
# FILENAME is the name of the file being processed
# FS is the field separator being used, which is " " here
# …etc. There are plenty more, documented in the man page.
# Keep track of a running total and how many lines matched
sum += $3
nlines++
}
Another special pattern is called END. It will run after
processing all the text files. Unlike BEGIN, it will only
run if you’ve given it input to process. It will run after all the files
have been read and processed according to the rules and actions you’ve
provided. The purpose of it is usually to output some kind of final
report, or do something with the aggregate of the data you’ve
accumulated over the course of the script.
END {
if (nlines)
print "The average age for " name " is " sum / nlines
}
Fabrizio (Fritz) Stelluto
In this age of npm and github and easily available modules in any language of your choice, it is easy to forget the old Unix workhorses. Here’s a look at awk, a shell utility that allows you to treat and manipulate text files as if they were databases.
Awk is both the name of the command line utility, and
the language used for it. It was invented at Bell Labs at the peak of
punk rock, 1977, and its name is simply the initials of its three
creators. Awk reads input (a file, or a stream) one line (one “record”)
at a time, splits it into fields by blank space (these are all
defaults that can be changed), and then uses the instructions in the awk
language to manipulate these fields and generate some output. The
ability to read files as streams is a big plus - it means the memory
footprint is the same if you read a file of 1Kb or 200Tb; for a larger
file it will just take longer.
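A minimal sketch of that record/field model (the sample data is invented) - each line is a record, split on blanks into $1, $2, and so on:

```shell
# Swap the first two fields of every record as it streams through
printf '1977 awk\n1991 python\n' | awk '{ print $2, $1 }'
# prints: awk 1977
#         python 1991
```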
Awk comes standard with OS X and virtually every other Unix-like system. There is another widespread variant - gawk, GNU awk. It is arguably better than the original, because it offers array sort and length functions, the ability to include files, and more flexible rules for splitting input into fields. Here I will limit myself to the standard awk.
Here’s what the simplest awk program looks like - this is basically
cat
# awk loads the short program {print} and waits for the user to type stuff
> awk '{print}'
# as you type, the shell prints out what you are typing. Awk is waiting
# for a <RETURN> outside a ''
It was a bright cold day in April, and the clocks were striking thirteen.
# now awk kicks in and runs the program on the input
# {print} simply prints the input line as it is, so here it is again
It was a bright cold day in April, and the clocks were striking thirteen.
The strong point of awk is that it automatically splits
lines of text as if they were “columns” in a spreadsheet and assigns
each column to a variable (a “field”). Then you can manipulate them and
spit them out
# awk loads a slightly more complex program and waits
> awk '{print $3 ": " $1 + $2}'
# waiting for a <RETURN> outside a ''
10 20 Toronto
# this line is split into 3 "columns", and
# 10 is assigned to $1, 20 to $2, and Toronto to $3
# then the program {print $3 ": " $1 + $2} is run - it adds $1 + $2 and
# prints the result out, with some extra text (the :)
Toronto: 30
# now it waits for the next line
20 30 Miami
# same program run on it
Miami: 50
Despite its simplicity, you can take awk quite far - for example creating a random sci-fi plot generator.
Running awk on STDIN is not very useful, but of course you can use Unix magic to redirect the input and / or output of the program
# awk will treat the second argument as a path to a file to read from
> awk '{print}' some_data.txt
... # prints whatever was in some_data.txt
>
# exactly the same thing but done differently - redirecting file to STDIN
> awk '{print}' < some_data.txt
... # prints whatever was in some_data.txt
>
# you can read several files, in order
> awk '{print}' some_data.txt more_data.txt
...
>
# now the processed data goes to a separate file
> awk '{print}' some_data.txt > result.txt
>
# the awk program itself can be loaded to a file - here this file is created
> echo '{print}' > awk.txt
# passing the command on to awk with the -f option
> awk -f awk.txt some_data.txt > result.txt
>
# mixing STDIN with files. The "-" is substituted by STDIN, which is dealt with
# after some_data.txt
> ls -l | awk '{print}' some_data.txt - more_data.txt
> ... # prints all lines from some_data.txt
> ... # prints result of ls -l (this is the "-")
> ... # prints all lines from more_data.txt
>
# pass some text into awk, then run an awk program on it
> echo '1 2 3' | awk '{print}'
> 1 2 3
>
# using the curl util to download a csv file, piping it to awk, and running
# the simple awk program on it
> curl http://is.gd/eUrbOZ | awk '{print}'
Forename,Surname,Description on ballot paper,Constituency Name,PANo,Votes,Share
... # etc
So far the awk examples consisted of simple one-liners - but awk programs can consist of several instructions ("actions"). You can still write them out on the shell:
# note: the ">" is added automatically when hitting return inside a '',
# and the space between > and { was added manually to make it line up
> awk '{print}
> {print}
> {print}'
# now that the closing ' was typed, awk kicks in. this programs simply
# prints out whatever you type three times
oh # typed by you
oh # printed by awk 3 times
oh
oh
In this tutorial I will put the awk program in its own file and load it from the command line - just to make formatting easier and to allow comments. The file loaded here has the suffix ".awk" but that's irrelevant - it could be any filename.
> awk -f example.awk some_input_text.txt
An awk program consists of a list of actions, one after the other, and typically one per line (they can be broken up though). There are two special types of actions - BEGIN actions are executed only once, before the text is scanned, and END only once, afterwards. All other actions are executed in order on every line of text. Assume your input file includes increasing integers, one per line
1
2
3
Then the program below
BEGIN { print "START!" }
{print "--------------"}
{print}
{print}
{print}
END { print "END!" }
Would produce
START!
--------------
1
1
1
--------------
2
2
2
--------------
3
3
3
END!
Note that actions can be in any order (they will be executed in the order they are written) and there can be multiple BEGIN and END actions, so the following is also a legal program.
{print "--------------"}
{print}
END { print "END!" }
BEGIN { print "START!" }
{print}
END { print "Copyright 2005" }
{print}
The program above is handled like this: all BEGIN actions run first, in source order; then the remaining actions run, in source order, on every line of input; finally all END actions run, again in source order.
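A quick sketch confirming the ordering - BEGIN first, per-line actions next, END last, wherever they appear in the program text:

```shell
# The actions are deliberately out of order in the program text
printf '1\n' | awk 'END { print "END!" } BEGIN { print "START!" } { print }'
# prints: START!
#         1
#         END!
```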
Inside the actions awk offers what most programming languages offer - variables, loops, tests, etc.
Awk follows Unix conventions on most things, so in case of doubt whatever works in Bash scripts tends to work.
# a 'normal' one-liner
BEGIN { print "START" }
# you can add newlines for formatting - this is equivalent to the above
BEGIN {
print "START"
}
-> START
-> START
# as in Bash scripts, you can use the semicolon to separate multiple statements
# on the same line...
BEGIN { print "STA"; print "RT";}
# or you can write them one per line, with or without semicolon
BEGIN { print "STA"
print "RT" }
-> STA
RT
-> STA
RT
Awk makes several variables available to programs - some are loaded when the program is launched, some are updated with each line read, and some are created by the program itself.
Whenever awk reads a line, it splits it into "fields" by whitespace (this is the default and can be overridden). Then each field is copied to a variable $1, $2, … in order - there is no limit. Additionally, $0 contains the whole line.
# assume this file
1 2 3 4 5 6 7
# the following two lines are equivalent
{ print }
{ print $0 }
-> 1 2 3 4 5 6 7
-> 1 2 3 4 5 6 7
# only prints some fields we are interested in
{ print $1 " " $3 }
-> 1 3
The field number doesn't have to be a constant - it can be an expression or a variable. For example, the global variable NF contains the number of the last field and is updated with every line read. So if there are 7 fields, NF will be 7, and $NF will be $7, i.e. the last field.
# assume this file
1 2 3 4 5 6 7
# both mean first and last field - but the first version only works if there are
# 7 fields, the second always works
{print $1 " " $7}
{print $1 " " $NF}
-> 1 7
-> 1 7
#
# print the last two fields
{print $(NF-1) " " $NF}
-> 6 7
Another useful global variable that gets updated for each record is NR - the record number.
# feed a four line input into awk
> echo 'a
> b
> c
> d' | awk '{print NR ") " $1}'
# it prints the line number, ), and the first (and only) field
1) a
2) b
3) c
4) d
You can assign to a field variable with the '=' operator, thereby changing the record content:
# adding something to a field - only works if it's a number
{$3 = ($3 + 100)
# now print the updated line
print $0}
-> 1 2 103 4 5 6 7
If you assign to a field variable that doesn't exist, it will be added to the record.
# the record only contains $1 and $2;
> echo 1 2 | awk '{print $0}'
> 1 2
>
# the program adds two new fields
> echo 1 2 | awk '{$3 = 3; $4 = 4; print $0}'
> 1 2 3 4
A few variables are set when the program is launched. Here's a very short list - if you need to play with these you probably want to get yourself a book on awk.
| Variable | |
|---|---|
| ARGV | array of command line arguments |
| ARGC | number of command line arguments |
| ENVIRON | associative array with environment. Depends on system |
| FILENAME | self explanatory |
To create your own variables, just start assigning to them with the '=' operator - awk will initialize them to an empty string (which becomes a 0 if used in a numeric context). The type of a variable is dynamic and can vary during its lifetime.
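A minimal sketch of that initialization rule:

```shell
# x has never been assigned: it is "" as a string and 0 as a number
awk 'BEGIN { print "[" x "]"; print x + 1 }'
# prints: []
#         1
```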
In the example below, awk is used on the ls command to find the total size of a folder.
# ls -la returns listings in the form:
# -rw-rw-r-- 1 gotofritz staff 1513 Dec 15 2013 .bash_profile
# awk simply collects each filesize and adds it to a running total,
# then prints it at the end
> ls -la | awk ' { total += $5 }
END { print total }'
-> 158448
Awk has associative arrays, similar to PHP's or JavaScript's. You create an array by using it - no need to initialize it.
# assume this file
10 Life changes fast
20 Life changes in the instant
30 You sit down to dinner and life as you know it ends
40 The question of self-pity
# creates my_array and inserts the rest of each line into it
# (substr strips the first field and the following space - assuming a
# single space after the first field)
{ my_array[$1] = substr($0, length($1) + 2) }
# creates - note that array is sparse
my_array[10] = "Life changes fast"
my_array[20] = "Life changes in the instant"
my_array[30] = "You sit down to dinner and life as you know it ends"
my_array[40] = "The question of self-pity"
# string keys are also possible
{ my_array["name"] = "Homer" }
One thing that is different in awk is that multidimensional arrays use a single set of square brackets to wrap both indices.
# assume this file
dad homer
mum marge
son bart
# creates a two dimensional array
{ family["simpsons",$1] = $2 }
-> creates
family["simpsons","dad"] = "homer"
family["simpsons","mum"] = "marge"
family["simpsons","son"] = "bart"
Note that arrays in awk are pretty awkward. There are no built-in functions to deal with them except for the for … in loop. If you need to sort, or even just find out the length, you'll have to write your own functions. Alternatively, use gawk, which has better array handling.
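A small sketch of both points - building an associative array just by using it, and rolling your own length with for … in (variable names invented):

```shell
# Count how often each word occurs, then count the distinct words
printf 'red\nblue\nred\nred\n' | awk '
  { seen[$1]++ }
  END {
    n = 0
    for (word in seen) n++
    print "distinct words: " n
    print "red appears " seen["red"] " times"
  }'
# prints: distinct words: 2
#         red appears 3 times
```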
A regular expression (regexp) is a mini programming language which is used to describe variable strings; it is embedded in most programming languages. Regexps are enclosed in slashes and use a combination of literal characters and punctuation to describe strings. The operator ~ is used to match a regexp, and !~ to ensure it is not matched.
Regular expressions are a complicated topic of their own; here is just a quick introduction
# print all lines with "gmail" in the 1st field
{ if ($1 ~ /gmail/) print}
# prints all lines EXCEPT those with "gmail"
{ if ($1 !~ /gmail/) print}
# ^ indicates start of string.
# this matches "tom" "tomato" but not "atom"
{ if ($1 ~ /^tom/) print}
# $ indicates end of string
# this matches "tom", "atom" but not "tomato"
{ if ($1 ~ /tom$/) print}
# this matches "tom", not "atom" and not "tomato"
{ if ($1 ~ /^tom$/) print}
# . matches any character.
# this matches "bear" "boar" but not "bar"
{ if ($1 ~ /b..r/) print}
# [ABC] matches one character from the set "A", "B", "C"
# this matches "boar" "bear" but not "blar"
{ if ($1 ~ /b[oe]ar/) print}
# [^ABC] matches one character which is anything except "A", "B", "C"
# this matches "blar" but neither "boar" nor "bear"
{ if ($1 ~ /b[^oe]ar/) print}
# (abc) groups the expression abc as a unit.
# | is an "or"
# \ is used to scape special characters, i.e. treat them as normal characters
# in this case we want to treat the '.' as a period and not "any character"
# the following matches @gmail.com or @yahoo.com
{ if ($1 ~ /@(gmail|yahoo)\.com/) print}
# * means repeat zero or more. + is repeat once or more. ? is repeat 0 or 1
{ if ($1 ~ /<[^>]+>[^<]*<\/[^>]+>\.?/) print }
# the following matches < followed by one or more (+) of anything except >, then >
# then zero or more (*) of anything except <
# then </ followed by one or more (+) of anything except >, then >
# then an optional .
Awk has the usual loops and conditionals familiar from C. Braces are optional for single nested statements
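A runnable sketch combining a conditional with the ~ operator (the sample words are invented):

```shell
# ^tom. requires "tom" at the start followed by at least one more character
printf 'tom\ntomato\natom\n' | awk '{ if ($1 ~ /^tom./) print $1 }'
# prints: tomato
```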
# braces are optional for single statements
for (name in list_of_names)
print name
for (capital_city in country) {
print capital_city
}
# but needed for multiple statements
if (NR % 2 == 0) {
$2 = $1 * 2
print $0
}
Awk doesn't have booleans. Instead it treats the number 0 or the empty string "" as false, and any other value (including the string "0") as true. The comparison operators are the familiar ones, with a double equal sign for equality, plus the tilde ~ and !~ for regular expression matching, and "in" for array existence
{ if ($1 == "full") ... }
{ if ($2 < 0.5) ... }
{ if ($0 ~ /Republican/) print $0 } ... # matches regexp
{ if ($1 !~ /Completed/) print $0 } ... # rejects regexp
{ if (capital_city in country) print country[capital_city] }
Awk has both for and while loops (including do-while). Additionally, there is the for-in loop for sparse arrays
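The truthiness rules above are easy to check directly:

```shell
# Only the string "0" passes: 0 and "" are false, any non-empty string is true
awk 'BEGIN {
  if (0)   print "the number 0 is true"
  if ("")  print "the empty string is true"
  if ("0") print "the string \"0\" is true"
}'
# prints: the string "0" is true
```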
# assume file
1 10 100
2 20 200
# both these programs will print each line with fields back-to-front
# while loop version...
{ i = NF
line = ""
while (i) {
line = line " " $i
i--
}
print line
}
# for loop version
{ line = ""
for (i=NF; i>0; i--) {
line = line " " $i
}
print line
}
-> 100 10 1
200 20 2
# puts each line of input into the array (NR is the record number)
{ lines[NR] = $0 }
# at the end prints all the lines - note that for-in order is not guaranteed
END {
for (line in lines)
print line
}
break and continue statements are available to exit a loop prematurely or skip an iteration, respectively. next is used to stop processing a record and move on to the next one
{ if ($5 == "") next }
{ print $5 $4 }
The usual maths operators can be used: +, -, /, *, ++, --, plus % for modulus and ^ for exponentiation. Unary + converts to a number
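A quick sketch of % and of forcing a string into a number:

```shell
# 7 % 3 is the remainder; adding 0 converts the string "3.5kg" to 3.5
awk 'BEGIN { print 7 % 3; print "3.5kg" + 0 }'
# prints: 1
#         3.5
```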
echo "1
> 2
> 3
> 4" | awk '{print $1 ^ 2}'
1
4
9
16
Concatenating strings in awk is slightly weird. There is no string
concatenation operator; you just put the strings next to each other. Because
of that it is recommended to use parentheses except in trivial cases.
Alternatively, print can take multiple comma separated
arguments - and they will be printed with a space separating them
# assume this file
1
2
3
4
# the strings ($1..) and "a" are concatenated (no space between them) and
# the resulting string is passed to print
{print ($1+2) "a"}
3a
4a
5a
6a
# two separate strings are passed to print - a space is put between them
{print ($1+2), "a"}
3 a
4 a
5 a
6 a
# string concatenation works for variables too
{ something = $1 "--"
print something }
1--
2--
3--
4--
There are a number of built-in functions: numeric ones like cosine,
square root, random; string functions like print or string length; time
functions and bitwise functions. You can easily find out what they are
by looking at the output of man awk.
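A few of them in a runnable sketch - these are all standard awk functions:

```shell
awk 'BEGIN {
  print length("awkward")        # number of characters: 7
  print substr("awkward", 1, 3)  # first three characters: awk
  print index("awkward", "w")    # position of the first "w": 2
  print toupper("awk")           # AWK
  print sqrt(49)                 # 7
}'
```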
Worth noting that besides print awk also offers
printf, i.e. “print formatted”. Printf is common to many
Unix tools and languages. You give a string with some placeholders and
rules, and then you pass variables to “plug in” those placeholders. The
important thing is the rules, which control things like right alignment,
decimal precision, zero padding for numbers, etc. A statement looks like
this:
{ printf "%-10s %07.3f%% \n", $1, $2 }
# placeholders start with %
# %-10s is a string (s), and is left aligned (-) within a field 10 spaces wide (10)
# %07.3f is a decimal number or float (f); the total length has to be at least 7
# characters (7), it is padded with zeroes if too short (0), and it has
# 3 decimals (3), leaving at least 3 digits for the integer part (7 - 3 decimals - the point)
# %% if you want to print an actual %, you need to type it twice %%
# \n you need to supply the new line manually
You can define functions anywhere in your code, outside actions. They are pretty similar to JavaScript's.
# define function outside rules - could be at the bottom of the file
function my_func(field_content) {
print "FIELD: " field_content
}
# now use in rules
{my_func($1)}
Previously I described an awk program as a series of actions, with the special case of BEGIN and END. That's not entirely correct. An awk program consists of a sequence of actions and optional patterns; BEGIN and END are two special patterns. Incidentally, gawk also has a BEGINFILE and ENDFILE, for when processing more than one file at a time.
BEGIN and END are special because they identify actions which are not executed for every line of input, but before or after the whole run. The other patterns are evaluated on every line to determine whether the action should be run for that particular line or not. Patterns are expressions that return false (i.e., 0 or "") or true (anything else). When the pattern returns true, the rule is executed.
Regular expressions can be used as patterns; they match the entire line. An exclamation mark reverses the match. Boolean operators can be used to combine patterns
# for lines with an email address, print the third field
# (very lazy match - will only work if all email addresses are well formed)
/@/ { print $3}
# prints all lines except those with a gmail address
! /@gmail\./ { print $0 }
# prints lines with an @ and the sequence 0160
/@/ && /0160/ { print }
The regular expressions above are a shortcut for
$0 ~ /pattern/, i.e. “apply the regexp to whole line”.
Similar rules can be made for individual fields…
# matches the regexp on one field only
$1 ~ /Anthony/ { print }
…and so can all the expressions seen so far
# print even lines
NR % 2 == 0 { print }
# print only if length of 1st field is greater than 3
# length is a string function mentioned above
length($1) > 3 { print }
The reason we have been able to run programs without patterns is
because there is a special pattern, the empty pattern, which matches
every line. In fact we could have a program which is just a pattern; the
default action {print} would be executed.
# prints whole line, default action
$1 == "complete"
By default awk treats each line as a record. In reality what it does is split the input by a record separator, stored in the variable RS, which happens to default to the newline character. You can change that in an awk program.
# separate records by semicolon
> echo "1 2 3;4 5 6;7 8 9" | awk 'BEGIN {RS = ";" }
> {print}'
1 2 3
4 5 6
7 8 9
Something similar is possible with the field separator, which is
stored in the variable FS. By default it is equivalent to the regexp
[ \t\n]+, i.e. any number of consecutive spaces of any type.
Note that in reality awk cheats - leaving FS default doesn’t just mean
setting it to [ \t\n]+, but also trimming $0 of leading and
trailing empty space before processing it.
# separate fields by comma
> echo "1,2,3
4,5,6
7,8,9" | awk 'BEGIN {FS = "," }
> {print}'
1,2,3
4,5,6
7,8,9
You can combine the two together if, for example, your data has one field per line and records are separated by blank lines - an empty RS means "any number of consecutive newlines"
# assume this data
homer simpson
dad
marge simpson
mum
# separate records by any number of newlines, and have one field per line
BEGIN {RS=""; FS="\n"}
{ print $1 " (" $2 ")" }
-> homer simpson (dad)
marge simpson (mum)
A field separator (but not a record separator) can also be passed to an awk program in two ways. First of all, awk has a special option for it, -F (note that there is no space between it and the separator). And awk allows passing variables with the -v syntax, so you could just pass FS that way.
# change separator from within program
BEGIN {FS = "," }
# pass separator with special option -F - note that you don't need quotes
> echo "1,2,3
> 4,5,6" | awk -F, '{print}'
1,2,3
4,5,6
# pass separator as external var with -v
> echo "1,2,3
> 4,5,6" | awk -v FS="," '{print}'
1,2,3
4,5,6
# in fact you can pass any variable of your choice with -v
> echo "" | awk -v WHAT="grow up" '{print "All children, except one, " WHAT}'
All children, except one, grow up
The naive approach would be to simply set FS="," - but that doesn't cover the fact that some fields are surrounded by quotation marks and others aren't, and sometimes you have newlines and / or commas inside a field. Here are some example scripts people have put together to solve these issues. They are also good examples of fairly complex awk scripts.
Personally I think that’s taking things too far - if you have to force awk to create arrays to store manipulated record fragments you may as well use a fully fledged scripting language.
Another approach is to use gawk, and its FPAT variable
One area where standard awk falls short is dealing with input files from your standard desktop applications - basically, CSV files from Office. There was no CSV standard until recently, and the CSV generated by MS Office doesn't work too well with awk. The main issue is that CSV is basically a rubbish format which suffers from a few problems: the comma is too common a character to be used as a separator (why didn't they choose tab?), newlines are saved without being converted to a safe sequence, and sometimes fields are surrounded by quotes and sometimes they aren't.
But if you have a “well behaved” CSV file, i.e. one which doesn’t
have commas, quotation marks, or new lines inside fields, e.g
UK,London,10000 then you can easily process it by passing
-F"," to the awk call:
awk -F"," -f my_awk_script.awk some_input_data.txt
In practice, unless you generate the data yourself, there is always going to be the odd comma or quotation mark in your data somewhere; the safest and most reliable course of action is to use tab as the separator. I use a free online CSV to TSV converter, then call awk with "\t" as the separator.
The command below is what I use - my awk program is in
my_awk_script.awk, the data is in
uk_electoral_data_converted.csv, and the results goes into
awk_output.txt.
awk -F"\t" -f my_awk_script.awk uk_electoral_data_converted.csv > awk_output.txt
Some of the scripts will use data from the 2015 UK election in CSV format, converted to TSV. Here's what it looks like:
Forename Surname Description on ballot paper Constituency Name PANO Votes Share (%) Change FIELD9 Incumbent? FIELD11 Constituency ID Region ID County Region Country Constituency type Party name identifier Party abbreviation
Gerald Howarth The Conservative Party Candidate Aldershot 7 23369 50.6 3.9 MP E14000530 E12000008 Hampshire South East England Borough Conservative Con
Gary Puffett Labour Party Aldershot 7 8468 18.3 6.2 E14000530 E12000008 Hampshire South East England Borough Labour Lab
Bill Walker UK Independence Party (UKIP) Aldershot 7 8253 17.9 13.4 E14000530 E12000008 Hampshire South East England Borough UK Independence Party UKIP
...
And here are the field names in order
1 Forename
2 Surname
3 Description on ballot paper
4 Constituency Name
5 PANO
6 Votes
7 Share (%)
8 Change
9 --
10 Incumbent?
11 --
12 Constituency ID
13 Region ID
14 County
15 Region
16 Country
17 Constituency type
18 Party name identifier
19 Party abbreviation
# put NR > 1 in front of every action to skip the header row
NR > 1 { print }
# result
Gerald Howarth The Conservative Party Candidate Aldershot ...
...
# skip empty records - NF is 0 for an empty line
NF { print }
NR > 1 && NF { print $4 ": " $2 " " $1 " (" $NF ") " $7 "% " }
## NR > 1 ignore header
## NF ignore empty record
# print constituency name, name surname, party abbreviation, share of the vote
# ignore other fields
Aldershot: Howarth Gerald (Con) 50.6%
Aldershot: Puffett Gary (Lab) 18.3%
Aldershot: Walker Bill (UKIP) 17.9%
Aldershot: Hilliar Alan (LD) 8.8%
Aldershot: Hewitt Carl (Green) 4.4%
...
Find the national Conservative vote
$NF == "Con" {total += $6}
END {print total}
# $NF == "Con" if a record concerns a tory vote
# {total += $6} add it to a running total
## END when all records are processed
# {print total} output total
11299609
Find the total vote of the 6 largest parties and their % of the national vote
# keep a running total
NR > 1 && NF {total += $6}
# keep a total for each party - don't do anything yet
$NF == "Con" {total_con += $6}
$NF == "Lab" {total_lab += $6}
$NF == "UKIP" {total_ukip += $6}
$NF == "LD" {total_ld += $6}
$NF == "Green" {total_green += $6}
$NF == "SNP" {total_snp += $6}
# print a report at the end
END {
print "TOTAL: " total
print "Con: " (100 * total_con / total) "%"
print "Lab: " (100 * total_lab / total) "%"
print "UKIP: " (100 * total_ukip / total) "%"
print "LD: " (100 * total_ld / total) "%"
print "SNP: " (100 * total_snp / total) "%"
print "Green: " (100 * total_green / total) "%"
}
# output:
TOTAL: 30697255
Con: 36.8098%
Lab: 30.449%
UKIP: 12.6431%
LD: 7.87014%
SNP: 4.738%
Green: 3.77112%
The same as above, but without copy-and-paste code
# abstracting copy-and-paste code into a function
function print_party_percentage(party_name, party_vote, total_vote) {
print party_name " " (100 * party_vote / total_vote) "%"
}
# same program as before
NR > 1 && NF {total += $6}
$NF == "Con" {total_con += $6}
$NF == "Lab" {total_lab += $6}
$NF == "UKIP" {total_ukip += $6}
$NF == "LD" {total_ld += $6}
$NF == "Green" {total_green += $6}
$NF == "SNP" {total_snp += $6}
END {
print "TOTAL: " total
print_party_percentage("Con", total_con, total)
print_party_percentage("Lab", total_lab, total)
print_party_percentage("UKIP", total_ukip, total)
print_party_percentage("LD", total_ld, total)
print_party_percentage("SNP", total_snp, total)
print_party_percentage("Green", total_green, total)
}
# output - looks messier because previous program was manually formatted
TOTAL: 30697255
Con 36.8098%
Lab 30.449%
UKIP 12.6431%
LD 7.87014%
SNP 4.738%
Green 3.77112%Same as above, but using printf for formatting
function print_party_percentage(party_name, party_vote, total_vote) {
printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
# %5s a string (s) of fixed width 5 or more (5) aligned right (if it was -5 it would be left)
# %4.1f a number (f) with one decimal (.1) and total width 4 (4) aligned right (4)
# %% an actual %
}
# same program as before
NR > 1 && NF {total += $6}
$NF == "Con" {total_con += $6}
$NF == "Lab" {total_lab += $6}
$NF == "UKIP" {total_ukip += $6}
$NF == "LD" {total_ld += $6}
$NF == "Green" {total_green += $6}
$NF == "SNP" {total_snp += $6}
END {
print "TOTAL: " total
print_party_percentage("Con", total_con, total)
print_party_percentage("Lab", total_lab, total)
print_party_percentage("UKIP", total_ukip, total)
print_party_percentage("LD", total_ld, total)
print_party_percentage("SNP", total_snp, total)
}
# output
TOTAL: 30697255
Con: 36.8%
Lab: 30.4%
UKIP: 12.6%
LD: 7.9%
SNP: 4.7%
There is still some copy-and-paste code because we are hardcoding the parties. We can use arrays to group whatever parties we find.
# same formatting function as before
function print_party_percentage(party_name, party_vote, total_vote) {
printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
}
# skip empty and header lines
NR > 1 && NF {
# running total
total += $6
# create or update running total for current party
party_totals[$NF] += $6
}
# when all records are processed
END {
print "TOTAL: " total
# print a line for each party
for (party in party_totals)
print_party_percentage(party, party_totals[party], total)
}
# output - there are LOTS of tiny local parties
TOTAL: 30697255
UUP: 0.4%
Left Unity - Trade Unionists and Socialists: 0.0%
IZB: 0.0%
Respect: 0.0%
SSP: 0.0%
NSW: 0.0%
The 30-50 Coalition: 0.0%
.... and so on
Oh - it turns out that if you include all the novelty parties there are 132 of them across the UK. We need to sort the array and only print the top X items. And it turns out that is quite complicated.
Standard awk's arrays are not sortable. This was a design choice - only associative arrays are supported, so there is no order, hence they can't be sorted in any meaningful way. gawk, however, has two array sorting functions - how do they do it? They actually create a new associative array, with all the values from the original but none of the keys; the keys are replaced by new ones, in order. Then you use a for loop (not the standard for-in) to read the array "in order". This is all well and good if you don't need the keys, but I do (they are the names of the parties). Besides, I am using awk and not gawk.
The best approach is to create a new array with just the keys, sort that array, and then loop through it in order to find out which keys of the original array to read.
# kickstarts the sort process -
# puts all the sorted keys into a separate array
function homebrew_asort(original, processed) {
# before we use the array we must be sure it is empty
empty_array(processed)
original_length = copy_and_count_array(original, processed)
qsort(original, processed, 1, original_length)
return original_length
}
# removes all values
function empty_array(A) {
for (i in A)
delete A[i]
}
# awk doesn't even have an array size function... you also have to roll your own
function copy_and_count_array(original, processed,    size, key) {
for (key in original) {
# awk doesn't seem to like array[0] - so we start from 1
size++
processed[size] = key
}
return size
}
## Adapted from a script from awk.info
# http://awk.info/?quicksort
function qsort(original, keys, left, right, i, last) {
if (left >= right) return
swap(keys, left, left + int( (right - left + 1) * rand() ) )
last = left
for (i = left+1; i <= right; i++)
if (original[keys[i]] < original[keys[left]])
swap(keys, ++last, i)
swap(keys, left, last)
qsort(original, keys, left, last-1)
qsort(original, keys, last+1, right)
}
function swap(A, i, j, t) {
t = A[i]; A[i] = A[j]; A[j] = t
}
# same formatting function as before
function print_party_percentage(party_name, party_vote, total_vote) {
printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
}
# same main action as before
NR > 1 && NF {
total += $6
party_totals[$NF] += $6
}
# when all records are processed
END {
parties_count = homebrew_asort(party_totals, keys)
for (i = parties_count; i >= parties_count - 5; i--)
print_party_percentage(keys[i], party_totals[keys[i]], total)
}
And the output
Con: 36.8%
Lab: 30.4%
UKIP: 12.6%
LD: 7.9%
SNP: 4.7%
Green: 3.8%
You can easily mimic head -c or tail -c
with awk - if you really want to.
# head equivalent
$ awk '{print substr($0, 1, 32)}' xxx
$ head -c 32 xxx
# tail equivalent
$ awk 'END {print substr($0, length($0) - 31, 32)}' xxx
$ tail -c 32 xxx
But with awk you can also skip a few characters into a file
# no head or tail equivalent - print 32 characters starting at character 32
$ awk '{print substr($0, 32, 32)}' xxx
The awk.info website has some one-liners with extensive explanations
The gawk manual includes some one liners which are compatible with standard awk.
The Unix School: 10 examples to group data in a CSV or text file
With that, all the main awk topics have been touched on. If you want to go deeper I recommend The AWK Manual, or one of the O’Reilly books.
In this article, let us review the fundamental awk working methodology along with 7 practical awk print examples.
Note: Make sure you review our earlier Sed Tutorial Series.
Awk is a programming language which allows easy manipulation of structured data and the generation of formatted reports. Awk stands for the names of its authors: Aho, Weinberger, and Kernighan.
Awk is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that match the specified patterns and then performs the associated actions.
Some of the key features of Awk are:
Awk reads from a file or from its standard input, and outputs to its standard output. Awk does not get along with non-text files.
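Because awk reads standard input when no file is given, it drops straight into pipelines; a minimal sketch:

```shell
# awk reads standard input when no file name is given
printf 'one two\nthree four\n' | awk '{print $2}'
```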
awk '/search pattern1/ {Actions}
/search pattern2/ {Actions}' file
In the above awk syntax, the search pattern is a regular expression that selects input lines, and Actions are the statements to be executed on each line that matches.
Let us create an employee.txt file with the following content, which will be used in the examples mentioned below.
$cat employee.txt
100 Thomas Manager Sales $5,000
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
400 Nisha Manager Marketing $9,500
500 Randy DBA Technology $6,000
By default Awk prints every line from the file.
$ awk '{print;}' employee.txt
100 Thomas Manager Sales $5,000
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
400 Nisha Manager Marketing $9,500
500 Randy DBA Technology $6,000
In the above example no pattern is given, so the actions apply to every line. The print action without any argument prints the whole line by default, so it prints every line of the file. Actions have to be enclosed within braces.
$ awk '/Thomas/
> /Nisha/' employee.txt
100 Thomas Manager Sales $5,000
400 Nisha Manager Marketing $9,500
In the above example it prints all the lines which match ‘Thomas’ or ‘Nisha’. It has two patterns. Awk accepts any number of patterns, but each set (a pattern and its corresponding actions) has to be separated by a newline.
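The same two-pattern selection can also be written as a single pattern using regex alternation; a quick sketch, assuming the same employee.txt:

```shell
# one regex with alternation instead of two separate patterns
awk '/Thomas|Nisha/' employee.txt
```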
Awk has a number of built-in variables. For each record, i.e. line, it splits the record on whitespace by default and stores the fields in the $n variables. If the line has 4 words, they will be stored in $1, $2, $3 and $4. $0 represents the whole line. NF is a built-in variable which holds the total number of fields in a record.
$ awk '{print $2,$5;}' employee.txt
Thomas $5,000
Jason $5,500
Sanjay $7,000
Nisha $9,500
Randy $6,000
$ awk '{print $2,$NF;}' employee.txt
Thomas $5,000
Jason $5,500
Sanjay $7,000
Nisha $9,500
Randy $6,000
In the above example $2 and $5 represent Name and Salary respectively. We can also get the Salary using $NF, where $NF represents the last field. In the print statement, the ‘,’ inserts the output field separator (a space by default) between the fields.
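NF itself can also be printed or tested directly; a small sketch, again assuming employee.txt:

```shell
# NF is the field count; $NF is the last field and $(NF-1) the one before it
awk '{print NF, $(NF-1), $NF}' employee.txt
```

Each line here has 5 fields, so this prints 5 followed by the department and salary columns.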
Awk has two important patterns which are specified by the keywords BEGIN and END.
BEGIN {Actions}
{ACTION} ## Action for everyline in a file
END {Actions}
## is for comments in Awk
Actions specified in the BEGIN section are executed before awk starts reading lines from the input. END actions are performed after all the lines from the input have been read and processed.
$ awk 'BEGIN {print "Name\tDesignation\tDepartment\tSalary";}
> {print $2,"\t",$3,"\t",$4,"\t",$NF;}
> END{print "Report Generated\n--------------";
> }' employee.txt
Name Designation Department Salary
Thomas Manager Sales $5,000
Jason Developer Technology $5,500
Sanjay Sysadmin Technology $7,000
Nisha Manager Marketing $9,500
Randy DBA Technology $6,000
Report Generated
--------------
In the above example, it prints a header line and a footer for the report.
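END is also the natural place for aggregates. As a sketch (assuming the same employee.txt), the salary column can be totalled once the ‘$’ and ‘,’ characters are stripped with gsub:

```shell
# gsub strips '$' and ',' in place so the last field can be summed as a number
awk '{ gsub(/[$,]/, "", $NF); total += $NF }
END { printf "Total salary: %d\n", total }' employee.txt
```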
$ awk '$1 >200' employee.txt
300 Sanjay Sysadmin Technology $7,000
400 Nisha Manager Marketing $9,500
500 Randy DBA Technology $6,000
In the above example, the first field ($1) is the employee id. So if $1 is greater than 200, the default print action prints the whole line.
Now the department name is available as the fourth field, so we need to check whether $4 matches the string “Technology”; if yes, print the line.
$ awk '$4 ~/Technology/' employee.txt
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
500 Randy DBA Technology $6,000
The ~ operator compares a field against a regular expression. If it matches, the default action, i.e. printing the whole line, is performed.
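Patterns can also be combined with && and ||. For instance, a sketch selecting Technology employees whose id is above 200 (again assuming employee.txt):

```shell
# combine a numeric comparison with a regex match
awk '$1 > 200 && $4 ~ /Technology/' employee.txt
```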
The example below checks whether the department is Technology; if it is, the action just increments the count variable, which was initialized to zero in the BEGIN section.
$ awk 'BEGIN { count=0;}
$4 ~ /Technology/ { count++; }
END { print "Number of employees in Technology Dept =",count;}' employee.txt
Number of employees in Technology Dept = 3
Then at the end of processing, the END block prints the value of count, which gives the number of employees in the Technology department.
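The counting pattern generalizes with an associative array: index count by the department field and you get a per-department head count in one pass. A sketch, assuming the same employee.txt (the iteration order of ‘for (d in count)’ is unspecified, hence the sort):

```shell
# count[] is indexed by department name; sort gives the output a stable order
awk '{ count[$4]++ }
END { for (d in count) print d, count[d] }' employee.txt | sort
```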