Sep 29, 2010, Greg Grothaus | Link
In grad school, I once saw a prof I was working with grab a text file and in seconds manipulate it into little pieces so deftly it blew my mind. I immediately decided it was time for me to learn awk, which he had so clearly mastered.
To this day, 90% of the programmers I talk to have never used awk. Knowing 10% of awk’s already small syntax, which you can pick up in just a few minutes, will dramatically increase your ability to quickly manipulate data in text files. Below I’ll teach you the most useful stuff - not the “fundamentals”, but the 5 minutes worth of practical stuff that will get you most of what I think is interesting in this little language.
Awk is a fun little programming language. It is designed for processing input strings. A (different) prof once asked my networking class to implement code that would take a spec for an RPC service and generate stubs for the client and the server. This professor made the mistake of telling us we could implement this in any language. I decided to write the generator in Awk, mostly as an excuse to learn more Awk. Surprisingly to me, the code ended up much shorter and much simpler than it would have been in any other language I’ve ever used (Python, C++, Java, …). There is enough to learn about Awk to fill half a book, and I’ve read that book, but you’re unlikely to be writing a full-fledged spec parser in Awk. Instead, you might want to do things like find all of your log lines that come from ip addresses whose components sum up to 666, for kicks and grins. Read on!
For our examples, assume we have a little file (logs.txt) that looks like the one below. If it wraps in your browser, this is just 2 lines of logs each starting with an ip address.
07.46.199.184 [28/Sep/2010:04:08:20] "GET /robots.txt HTTP/1.1" 200 0 "msnbot"
123.125.71.19 [28/Sep/2010:04:20:11] "GET / HTTP/1.1" 304 - "Baiduspider"
These are just two log records generated by Apache, slightly simplified, showing Bing and Baidu bots wandering around on my site.
Awk works like anything else (e.g. grep) on the command line. It reads from stdin and writes to stdout. It’s easy to pipe stuff in and out of it. The command line syntax you care about is just the command awk followed by a string that contains your program.
awk '{print $0}'
Most Awk programs will start with a “{” and end with a “}”. Everything in between gets run once on each line of input. Most awk programs will print something. The program above prints the entire line it just read (print appends a newline). $0 is the entire line. So this program is an identity operation - it copies the input to the output without changing it.
Awk parses the line into fields for you automatically, using any whitespace (space, tab) as a delimiter, merging consecutive delimiters. Those fields are available to you as the variables $1, $2, $3, etc.
echo 'this is a test' | awk '{print $3}' # prints 'a'
awk '{print $1}' logs.txt
# Output:
# 07.46.199.184
# 123.125.71.19
Easy so far, and already useful. Sometimes, though, I need to print from the end of the string instead. The special variable NF contains the number of fields in the current line. I can print the last field by printing $NF, or I can manipulate that value to identify a field based on its position from the end. I can also print multiple values simultaneously in the same print statement.
echo 'this is a test' | awk '{print $NF}' # prints "test"
awk '{print $1, $(NF-2) }' logs.txt
# Output:
# 07.46.199.184 200
# 123.125.71.19 304
More progress - you can see how, in moments, you could strip this log file down to just the fields you are interested in. Another useful variable is NR, which is the number of the row currently being processed. While demonstrating NR, let me also show you how to format a little bit of output using print. Commas between arguments in a print statement put spaces between them; leave out the comma and no space is inserted.
awk '{print NR ") " $1 " -> " $(NF-2)}' logs.txt
# Output:
# 1) 07.46.199.184 -> 200
# 2) 123.125.71.19 -> 304
Powerful, but nothing hard yet, I hope. By the way, there is also a printf function that works much the way you’d expect if you prefer more formatting options. Now, not all files have fields that are separated with whitespace. Let’s look at the date field:
awk '{print $2}' logs.txt
# Output:
# [28/Sep/2010:04:08:20]
# [28/Sep/2010:04:20:11]
The date field is separated by “/” and “:” characters. I could do the following within one awk program, but I want to teach you simple things that you can string together using more familiar unix piping, because a small syntax is quicker to pick up. What I’m going to do is pipe the output of the above command through another awk program that splits on the colon. To do this, my second program needs two {} components. I won’t go into exactly what these mean, just show you how to use them for splitting on a different delimiter.
awk '{print $2}' logs.txt | awk 'BEGIN{FS=":"}{print $1}'
# Output:
# [28/Sep/2010
# [28/Sep/2010
I just specified a different FS (field separator) of “:” and printed the first field. No more times, just dates! The simplest way to get rid of that leading [ character is with sed, which you are likely already familiar with:
awk '{print $2}' logs.txt | awk 'BEGIN{FS=":"}{print $1}' | sed 's/\[//'
# Output:
# 28/Sep/2010
# 28/Sep/2010
I can further split this on the “/” character using the exact same trick, but I think you get the point. Next, let’s learn a tiny bit of logic. If I want to return only the 200 status lines, I could use grep, but I might match an ip address that contains 200, or a date from the year 2000. I could first grab the status field with Awk and then grep, but then I’d lose the rest of the line’s context. Awk supports basic if statements. Let’s see how I might use one:
awk '{if ($(NF-2) == "200") {print $0}}' logs.txt
# Output:
# 07.46.199.184 [28/Sep/2010:04:08:20] "GET /robots.txt HTTP/1.1" 200 0 "msnbot"
There we go - only the lines (in this case just one) with a 200 status. The if syntax should be very familiar and require no explanation. Let me finish up by showing you one simple example of awk code that maintains state across multiple lines. Let’s say I want to sum up all of the status fields in this file. I can’t think of a reason I’d want to do this for statuses in a log file, but it makes a lot of sense in other cases, like summing up the total bytes returned across all of the logs in a day. To do this, I just create a variable, which automatically persists across multiple lines:
awk '{a+=$(NF-2); print "Total so far:", a}' logs.txt
# Output:
# Total so far: 200
# Total so far: 504
Nothing to it. Obviously, in most cases I’m not interested in the cumulative values but only the final value. I can of course just use tail -n1, but I can also print stuff after processing the final line using an END clause:
awk '{a+=$(NF-2)}END{print "Total:", a}' logs.txt
# Output:
# Total: 504
If you want to read more about awk, there are several good books and plenty of online references. You can learn just about everything there is to know about awk in a day, with some time to spare. Getting used to it is a bit more of a challenge, as it really is a slightly different way to code - you are essentially writing only the inner part of a for loop. Come to think of it, this is a lot like how MapReduce feels, which is also initially disorienting.
I hope some of that was useful. If you found it to be so, let me know; I enjoy the feedback if nothing else.
Daniel Robbins | Updated October 11, 2013 - Published December 1, 2000 | Link
In this series of articles, I’m going to turn you into a proficient awk coder. I’ll admit, awk doesn’t have a very pretty or particularly “hip” name, and the GNU version of awk, called gawk, sounds downright weird. Those unfamiliar with the language may hear “awk” and think of a mess of code so backwards and antiquated that it’s capable of driving even the most knowledgeable UNIX guru to the brink of insanity (causing him to repeatedly yelp “kill -9!” as he runs for the coffee machine).
Sure, awk doesn’t have a great name. But it is a great language. Awk is geared toward text processing and report generation, yet offers many well-designed features that allow for serious programming. And, unlike some languages, awk’s syntax is familiar, borrowing some of the best parts of languages like C, python, and bash (although, technically, awk was created before both python and bash). Awk is one of those languages that, once learned, will become a key part of your strategic coding arsenal.
Let’s go ahead and start playing around with awk to see how it works. At the command line, enter the following command:
awk '{ print }' /etc/passwd
You should see the contents of your /etc/passwd file appear before
your eyes. Now, for an explanation of what awk did. When we called awk,
we specified /etc/passwd as our input file. When we executed awk, it
evaluated the print command for each line in /etc/passwd, in order. All
output is sent to stdout, and we get a result identical to catting
/etc/passwd. Now, for an explanation of the { print } code
block. In awk, curly braces are used to group blocks of code together,
similar to C. Inside our block of code, we have a single print command.
In awk, when a print command appears by itself, the full contents of the
current line are printed.
Here is another awk example that does exactly the same thing:
awk '{ print $0 }' /etc/passwd
In awk, the $0 variable represents the entire current
line, so print and print $0 do exactly the
same thing. If you’d like, you can create an awk program that will
output data totally unrelated to the input data. Here’s an example:
awk '{ print "" }' /etc/passwd
Whenever you pass the "" string to the print command, it prints a blank line. If you test this script, you’ll find that awk outputs one blank line for every line in your /etc/passwd file. Again, this is because awk executes your script for every line in the input file. Here’s another example:
awk '{ print "hiya" }' /etc/passwd
Running this script will fill your screen with hiya’s. 🙂
Awk is really good at handling text that has been broken into multiple logical fields, and allows you to effortlessly reference each individual field from inside your awk script. The following script will print out a list of all user accounts on your system:
awk -F":" '{ print $1 }' /etc/passwd
Above, when we called awk, we used the -F option to specify ":" as the
field separator. When awk processes the print $1 command,
it will print out the first field that appears on each line in the input
file. Here’s another example:
awk -F":" '{ print $1 $3 }' /etc/passwd
Here’s an excerpt of the output from this script:
halt7
operator11
root0
shutdown6
sync5
bin1
....etc.
As you can see, awk prints out the first and third fields of the
/etc/passwd file, which happen to be the username and uid fields
respectively. Now, while the script did work, it’s not perfect — there
aren’t any spaces between the two output fields! If you’re used to
programming in bash or python, you may have expected the
print $1 $3 command to insert a space between the two
fields. However, when two strings appear next to each other in an awk
program, awk concatenates them without adding an intermediate space. The
following command will insert a space between both fields:
awk -F":" '{ print $1 " " $3 }' /etc/passwd
When you call print this way, it’ll concatenate $1,
" ", and $3, creating readable output. Of
course, we can also insert some text labels if needed:
awk -F":" '{ print "username: " $1 "\t\tuid:" $3 }' /etc/passwd
This will cause the output to be:
username: halt uid:7
username: operator uid:11
username: root uid:0
username: shutdown uid:6
username: sync uid:5
username: bin uid:1
....etc.
Passing your scripts to awk as a command line argument can be very handy for small one-liners, but when it comes to complex, multi-line programs, you’ll definitely want to compose your script in an external file. Awk can then be told to source this script file by passing it the -f option:
awk -f myscript.awk myfile.in
Putting your scripts in their own text files also allows you to take advantage of additional awk features. For example, this multi-line script does the same thing as one of our earlier one-liners, printing out the first field of each line in /etc/passwd:
BEGIN {
FS=":"
}
{ print $1 }
The difference between these two methods has to do with how we set the field separator. In this script, the field separator is specified within the code itself (by setting the FS variable), while our previous example set FS by passing the -F":" option to awk on the command line. It’s generally best to set the field separator inside the script itself, simply because it means you have one less command line argument to remember to type. We’ll cover the FS variable in more detail later in this article.
Normally, awk executes each block of your script’s code once for each input line. However, there are many programming situations where you may need to execute initialization code before awk begins processing the text from the input file. For such situations, awk allows you to define a BEGIN block. We used a BEGIN block in the previous example. Because the BEGIN block is evaluated before awk starts processing the input file, it’s an excellent place to initialize the FS (field separator) variable, print a heading, or initialize other global variables that you’ll reference later in the program.
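To make that concrete, here is a small sketch of my own (the sample input is invented, not from the article) using a BEGIN block to print a heading before any input is processed:

```shell
# BEGIN runs exactly once, before the first input line is read -
# a natural place for a heading or for initializing variables.
printf 'alice 1\nbob 2\n' | awk 'BEGIN { print "name id" } { print $1, $2 }'
# prints:
# name id
# alice 1
# bob 2
```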
Awk also provides another special block, called the END block. Awk executes this block after all lines in the input file have been processed. Typically, the END block is used to perform final calculations or print summaries that should appear at the end of the output stream.
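And a matching sketch for END (again my own example, not the article's): with no per-line block at all, awk still reads every line, so an END block alone reproduces wc -l:

```shell
# END runs once after the last line; NR then holds the total record count.
printf 'one\ntwo\nthree\n' | awk 'END { print NR, "lines" }'
# prints: 3 lines
```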
Awk allows the use of regular expressions to selectively execute an
individual block of code, depending on whether or not the regular
expression matches the current line. Here’s an example script that
outputs only those lines that contain the character sequence
foo:
/foo/ { print }
Of course, you can use more complicated regular expressions. Here’s a script that will print only lines that contain a floating point number:
/[0-9]+\.[0-9]*/ { print }
There are many other ways to selectively execute a block of code. We
can place any kind of boolean expression before a code block to control
when a particular block is executed. Awk will execute a code block only
if the preceding boolean expression evaluates to true. The following
example script will output the third field of all lines that have a
first field equal to fred. If the first field of the
current line is not equal to fred, awk will continue
processing the file and will not execute the print
statement for the current line:
$1 == "fred" { print $3 }
Awk offers a full selection of comparison operators, including the
usual “==”, “<”, “>”, “<=”, “>=”, and “!=”. In addition, awk
provides the “~” and “!~” operators, which mean “matches” and “does not
match”. They’re used by specifying a variable on the left side of the
operator, and a regular expression on the right side. Here’s an example
that will print only the third field on the line if the fifth field on
the same line contains the character sequence root:
$5 ~ /root/ { print $3 }
Awk also offers very nice C-like if statements. If you’d like, you
could rewrite the previous script using an if
statement:
{
if ( $5 ~ /root/ ) {
print $3
}
}
Both scripts function identically. In the first example, the boolean
expression is placed outside the block, while in the second example, the
block is executed for every input line, and we selectively perform the
print command by using an if statement. Both methods are
available, and you can choose the one that best meshes with the other
parts of your script.
Here’s a more complicated example of an awk if
statement. As you can see, even with complex, nested conditionals,
if statements look identical to their C counterparts:
{
if ( $1 == "foo" ) {
if ( $2 == "foo" ) {
print "uno"
} else {
print "one"
}
} else if ($1 == "bar" ) {
print "two"
} else {
print "three"
}
}
Using if statements, we can also transform this code:
! /matchme/ { print $1 $3 $4 }
to this:
{
if ( $0 !~ /matchme/ ) {
print $1 $3 $4
}
}
Both scripts will output only those lines that don’t contain
a matchme character sequence. Again, you can choose the
method that works best for your code. They both do the same thing.
Awk also allows the use of boolean operators “||” (for “logical or”) and “&&”(for “logical and”) to allow the creation of more complex boolean expressions:
( $1 == "foo" ) && ( $2 == "bar" ) { print }
This example will print only those lines where field one equals
foo and field two equals bar.
So far, we’ve either printed strings, the entire line, or specific fields. However, awk also allows us to perform both integer and floating point math. Using mathematical expressions, it’s very easy to write a script that counts the number of blank lines in a file. Here’s one that does just that:
BEGIN { x=0 }
/^$/ { x=x+1 }
END { print "I found " x " blank lines. :)" }
In the BEGIN block, we initialize our integer variable x
to zero. Then, each time awk encounters a blank line, awk will execute
the x=x+1 statement, incrementing x. After all
the lines have been processed, the END block will execute, and awk will
print out a final summary, specifying the number of blank lines it
found.
One of the neat things about awk variables is that they are “simple and stringy.” I consider awk variables “stringy” because all awk variables are stored internally as strings. At the same time, awk variables are “simple” because you can perform mathematical operations on a variable, and as long as it contains a valid numeric string, awk automatically takes care of the string-to-number conversion steps. To see what I mean, check out this example:
x="1.01"
##We just set x to contain the *string* "1.01"
x=x+1
##We just added one to a *string*
print x
##Incidentally, these are comments :)
Awk will output: 2.01
Interesting! Although we assigned the string value 1.01 to the
variable x, we were still able to add one to it. We wouldn’t be able to
do this in bash or python. First of all, bash doesn’t support floating
point arithmetic. And, while bash has “stringy” variables, they aren’t
“simple”; to perform any mathematical operations, bash requires that we
enclose our math in an ugly $(( )) construct. If we were
using python, we would have to explicitly convert our 1.01
string to a floating point value before performing any arithmetic on it.
While this isn’t difficult, it’s still an additional step. With awk,
it’s all automatic, and that makes our code nice and clean. If we wanted
to square and add one to the first field in each input line, we would
use this script: { print ($1^2)+1 }
If you do a little experimenting, you’ll find that if a particular variable doesn’t contain a valid number, awk will treat that variable as a numerical zero when it evaluates your mathematical expression.
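A quick sketch of that coercion rule (the input line is my own invention):

```shell
# "hello" is not a valid number, so awk treats it as 0 in arithmetic.
echo 'hello 5' | awk '{ print $1 + $2 }'
# prints: 5
```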
Another nice thing about awk is its full complement of mathematical operators. In addition to standard addition, subtraction, multiplication, and division, awk allows us to use the previously demonstrated exponent operator “^”, the modulo (remainder) operator “%”, and a bunch of other handy assignment operators borrowed from C.
These include pre- and post-increment/decrement (i++, --foo), add/sub/mult/div assign operators (a+=3, b*=2, c/=2.2, d-=6.2). But that’s not all - we also get handy modulo/exponent assign ops as well (a^=2, b%=4).
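To see a couple of the less common ones in action, here is a throwaway BEGIN block (my sketch, not the article's):

```shell
# a %= 4 is shorthand for a = a % 4; b ^= 3 for b = b ^ 3.
awk 'BEGIN { a = 7; a %= 4; b = 2; b ^= 3; print a, b }'
# prints: 3 8
```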
Awk has its own complement of special variables. Some of them allow you to fine-tune how awk functions, while others can be read to glean valuable information about the input. We’ve already touched on one of these special variables, FS. As mentioned earlier, this variable allows you to set the character sequence that awk expects to find between fields. When we were using /etc/passwd as input, FS was set to “:”. While this did the trick, FS allows us even more flexibility.
The FS value is not limited to a single character; it can also be set
to a regular expression, specifying a character pattern of any length.
If you’re processing fields separated by one or more tabs, you’ll want
to set FS like so: FS="\t+".
Above, we use the special “+” regular expression character, which means “one or more of the previous character”.
If your fields are separated by whitespace (one or more spaces or
tabs), you may be tempted to set FS to the following regular expression:
FS="[[:space:]]+".
While this assignment will do the trick, it’s not necessary. Why? Because by default, FS is set to a single space character, which awk interprets to mean “one or more spaces or tabs.” In this particular example, the default FS setting was exactly what you wanted in the first place!
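You can verify the default behavior with a quick sketch (the sample input is mine):

```shell
# With FS at its default, leading blanks are ignored and any run of
# spaces or tabs counts as a single separator.
printf '   alpha\t\tbeta  gamma\n' | awk '{ print NF, $2 }'
# prints: 3 beta
```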
Complex regular expressions are no problem. Even if your fields are
separated by the word “foo,” followed by three digits, the following
regular expression will allow your data to be parsed properly:
FS="foo[0-9][0-9][0-9]".
The next two variables we’re going to cover are not normally intended to be written to, but are normally read and used to gain useful information about the input. The first is the NF variable, also called the “number of fields” variable. Awk will automatically set this variable to the number of fields in the current record. You can use the NF variable to display only certain input lines:
NF == 3 { print "this particular record has three fields: " $0 }
Of course, you can also use the NF variable in conditional statements, as follows:
{
if ( NF > 2 ) {
print $1 " " $2 ":" $3
}
}
The record number (NR) is another handy variable. It will always
contain the number of the current record (awk counts the first record as
record number 1). Up until now, we’ve been dealing with input files that
contain one record per line. For these situations, NR will also tell you
the current line number. However, when we start to process multi-line
records later in the series, this will no longer be the case, so be
careful! NR can be used like the NF variable to print only certain lines
of the input:
(NR < 10) || (NR > 100) { print "We are on record number 1-9 or 101+" }
Another example:
{
#skip header
if ( NR > 10 ) {
print "ok, now for the real information!"
}
}
Awk provides additional variables that can be used for a variety of purposes. We’ll cover more of these variables in later articles. We’ve come to the end of our initial exploration of awk. As the series continues, I’ll demonstrate more advanced awk functionality, and we’ll end the series with a real-world awk application. In the meantime, if you’re eager to learn more, check out the resources listed below.
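As a parting sketch of my own that ties several of these pieces together (BEGIN, FS, a pattern, and END) on some inline /etc/passwd-style data:

```shell
# Count records whose last field (the login shell) is /bin/bash.
printf 'root:x:0:0::/root:/bin/bash\ndaemon:x:1:1::/usr/sbin:/usr/sbin/nologin\n' |
  awk 'BEGIN { FS=":" } $NF == "/bin/bash" { n++ } END { print n, "bash user(s)" }'
# prints: 1 bash user(s)
```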
AWK is a standard tool on every POSIX-compliant UNIX system. It’s like a stripped-down Perl, perfect for text-processing tasks and other scripting needs. It has a C-like syntax but without semicolons, manual memory management or static typing. It excels at text processing. You can call it from a shell script or you can use it as a stand-alone scripting language.
Why use AWK instead of Perl? Mostly because AWK is part of UNIX. You can always count on it, whereas Perl’s future is in question. AWK is also easier to read than Perl. For simple text-processing scripts, particularly ones that read files line by line and split on delimiters, AWK is probably the right tool for the job.
AWK programs consist of a collection of patterns and
actions. The most important pattern is called BEGIN.
Actions go into brace blocks. BEGIN will run at the
beginning of the program. It’s where you put all the preliminary set-up
code before you process any text files. If you have no text files, then
think of BEGIN as the main entry point. Variables are
global. Just set them or use them, no need to declare…
#!/usr/bin/awk -f
# Comments are like this
BEGIN {
count = 0
# Operators just like in C and friends
a = count + 1
b = count - 1
c = count * 1
d = count / 1 # division (always floating point in awk)
e = count % 1 # modulus
f = count ^ 1 # exponentiation
a += 1
b -= 1
c *= 1
d /= 1
e %= 1
f ^= 1
# Incrementing and decrementing by one
a++
b--
# As a prefix operator, it returns the incremented value
++a
--b
# Notice, also, no punctuation such as semicolons to terminate statements
# Control Statements
if ( count == 0 )
print "Starting with count of 0"
else
print "Huh?"
# Or you could use the ternary operator
print ( count == 0 ) ? "Starting with count of 0" : "Huh?"
# Blocks consisting of multiple lines use braces
while ( a < 10 ){
print "String concatenation is done" " with a series" " of" " space-separated strings"
print a
a++
}
for ( i = 0; i < 10; i++ )
print "Good ol' for loop"
# As for comparisons, they're the standards:
a < b # Less than
a <= b # Less than or equal
a != b # Not equal
a > b # Greater than
a >= b # Greater than or equal to
# Logical operators as well
a && b # AND
a || b # OR
# In addition, there's the super useful regular expression match
if ("foo" ~ "^fo+$")
print "Fooey!"
if ("boo" !~ "^fo+$")
print "Boo!"
# Arrays
arr[0] = "foo"
arr[1] = "bar"
# Unfortunately, there is no array literal syntax.
# You just have to assign each element like that.
# You also have associative arrays
assoc["foo"] = "bar"
assoc["bar"] = "baz"
# And multi-dimensional arrays, with some limitations I won't mention here
multidim[0,0] = "foo"
multidim[0,1] = "bar"
multidim[1,0] = "baz"
multidim[1,1] = "boo"
# You can use the "in" operator to traverse the keys of an array
for (key in assoc)
print assoc[key]
# The command line is in a special array called ARGV
for (argnum in ARGV)
print ARGV[argnum]
# You can remove elements of an array
# This is particularly useful to prevent AWK from assuming the arguments are files for it to process
delete ARGV[1]
# The number of command line arguments is in a variable called ARGC
print ARGC
AWK has several built-in functions. They fall into three categories. I’ll demonstrate each of them in their own functions defined later.
return_value = arithmetic_functions(a, b, c)
string_functions()
io_functions()
}
Probably the most annoying part of AWK is that there are no local variables. Everything is global. For short scripts this is fine, even useful, but for longer scripts it can be a problem. There is a work-around (ahem, hack): function arguments are local to the function, and AWK allows you to define more function arguments than it needs. So just stick local variables in the function declaration, as in the functions below. As a convention, stick in some extra whitespace to distinguish actual function parameters from local variables. In the next example, a, b and c are actual parameters while d is merely a local variable.
# Here's how you define a function
function arithmetic_functions(a, b, c, d){
# Now to demonstrate the arithmetic functions
# Most AWK implementations have some standard trig functions
localvar = sin(a)
localvar = cos(a)
localvar = atan2(a, b) # arc tangent of a / b
# And logarithmic stuff
localvar = exp(a)
localvar = log(a)
# Square root
localvar = sqrt(a)
# Truncate floating point to integer
localvar = int(5.34) # localvar => 5
# Random numbers
srand() # Supply a seed as an argument. By default, it uses the time of day
localvar = rand() # Random number between 0 and 1
# Here's how to return a value
return localvar
}
AWK, being a string-processing language, has several string-related functions, many of which rely heavily on regular expressions.
function string_functions( localvar, arr){
# Search and replace, first instance (sub) or all instances (gsub)
# Both return number of matches replaced
localvar = "fooooobar"
sub("fo+", "Meet me at the ", localvar) # localvar => "Meet me at the bar"
gsub("e+", ".", localvar) # "m..t m. at th. bar"
# Search for a string that matches a regular expression
# index() does the same thing, but doesn't allow a regular expression
match(localvar, "t") # => 4, since the 't' is the fourth character
# Split on a delimiter
split("foo-bar-baz", arr, "-") # arr => ["foo", "bar", "baz"]
# Other useful stuff
sprintf("%s %d %d %d", "Testing", 1, 2, 3) # => "Testing 1 2 3"
substr("foobar", 2, 3) # => "oob"
substr("foobar", 4) # => "bar"
length("foo") # => 3
tolower("FOO") # => "foo"
toupper("foo") # => "FOO"
}
AWK doesn’t have file handles, per se. It will automatically open a file handle when you use something that needs one. The string you used for this can then be treated as a file handle for purposes of I/O. This makes it feel sort of like shell scripting:
function io_functions( localvar){
# You've already seen print
print "Hello world"
# There's also printf
printf("%s %d %d %d\n", "Testing", 1, 2, 3)
print "foobar" > "/tmp/foobar.txt"
# Now the string "/tmp/foobar.txt" is a file handle. You can close it:
close("/tmp/foobar.txt")
# Here's how you run something in the shell
system("echo foobar") # => prints foobar
# Reads a line from standard input and stores in localvar
getline localvar
# Reads a line from a pipe
"echo foobar" | getline localvar # localvar => "foobar"
close("echo foobar")
# Reads a line from a file and stores in localvar
getline localvar <"/tmp/foobar.txt"
close("/tmp/foobar.txt")
}
As I said at the beginning, AWK programs consist of
a collection of patterns and actions. You’ve already seen the
all-important BEGIN pattern. Other patterns are used only if
you’re processing lines from files or standard input.
When you pass arguments to AWK, they are treated as
file names to process. It will process them all, in order. Think of it
like an implicit for loop, iterating over the lines in these files.
These patterns and actions are like switch statements inside the loop.
/^fo+bar$/ { this action will execute for every line that
matches the regular expression and will be skipped for any line that
fails to match it. Let’s just print the line: print.
Whoa, no argument! That’s because print has a default argument:
$0. This is the current line being processed.
It is created automatically for you. You can probably guess there are
other $ variables. Every line is implicitly split before
every action is called, much like the shell does. And, like the shell,
each field can be accessed with a dollar sign.
print $2, $4 will print the second and fourth fields.
AWK automatically defines many other variables to help
you inspect and process each line. The most important one is
NF. print NF prints the number of fields on
this line whilst print $NF prints the last field.
} remember to close the action! Recall that we started
with /^fo+bar$/ {.
Every pattern is actually a boolean test. The regular expression in
the last pattern is also a boolean test, but part of it was hidden. If I
don’t give it a string to test, it assumes $0, the line
that it’s currently processing. Thus, the complete version of it is
this:
$0 ~ /^fo+bar$/ {
print "Equivalent to the last pattern"
}
a > 0 {
# This will execute once for each line, as long as a is positive
}
You get the idea. Processing text files, reading in a line at a time and doing something with it, particularly splitting on a delimiter, is so common in UNIX that AWK is a scripting language that does all of it for you, without you needing to ask. All you have to do is write the patterns and actions based on what you expect of the input, and what you want to do with it.
Here’s a quick example of a simple script, the sort of thing AWK is perfect for. It will read a name from standard input and then will print the average age of everyone with that first name. Let’s say you supply as an argument the name of this data file:
Bob Jones 32
Jane Doe 22
Steve Stevens 83
Bob Smith 29
Bob Barker 72
Here’s the script:
BEGIN {
# First, ask the user for the name
print "What name would you like the average age for?"
# Get a line from standard input, not from files on the command line
getline name < "/dev/stdin"
}
# Now, match every line whose first field is the given name
$1 == name {
# Inside here, we have access to a number of useful variables, already
# pre-loaded for us:
# $0 is the entire line
# $3 is the third field, the age, which is what we're interested in here
# NF is the number of fields, which should be 3
# NR is the number of records (lines) seen so far
# FILENAME is the name of the file being processed
# FS is the field separator being used, which is " " here
# …etc. There are plenty more, documented in the man page.
# Keep track of a running total and how many lines matched
sum += $3
nlines++
}
Another special pattern is called END. It will run after
processing all the text files. Unlike BEGIN, it will only
run if you’ve given it input to process. It will run after all the files
have been read and processed according to the rules and actions you’ve
provided. The purpose of it is usually to output some kind of final
report, or do something with the aggregate of the data you’ve
accumulated over the course of the script.
END {
if (nlines)
print "The average age for " name " is " sum / nlines
}
Fabrizio (Fritz) Stelluto
In this age of npm and github and easily available modules in any language of your choice, it is easy to forget the old Unix workhorses. Here’s a look at awk, a shell utility that allows you to treat and manipulate text files as if they were databases.
Awk is both the name of the command line utility, and
the language used for it. It was invented at Bell Labs at the peak of
punk rock, 1977, and its name is simply the initials of its three
creators. Awk reads input (a file, or a stream) one line (one “record”)
at a time, splits it into fields by blank space (these are all
defaults that can be changed), and then uses the instructions in the awk
language to manipulate these fields and generate some output. The
ability to read files as streams is a big plus - it means the memory
footprint is the same if you read a file of 1Kb or 200Tb; for a larger
file it will just take longer.
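A minimal sketch of that record/field model (the sample data is invented) - each line is a record, split on blanks into $1, $2, and so on:

```shell
# Swap the first two fields of every record as it streams through
printf '1977 awk\n1991 python\n' | awk '{ print $2, $1 }'
# prints: awk 1977
#         python 1991
```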
Awk comes standard with OS X and virtually every other Unix-like system. There is another widespread variant - gawk, GNU awk. It is arguably better than the original, because it offers array sort and length functions, the ability to include files, and more flexible rules for splitting input into fields. Here I will limit myself to the standard awk.
Here’s what the simplest awk program looks like - this is basically
cat
# awk loads the short program {print} and waits for the user to type stuff
> awk '{print}'
# as you type, the shell prints out what you are typing. Awk is waiting
# for a <RETURN> outside a ''
It was a bright cold day in April, and the clocks were striking thirteen.
# now awk kicks in and runs the program on the input
# {print} simply prints the input line as it is, so here it is again
It was a bright cold day in April, and the clocks were striking thirteen.
The strong point of awk is that it automatically splits
lines of text as if they were “columns” in a spreadsheet and assigns
each column to a variable (a “field”). Then you can manipulate them and
spit them out
# awk loads a slightly more complex program and waits
> awk '{print $3 ": " $1 + $2}'
# waiting for a <RETURN> outside a ''
10 20 Toronto
# this line is split into 3 "columns", and
# 10 is assigned to $1, 20 to $2, and Toronto to $3
# then the program {print $3 ": " $1 + $2} is run - it adds $1 + $2 and
# prints the result out, with some extra text (the :)
Toronto: 30
# now it waits for the next line
20 30 Miami
# same program run on it
Miami: 50
Despite its simplicity, you can take awk quite far - for example creating a random sci-fi plot generator.
Running awk on STDIN is not very useful, but of course you can use Unix magic to redirect the input and / or output of the program
# awk will treat the second argument as a path to a file to read from
> awk '{print}' some_data.txt
... # prints whatever was in some_data.txt
>
# exactly the same thing but done differently - redirecting file to STDIN
> awk '{print}' < some_data.txt
... # prints whatever was in some_data.txt
>
# you can read several files, in order
> awk '{print}' some_data.txt more_data.txt
...
>
# now the processed data goes to a separate file
> awk '{print}' some_data.txt > result.txt
>
# the awk program itself can be loaded to a file - here this file is created
> echo '{print}' > awk.txt
# passing the command on to awk with the -f option
> awk -f awk.txt some_data.txt > result.txt
>
# mixing STDIN with files. The "-" is substituted by STDIN, which is dealt with
# after some_data.txt
> ls -l | awk '{print}' some_data.txt - more_data.txt
> ... # prints all lines from some_data.txt
> ... # prints result of ls -l (this is the "-")
> ... # prints all lines from more_data.txt
>
# pass some text into awk, then run an awk program on it
> echo '1 2 3' | awk '{print}'
> 1 2 3
>
# using the curl util to download a csv file, piping it to awk, and running
# the simple awk program on it
> curl http://is.gd/eUrbOZ | awk '{print}'
Forename,Surname,Description on ballot paper,Constituency Name,PANo,Votes,Share
... # etc
So far the awk examples consisted of simple one-liners - but awk programs can consist of several instructions ("actions"). You can still write them out on the shell:
# note: the ">" is added automatically when hitting return inside a '',
# and the space between > and { was added manually to make it line up
> awk '{print}
> {print}
> {print}'
# now that the closing ' was typed, awk kicks in. this programs simply
# prints out whatever you type three times
oh # typed by you
oh # printed by awk 3 times
oh
oh
In this tutorial I will put the awk program in its own file and load it from the command line - just to make formatting easier and to allow comments. The file loaded here has the suffix ".awk" but that's irrelevant - it could be any filename.
> awk -f example.awk some_input_text.txt
An awk program consists of a list of actions, one after the other, and typically one per line (they can be broken up though). There are two special types of actions - BEGIN actions are executed only once, before the text is scanned, and END only once, afterwards. All other actions are executed in order on every line of text. Assume your input file includes increasing integers, one per line
1
2
3
Then the program below
BEGIN { print "START!" }
{print "--------------"}
{print}
{print}
{print}
END { print "END!" }
Would produce
START!
--------------
1
1
1
--------------
2
2
2
--------------
3
3
3
END!
Note that actions can be in any order (they will be executed in the order they are written) and there can be multiple BEGIN and END actions, so the following is also a legal program.
{print "--------------"}
{print}
END { print "END!" }
BEGIN { print "START!" }
{print}
END { print "Copyright 2005" }
{print}
The program above is handled like this: all BEGIN actions run first, in source order; then the remaining actions run, in source order, on every line of input; finally all END actions run, again in source order.
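A quick sketch confirming the ordering - BEGIN first, per-line actions next, END last, wherever they appear in the program text:

```shell
# The actions are deliberately out of order in the program text
printf '1\n' | awk 'END { print "END!" } BEGIN { print "START!" } { print }'
# prints: START!
#         1
#         END!
```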
Inside the actions awk offers what most programming languages offer - variables, loops, tests, etc.
Awk follows Unix conventions on most things, so in case of doubt whatever works in Bash scripts tends to work.
# a 'normal' one-liner
BEGIN { print "START" }
# you can add newlines for formatting - this is equivalent to the above
BEGIN {
print "START"
}
-> START
-> START
# as in Bash scripts, you can use the semicolon to separate multiple statements
# on the same line...
BEGIN { print "STA"; print "RT";}
# or you can write them one per line, with or without semicolon
BEGIN { print "STA"
print "RT" }
-> STA
RT
-> STA
RT
Awk makes several variables available to programs - some are loaded when the program is launched, some are updated with each line read, and some are created by the program itself.
Whenever awk reads a line, it splits it into "fields" by whitespace (this is the default and can be overridden). Then each field is copied to a variable $1, $2, … in order - there is no limit. Additionally, $0 contains the whole line.
# assume this file
1 2 3 4 5 6 7
# the following two lines are equivalent
{ print }
{ print $0 }
-> 1 2 3 4 5 6 7
-> 1 2 3 4 5 6 7
# only prints some fields we are interested in
{ print $1 " " $3 }
-> 1 3
The field number doesn't have to be a constant - it can be an expression or a variable. For example, the global variable NF contains the number of the last field and is updated with every line read. So if there are 7 fields, NF will be 7, and $NF will be $7, i.e. the last field.
# assume this file
1 2 3 4 5 6 7
# both mean first and last field - but the first version only works if there are
# 7 fields, the second always works
{print $1 " " $7}
{print $1 " " $NF}
-> 1 7
-> 1 7
#
# print the last two fields
{print $(NF-1) " " $NF}
-> 6 7
Another useful global variable that gets updated for each record is NR - the record number.
# feed a four line input into awk
> echo 'a
> b
> c
> d' | awk '{print NR ") " $1}'
# it prints the line number, ), and the first (and only) field
1) a
2) b
3) c
4) d
You can assign to a field variable with the '=' operator, thereby changing the record content:
# adding something to a field - only works if it's a number
{$3 = ($3 + 100)
# now print the updated line
print $0}
-> 1 2 103 4 5 6 7
If you assign to a field variable that doesn't exist, it will be added to the record.
# the record only contains $1 and $2;
> echo 1 2 | awk '{print $0}'
> 1 2
>
# the program adds two new fields
> echo 1 2 | awk '{$3 = 3; $4 = 4; print $0}'
> 1 2 3 4
A few variables are set when the program is launched. Here's a very short list - if you need to play with these you probably want to get yourself a book on awk.
| Variable | |
|---|---|
| ARGV | array of command line arguments |
| ARGC | number of command line arguments |
| ENVIRON | associative array with environment. Depends on system |
| FILENAME | self explanatory |
To create your own variables, just start assigning to them with the '=' operator - awk will initialize them to an empty string (which becomes a 0 if used in a numeric context). The type of a variable is dynamic and can vary during its lifetime.
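A minimal sketch of that initialization rule:

```shell
# x has never been assigned: it is "" as a string and 0 as a number
awk 'BEGIN { print "[" x "]"; print x + 1 }'
# prints: []
#         1
```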
In the example below, awk is used on the ls command to find the total size of a folder.
# ls -la returns listings in the form:
# -rw-rw-r-- 1 gotofritz staff 1513 Dec 15 2013 .bash_profile
# awk simply collects each filesize and adds it to a running total,
# then prints it at the end
> ls -la | awk ' { total += $5 }
END { print total }'
-> 158448
Awk has associative arrays, similar to PHP's or JavaScript's. You create an array by using it - no need to initialize it.
# assume this file
10 Life changes fast
20 Life changes in the instant
30 You sit down to dinner and life as you know it ends
40 The question of self-pity
# creates my_array and inserts the rest of each line into it
# (substr strips the first field and the following space - assuming a
# single space after the first field)
{ my_array[$1] = substr($0, length($1) + 2) }
# creates - note that array is sparse
my_array[10] = "Life changes fast"
my_array[20] = "Life changes in the instant"
my_array[30] = "You sit down to dinner and life as you know it ends"
my_array[40] = "The question of self-pity"
# string keys are also possible
{ my_array["name"] = "Homer" }
One thing that is different in awk is that multidimensional arrays use a single set of square brackets to wrap both indices.
# assume this file
dad homer
mum marge
son bart
# creates a two dimensional array
{ family["simpsons",$1] = $2 }
-> creates
family["simpsons","dad"] = "homer"
family["simpsons","mum"] = "marge"
family["simpsons","son"] = "bart"
Note that arrays in awk are pretty awkward. There are no built-in functions to deal with them except for the for … in loop. If you need to sort, or even just find out the length, you'll have to write your own functions. Alternatively, use gawk, which has better array handling.
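A small sketch of both points - building an associative array just by using it, and rolling your own length with for … in (variable names invented):

```shell
# Count how often each word occurs, then count the distinct words
printf 'red\nblue\nred\nred\n' | awk '
  { seen[$1]++ }
  END {
    n = 0
    for (word in seen) n++
    print "distinct words: " n
    print "red appears " seen["red"] " times"
  }'
# prints: distinct words: 2
#         red appears 3 times
```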
A regular expression (regexp) is a mini programming language which is used to describe variable strings; it is embedded in most programming languages. Regexps are enclosed in slashes and use a combination of literal characters and punctuation to describe strings. The operator ~ is used to match a regexp, and !~ to ensure it is not matched.
Regular expressions are a complicated topic of their own; here is just a quick introduction
# print all lines with "gmail" in the 1st field
{ if ($1 ~ /gmail/) print}
# prints all lines EXCEPT those with "gmail"
{ if ($1 !~ /gmail/) print}
# ^ indicates start of string.
# this matches "tom" "tomato" but not "atom"
{ if ($1 ~ /^tom/) print}
# $ indicates end of string
# this matches "tom", "atom" but not "tomato"
{ if ($1 ~ /tom$/) print}
# this matches "tom", not "atom" and not "tomato"
{ if ($1 ~ /^tom$/) print}
# . matches any character.
# this matches "bear" "boar" but not "bar"
{ if ($1 ~ /b..r/) print}
# [ABC] matches one character from the set "A", "B", "C"
# this matches "boar" "bear" but not "blar"
{ if ($1 ~ /b[oe]ar/) print}
# [^ABC] matches one character which is anything except "A", "B", "C"
# this matches "blar" but neither "boar" nor "bear"
{ if ($1 ~ /b[^oe]ar/) print}
# (abc) groups the expression abc as a unit.
# | is an "or"
# \ is used to scape special characters, i.e. treat them as normal characters
# in this case we want to treat the '.' as a period and not "any character"
# the following matches @gmail.com or @yahoo.com
{ if ($1 ~ /@(gmail|yahoo)\.com/) print}
# * means repeat zero or more. + is repeat once or more. ? is repeat 0 or 1
{ if ($1 ~ /<[^>]+>[^<]*<\/[^>]+>\.?/) print }
# the following matches < followed by one or more (+) of anything except >, then >
# then zero or more (*) of anything except <
# then </ followed by one or more (+) of anything except >, then >
# then an optional .
Awk has the usual loops and conditionals familiar from C. Braces are optional for single nested statements
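A runnable sketch combining a conditional with the ~ operator (the sample words are invented):

```shell
# ^tom. requires "tom" at the start followed by at least one more character
printf 'tom\ntomato\natom\n' | awk '{ if ($1 ~ /^tom./) print $1 }'
# prints: tomato
```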
# braces are optional for single statements
for (name in list_of_names)
print name
for (capital_city in country) {
print capital_city
}
# but needed for multiple statements
if (NR % 2 == 0) {
$2 = $1 * 2
print $0
}
Awk doesn't have booleans. Instead it treats the number 0 or the empty string "" as false, and any other value (including the string "0") as true. The comparison operators are the familiar ones, with a double equal sign for equality, plus the tilde ~ and !~ for regular expression matching, and "in" for array existence
{ if ($1 == "full") ... }
{ if ($2 < 0.5) ... }
{ if ($0 ~ /Republican/) print $0 } ... # matches regexp
{ if ($1 !~ /Completed/) print $0 } ... # rejects regexp
{ if (capital_city in country) print country[capital_city] }
Awk has both for and while loops (including do-while). Additionally, there is the for-in loop for sparse arrays
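The truthiness rules above are easy to check directly:

```shell
# Only the string "0" passes: 0 and "" are false, any non-empty string is true
awk 'BEGIN {
  if (0)   print "the number 0 is true"
  if ("")  print "the empty string is true"
  if ("0") print "the string \"0\" is true"
}'
# prints: the string "0" is true
```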
# assume file
1 10 100
2 20 200
# both these programs will print each line with fields back-to-front
# while loop version...
{ i = NF
line = ""
while (i) {
line = line " " $i
i--
}
print line
}
# for loop version
{ line = ""
for (i=NF; i>0; i--) {
line = line " " $i
}
print line
}
-> 100 10 1
200 20 2
# puts each line of input into the array (NR is the record number)
{ lines[NR] = $0 }
# at the end prints all the lines - note that for-in order is not guaranteed
END {
for (line in lines)
print line
}
break and continue statements are available to exit a loop prematurely or skip an iteration, respectively. next is used to stop processing a record and move on to the next one
{ if ($5 == "") next }
{ print $5 $4 }
The usual maths operators can be used: +, -, /, *, ++, --, plus % for modulus and ^ for exponentiation. Unary + converts to a number
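A quick sketch of % and of forcing a string into a number:

```shell
# 7 % 3 is the remainder; adding 0 converts the string "3.5kg" to 3.5
awk 'BEGIN { print 7 % 3; print "3.5kg" + 0 }'
# prints: 1
#         3.5
```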
echo "1
> 2
> 3
> 4" | awk '{print $1 ^ 2}'
1
4
9
16
Concatenating strings in awk is slightly weird. There is no string
concatenation operator; you just put the strings next to each other. Because
of that it is recommended to use parentheses except in trivial cases.
Alternatively, print can take multiple comma separated
arguments - and they will be printed with a space separating them
# assume this file
1
2
3
4
# the strings ($1..) and "a" are concatenated (no space between them) and
# the resulting string is passed to print
{print ($1+2) "a"}
3a
4a
5a
6a
# two separate strings are passed to print - a space is put between them
{print ($1+2), "a"}
3 a
4 a
5 a
6 a
# string concatenation works for variables too
{ something = $1 "--"
print something }
1--
2--
3--
4--
There are a number of built-in functions: numeric ones like cosine,
square root, random; string functions like print or string length; time
functions and bitwise functions. You can easily find out what they are
by looking at the output of man awk.
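A few of them in a runnable sketch - these are all standard awk functions:

```shell
awk 'BEGIN {
  print length("awkward")        # number of characters: 7
  print substr("awkward", 1, 3)  # first three characters: awk
  print index("awkward", "w")    # position of the first "w": 2
  print toupper("awk")           # AWK
  print sqrt(49)                 # 7
}'
```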
Worth noting that besides print awk also offers
printf, i.e. “print formatted”. Printf is common to many
Unix tools and languages. You give a string with some placeholders and
rules, and then you pass variables to “plug in” those placeholders. The
important thing is the rules, which control things like right alignment,
decimal precision, zero padding for numbers, etc. A statement looks like
this:
{ printf "%-10s %07.3f%% \n", $1, $2 }
# placeholders start with %
# %-10s is a string (s), and is left aligned (-) within a field 10 spaces wide (10)
# %07.3f is a decimal number or float (f); the total length has to be at least 7
# characters (7), it is padded with zeroes if too short (0), and it has
# 3 decimals (3), leaving at least 3 digits for the integer part (7 - 3 decimals - the point)
# %% if you want to print an actual %, you need to type it twice %%
# \n you need to supply the new line manually
You can define functions anywhere in your code, outside actions. They are pretty similar to JavaScript's.
# define function outside rules - could be at the bottom of the file
function my_func(field_content) {
print "FIELD: " field_content
}
# now use in rules
{my_func($1)}
Previously I described an awk program as a series of actions, with the special case of BEGIN and END. That's not entirely correct. An awk program consists of a sequence of actions and optional patterns; BEGIN and END are two special patterns. Incidentally, gawk also has a BEGINFILE and ENDFILE, for when processing more than one file at a time.
BEGIN and END are special because they identify actions which are not executed for every line of input, but before or after the whole run. The other patterns are evaluated on every line to determine whether the action should be run for that particular line or not. Patterns are expressions that return false (i.e., 0 or "") or true (anything else). When the pattern returns true, the rule is executed.
Regular expressions can be used as patterns; they match the entire line. An exclamation mark reverses the match. Boolean operators can be used to combine patterns
# for lines with an email address, print the third field
# (very lazy match - will only work if all email addresses are well formed)
/@/ { print $3}
# prints all lines except those with a gmail address
! /@gmail\./ { print $0 }
# prints lines with an @ and the sequence 0160
/@/ && /0160/ { print }
The regular expressions above are a shortcut for
$0 ~ /pattern/, i.e. “apply the regexp to whole line”.
Similar rules can be made for individual fields…
# matches the regexp on one field only
$1 ~ /Anthony/ { print }
…and so can all the expressions seen so far
# print even lines
NR % 2 == 0 { print }
# print only if length of 1st field is greater than 3
# length is a string function mentioned above
length($1) > 3 { print }
The reason we have been able to run programs without patterns is
because there is a special pattern, the empty pattern, which matches
every line. In fact we could have a program which is just a pattern; the
default action {print} would be executed.
# prints whole line, default action
$1 == "complete"
By default awk treats each line as a record. In reality what it does is split the input by a record separator, stored in the variable RS, which happens to default to the newline character. You can change that in an awk program.
# separate records by semicolon
> echo "1 2 3;4 5 6;7 8 9" | awk 'BEGIN {RS = ";" }
> {print}'
1 2 3
4 5 6
7 8 9
Something similar is possible with the field separator, which is
stored in the variable FS. By default it is equivalent to the regexp
[ \t\n]+, i.e. any number of consecutive spaces of any type.
Note that in reality awk cheats - leaving FS default doesn’t just mean
setting it to [ \t\n]+, but also trimming $0 of leading and
trailing empty space before processing it.
# separate fields by comma
> echo "1,2,3
4,5,6
7,8,9" | awk 'BEGIN {FS = "," }
> {print}'
1,2,3
4,5,6
7,8,9
You can combine the two together if, for example, your data has one field per line and records are separated by blank lines - an empty RS means "any number of consecutive newlines"
# assume this data
homer simpson
dad
marge simpson
mum
# separate records by any number of newlines, and have one field per line
BEGIN {RS=""; FS="\n"}
{ print $1 " (" $2 ")" }
-> homer simpson (dad)
marge simpson (mum)
A field separator (but not a record separator) can also be passed to an awk program in two ways. First of all, awk has a special option for it, -F (note that there is no space between it and the separator). And awk allows passing variables with the -v syntax, so you could just pass FS that way.
# change separator from within program
BEGIN {FS = "," }
# pass separator with special option -F - note that you don't need quotes
> echo "1,2,3
> 4,5,6" | awk -F, '{print}'
1,2,3
4,5,6
# pass separator as external var with -v
> echo "1,2,3
> 4,5,6" | awk -v FS="," '{print}'
1,2,3
4,5,6
# in fact you can pass any variable of your choice with -v
> echo "" | awk -v WHAT="grow up" '{print "All children, except one, " WHAT}'
All children, except one, grow up
The naive approach would be to simply set FS="," - but that doesn't cover the fact that some fields are surrounded by quotation marks and others aren't, and sometimes you have newlines and / or commas inside a field. Here are some example scripts people have put together to solve these issues. They are also good examples of fairly complex awk scripts.
Personally I think that’s taking things too far - if you have to force awk to create arrays to store manipulated record fragments you may as well use a fully fledged scripting language.
Another approach is to use gawk, and its FPAT variable
One area where standard awk falls short is dealing with input files from your standard desktop applications - basically, CSV files from Office. There was no CSV standard until recently, and the CSV generated by MS Office doesn't work too well with awk. The main issue is that CSV is basically a rubbish format which suffers from a few problems: the comma is too common a character to be used as a separator (why didn't they choose tab?), newlines are saved without being converted to a safe sequence, and sometimes fields are surrounded by quotes and sometimes they aren't.
But if you have a “well behaved” CSV file, i.e. one which doesn’t
have commas, quotation marks, or new lines inside fields, e.g
UK,London,10000 then you can easily process it by passing
-F"," to the awk call:
awk -F"," -f my_awk_script.awk some_input_data.txt
In practice, unless you generate the data yourself, there is always going to be the odd comma or quotation mark in your data somewhere; the safest and most reliable course of action is to use tab as the separator. I use a free online CSV to TSV converter, then call awk with "\t" as the separator.
The command below is what I use - my awk program is in
my_awk_script.awk, the data is in
uk_electoral_data_converted.csv, and the results goes into
awk_output.txt.
awk -F"\t" -f my_awk_script.awk uk_electoral_data_converted.csv > awk_output.txt
Some of the scripts will use data from the 2015 UK election in CSV format, converted to TSV. Here's what it looks like:
Forename Surname Description on ballot paper Constituency Name PANO Votes Share (%) Change FIELD9 Incumbent? FIELD11 Constituency ID Region ID County Region Country Constituency type Party name identifier Party abbreviation
Gerald Howarth The Conservative Party Candidate Aldershot 7 23369 50.6 3.9 MP E14000530 E12000008 Hampshire South East England Borough Conservative Con
Gary Puffett Labour Party Aldershot 7 8468 18.3 6.2 E14000530 E12000008 Hampshire South East England Borough Labour Lab
Bill Walker UK Independence Party (UKIP) Aldershot 7 8253 17.9 13.4 E14000530 E12000008 Hampshire South East England Borough UK Independence Party UKIP
...
And here are the field names in order
1 Forename
2 Surname
3 Description on ballot paper
4 Constituency Name
5 PANO
6 Votes
7 Share (%)
8 Change
9 --
10 Incumbent?
11 --
12 Constituency ID
13 Region ID
14 County
15 Region
16 Country
17 Constituency type
18 Party name identifier
19 Party abbreviation
# put NR > 1 in front of every action to skip the header row
NR > 1 { print }
# result
Gerald Howarth The Conservative Party Candidate Aldershot ...
...
# skip empty records - NF is 0 for an empty line
NF { print }
NR > 1 && NF { print $4 ": " $2 " " $1 " (" $NF ") " $7 "% " }
## NR > 1 ignore header
## NF ignore empty record
# print constituency name, name surname, party abbreviation, share of the vote
# ignore other fields
Aldershot: Howarth Gerald (Con) 50.6%
Aldershot: Puffett Gary (Lab) 18.3%
Aldershot: Walker Bill (UKIP) 17.9%
Aldershot: Hilliar Alan (LD) 8.8%
Aldershot: Hewitt Carl (Green) 4.4%
...
Find the national Conservative vote
$NF == "Con" {total += $6}
END {print total}
# $NF == "Con" if a record concerns a tory vote
# {total += $6} add it to a running total
## END when all records are processed
# {print total} output total
11299609
Find the total vote of the 6 largest parties and their % of the national vote
# keep a running total
NR > 1 && NF {total += $6}
# keep a total for each party - don't do anything yet
$NF == "Con" {total_con += $6}
$NF == "Lab" {total_lab += $6}
$NF == "UKIP" {total_ukip += $6}
$NF == "LD" {total_ld += $6}
$NF == "Green" {total_green += $6}
$NF == "SNP" {total_snp += $6}
# print a report at the end
END {
print "TOTAL: " total
print "Con: " (100 * total_con / total) "%"
print "Lab: " (100 * total_lab / total) "%"
print "UKIP: " (100 * total_ukip / total) "%"
print "LD: " (100 * total_ld / total) "%"
print "SNP: " (100 * total_snp / total) "%"
print "Green: " (100 * total_green / total) "%"
}
# output:
TOTAL: 30697255
Con: 36.8098%
Lab: 30.449%
UKIP: 12.6431%
LD: 7.87014%
SNP: 4.738%
Green: 3.77112%
The same as above, but without copy-and-paste code
# abstracting copy-and-paste code into a function
function print_party_percentage(party_name, party_vote, total_vote) {
print party_name " " (100 * party_vote / total_vote) "%"
}
# same program as before
NR > 1 && NF {total += $6}
$NF == "Con" {total_con += $6}
$NF == "Lab" {total_lab += $6}
$NF == "UKIP" {total_ukip += $6}
$NF == "LD" {total_ld += $6}
$NF == "Green" {total_green += $6}
$NF == "SNP" {total_snp += $6}
END {
print "TOTAL: " total
print_party_percentage("Con", total_con, total)
print_party_percentage("Lab", total_lab, total)
print_party_percentage("UKIP", total_ukip, total)
print_party_percentage("LD", total_ld, total)
print_party_percentage("SNP", total_snp, total)
print_party_percentage("Green", total_green, total)
}
# output - looks messier because previous program was manually formatted
TOTAL: 30697255
Con 36.8098%
Lab 30.449%
UKIP 12.6431%
LD 7.87014%
SNP 4.738%
Green 3.77112%Same as above, but using printf for formatting
function print_party_percentage(party_name, party_vote, total_vote) {
printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
# %5s a string (s) of fixed width 5 or more (5) aligned right (if it was -5 it would be left)
# %4.1f a number (f) with one decimal (.1) and total width 4 (4) aligned right (4)
# %% an actual %
}
# same program as before
NR > 1 && NF {total += $6}
$NF == "Con" {total_con += $6}
$NF == "Lab" {total_lab += $6}
$NF == "UKIP" {total_ukip += $6}
$NF == "LD" {total_ld += $6}
$NF == "Green" {total_green += $6}
$NF == "SNP" {total_snp += $6}
END {
print "TOTAL: " total
print_party_percentage("Con", total_con, total)
print_party_percentage("Lab", total_lab, total)
print_party_percentage("UKIP", total_ukip, total)
print_party_percentage("LD", total_ld, total)
print_party_percentage("SNP", total_snp, total)
}
# output
TOTAL: 30697255
Con: 36.8%
Lab: 30.4%
UKIP: 12.6%
LD: 7.9%
SNP: 4.7%
There is still some copy-and-paste code because we are hardcoding the parties. We can use arrays to group whatever parties we find.
# same formatting function as before
function print_party_percentage(party_name, party_vote, total_vote) {
printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
}
# skip empty and header lines
NR > 1 && NF {
# running total
total += $6
# create or update running total for current party
party_totals[$NF] += $6
}
# when all records are processed
END {
print "TOTAL: " total
# print a line for each party
for (party in party_totals)
print_party_percentage(party, party_totals[party], total)
}
# output - there are LOTS of tiny local parties
TOTAL: 30697255
UUP: 0.4%
Left Unity - Trade Unionists and Socialists: 0.0%
IZB: 0.0%
Respect: 0.0%
SSP: 0.0%
NSW: 0.0%
The 30-50 Coalition: 0.0%
.... and so on
Oh - it turns out that if you include all the novelty parties there are 132 of them across the UK. We need to sort the array and only print the top X items. And it turns out that is quite complicated.
Standard awk's arrays are not sortable. This was a design choice - only associative arrays are supported, so there is no order, hence they can't be sorted in any meaningful way. gawk, however, has two array sorting functions - how do they do it? They actually create a new associative array, with all the values from the original but none of the keys; the keys are replaced by new ones, in order. Then you use a for loop (not the standard for-in) to read the array "in order". This is all well and good if you don't need the keys, but I do (they are the names of the parties). Besides, I am using awk and not gawk.
The best approach is to create a new array with just the keys, sort that array, and then loop through it in order to find out which keys of the original array to read.
# kickstarts the sort process -
# puts all the sorted keys into a separate array
function homebrew_asort(original, processed) {
# before we use the array we must be sure it is empty
empty_array(processed)
original_length = copy_and_count_array(original, processed)
qsort(original, processed, 1, original_length)
return original_length
}
# removes all values
function empty_array(A) {
for (i in A)
delete A[i]
}
# awk doesn't even have an array size function... you also have to roll your own
function copy_and_count_array(original, processed,    size, key) {
for (key in original) {
# awk doesn't seem to like array[0] - so we start from 1
size++
processed[size] = key
}
return size
}
## Adapted from a script from awk.info
# http://awk.info/?quicksort
function qsort(original, keys, left, right, i, last) {
if (left >= right) return
swap(keys, left, left + int( (right - left + 1) * rand() ) )
last = left
for (i = left+1; i <= right; i++)
if (original[keys[i]] < original[keys[left]])
swap(keys, ++last, i)
swap(keys, left, last)
qsort(original, keys, left, last-1)
qsort(original, keys, last+1, right)
}
function swap(A, i, j, t) {
t = A[i]; A[i] = A[j]; A[j] = t
}
# same formatting function as before
function print_party_percentage(party_name, party_vote, total_vote) {
printf "%5s: %4.1f%%\n", party_name, (100 * party_vote / total_vote)
}
# same main action as before
NR > 1 && NF {
total += $6
party_totals[$NF] += $6
}
# when all records are processed
END {
parties_count = homebrew_asort(party_totals, keys)
for (i = parties_count; i >= parties_count - 5; i--)
print_party_percentage(keys[i], party_totals[keys[i]], total)
}
And the output
Con: 36.8%
Lab: 30.4%
UKIP: 12.6%
LD: 7.9%
SNP: 4.7%
Green: 3.8%
You can easily mimic head -c or tail -c
with awk - if you really want to.
# head equivalent
$ awk '{print substr($0, 1, 32)}' xxx
$ head -c 32 xxx
# tail equivalent
$ awk 'END {print substr($0, length($0) - 31, 32)}' xxx
$ tail -c 32 xxx
But with awk you can also skip a few characters into a file
# no head or tail equivalent - print 32 characters starting at character 32
$ awk '{print substr($0, 32, 32)}' xxx
The awk.info website has some one-liners with extensive explanations
The gawk manual includes some one liners which are compatible with standard awk.
The Unix School: 10 examples to group data in a CSV or text file
With that, all the main awk topics have been touched on. If you want to go deeper I recommend The AWK Manual, or one of the O’Reilly books.
In this article, let us review the fundamental awk working methodology along with 7 practical awk print examples.
Note: Make sure you review our earlier Sed Tutorial Series.
Awk is a programming language which allows easy manipulation of structured data and the generation of formatted reports. Awk stands for the names of its authors: Aho, Weinberger, and Kernighan.
Awk is mostly used for pattern scanning and processing. It searches one or more files to see if they contain lines that match the specified patterns and then performs the associated actions.
Some of the key features of Awk are:
Awk reads from a file or from its standard input, and outputs to its standard output. Awk does not get along with non-text files.
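Because awk reads standard input when no file is given, it drops straight into pipelines; a minimal sketch:

```shell
# awk reads standard input when no file name is given
printf 'one two\nthree four\n' | awk '{print $2}'
```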
awk '/search pattern1/ {Actions}
/search pattern2/ {Actions}' file
In the above awk syntax, the search pattern is a regular expression that selects input lines, and Actions are the statements to be executed on each line that matches.
Let us create an employee.txt file with the following content, which will be used in the examples mentioned below.
$cat employee.txt
100 Thomas Manager Sales $5,000
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
400 Nisha Manager Marketing $9,500
500 Randy DBA Technology $6,000
By default Awk prints every line from the file.
$ awk '{print;}' employee.txt
100 Thomas Manager Sales $5,000
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
400 Nisha Manager Marketing $9,500
500 Randy DBA Technology $6,000
In the above example no pattern is given, so the actions apply to every line. The print action without any argument prints the whole line by default, so it prints every line of the file. Actions have to be enclosed within braces.
$ awk '/Thomas/
> /Nisha/' employee.txt
100 Thomas Manager Sales $5,000
400 Nisha Manager Marketing $9,500
In the above example it prints all the lines which match ‘Thomas’ or ‘Nisha’. It has two patterns. Awk accepts any number of patterns, but each set (a pattern and its corresponding actions) has to be separated by a newline.
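The same two-pattern selection can also be written as a single pattern using regex alternation; a quick sketch, assuming the same employee.txt:

```shell
# one regex with alternation instead of two separate patterns
awk '/Thomas|Nisha/' employee.txt
```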
Awk has a number of built-in variables. For each record, i.e. line, it splits the record on whitespace by default and stores the fields in the $n variables. If the line has 4 words, they will be stored in $1, $2, $3 and $4. $0 represents the whole line. NF is a built-in variable which holds the total number of fields in a record.
$ awk '{print $2,$5;}' employee.txt
Thomas $5,000
Jason $5,500
Sanjay $7,000
Nisha $9,500
Randy $6,000
$ awk '{print $2,$NF;}' employee.txt
Thomas $5,000
Jason $5,500
Sanjay $7,000
Nisha $9,500
Randy $6,000
In the above example $2 and $5 represent Name and Salary respectively. We can also get the Salary using $NF, where $NF represents the last field. In the print statement, the ‘,’ inserts the output field separator (a space by default) between the fields.
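NF itself can also be printed or tested directly; a small sketch, again assuming employee.txt:

```shell
# NF is the field count; $NF is the last field and $(NF-1) the one before it
awk '{print NF, $(NF-1), $NF}' employee.txt
```

Each line here has 5 fields, so this prints 5 followed by the department and salary columns.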
Awk has two important patterns which are specified by the keywords BEGIN and END.
BEGIN {Actions}
{ACTION} ## Action for everyline in a file
END {Actions}
## is for comments in Awk
Actions specified in the BEGIN section are executed before awk starts reading lines from the input. END actions are performed after all the lines from the input have been read and processed.
$ awk 'BEGIN {print "Name\tDesignation\tDepartment\tSalary";}
> {print $2,"\t",$3,"\t",$4,"\t",$NF;}
> END{print "Report Generated\n--------------";
> }' employee.txt
Name Designation Department Salary
Thomas Manager Sales $5,000
Jason Developer Technology $5,500
Sanjay Sysadmin Technology $7,000
Nisha Manager Marketing $9,500
Randy DBA Technology $6,000
Report Generated
--------------
In the above example, it prints a header line and a footer for the report.
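END is also the natural place for aggregates. As a sketch (assuming the same employee.txt), the salary column can be totalled once the ‘$’ and ‘,’ characters are stripped with gsub:

```shell
# gsub strips '$' and ',' in place so the last field can be summed as a number
awk '{ gsub(/[$,]/, "", $NF); total += $NF }
END { printf "Total salary: %d\n", total }' employee.txt
```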
$ awk '$1 >200' employee.txt
300 Sanjay Sysadmin Technology $7,000
400 Nisha Manager Marketing $9,500
500 Randy DBA Technology $6,000
In the above example, the first field ($1) is the employee id. So if $1 is greater than 200, the default print action prints the whole line.
Now the department name is available as the fourth field, so we need to check whether $4 matches the string “Technology”; if yes, print the line.
$ awk '$4 ~/Technology/' employee.txt
200 Jason Developer Technology $5,500
300 Sanjay Sysadmin Technology $7,000
500 Randy DBA Technology $6,000
The ~ operator compares a field against a regular expression. If it matches, the default action, i.e. printing the whole line, is performed.
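Patterns can also be combined with && and ||. For instance, a sketch selecting Technology employees whose id is above 200 (again assuming employee.txt):

```shell
# combine a numeric comparison with a regex match
awk '$1 > 200 && $4 ~ /Technology/' employee.txt
```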
The example below checks whether the department is Technology; if it is, the action just increments the count variable, which was initialized to zero in the BEGIN section.
$ awk 'BEGIN { count=0;}
$4 ~ /Technology/ { count++; }
END { print "Number of employees in Technology Dept =",count;}' employee.txt
Number of employees in Technology Dept = 3
Then at the end of processing, the END block prints the value of count, which gives the number of employees in the Technology department.
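The counting pattern generalizes with an associative array: index count by the department field and you get a per-department head count in one pass. A sketch, assuming the same employee.txt (the iteration order of ‘for (d in count)’ is unspecified, hence the sort):

```shell
# count[] is indexed by department name; sort gives the output a stable order
awk '{ count[$4]++ }
END { for (d in count) print d, count[d] }' employee.txt | sort
```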