Using BEGIN and END
We shall use the same technique of placing the awk commands in a file.
File: awk7
BEGIN { print "BEGIN" }
{ }
END { print "END" }
In the above, awk prints the word "BEGIN" before it processes any input. The data is then processed one line at a time, but the action is empty, so nothing is done. After all the data has been processed, the END block runs and prints "END" to the console.
$ awk -f awk7 marks.txt
BEGIN
END
File: awk8
BEGIN { sum=0 ; average=0; noPeople=0 }
{ sum += $3 ; noPeople++ }
END { print "Average marks:", sum/noPeople }
$ awk -f awk8 marks.txt
Average marks: 66.7
We can use user-defined variables in our awk script. The variables are similar to the variables in the shell: we do not have to declare a type. The increment operator "++" increases the value by 1. In the above we initialize the variables "sum", "average" and "noPeople" to 0. The action then runs for each line; it adds up the marks and counts the people.
Exercise:
1) Modify the command file to print:
Number of people: (Actual number of people)
Total: (Actual total marks)
Average marks: 66.7
The END block can also store the result in a variable before printing it:
BEGIN { sum=0 ; average=0; noPeople=0 }
{ sum += $3 ; noPeople++ }
END { average = sum/noPeople; print "Average marks:", average }
We are going to change the above command slightly by moving the "{" from the end of the "END" line to the next line.
File: "awk9"
BEGIN { sum=0 ; average=0; noPeople=0 }
{ sum += $3 ; noPeople++ }
END
{
    average = sum/noPeople
    print "Average marks:", average
}
$ awk -f awk9 marks.txt
awk: awk9:4: END blocks must have an action part
The opening "{" must be on the same line as BEGIN or END. Alternatively, we can end the line with a backslash "\" to continue it onto the next line.
File: awk10
BEGIN { sum=0 ; average=0; noPeople=0 }
{ sum += $3 ; noPeople++ }
END \
{
    average = sum/noPeople
    print "Average marks:", average
}
$ awk -f awk10 marks.txt
Average marks: 66.7
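The BEGIN / action / END pattern used in the files above can also be tried without a data file. The following is a minimal sketch, assuming three made-up input lines (the names and marks are invented for this example and are not taken from marks.txt). It also guards against dividing by zero when there is no input at all.

```shell
# Hypothetical sample data: three lines of "id name marks".
out=$(printf '1) Ann 70\n2) Bob 80\n3) Cid 90\n' | awk '
BEGIN { sum = 0; noPeople = 0 }    # runs once, before any input is read
      { sum += $3; noPeople++ }    # runs for every input line
END   { if (noPeople > 0) print "Average marks:", sum / noPeople }')
echo "$out"
```

With this sample data the command prints "Average marks: 80".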
Fields
Assume we have a data file called "marks1.txt"
File: "marks1.txt"
1) Amit Physics 80
2) Rahul Maths 90
3) Shyam Biology 87
4) Kedar English 85
5) Hari History 89
The fields are labelled $1, $2 and so on,
representing the first field, the second
field and so on. "$0" represents
the whole line. The "print" command
without any arguments prints the whole line.
awk '{print}' marks1.txt
or the equivalent statement:
awk '{print $0}' marks1.txt
The field number does not have to be a constant; it can be held in a variable.
$ awk 'BEGIN {var1=3} {print $var1}' marks1.txt
Physics
Maths
Biology
English
History
We can separate the items in a print statement
with commas; awk prints a space between them.
String literals must be quoted.
awk 'BEGIN {var1=3} {print $var1, " " , $(var1-1)}' marks1.txt
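As a small illustration of the points above, the sketch below runs a few print statements on the first record of marks1.txt, fed in with echo so that no data file is needed. The variable name n is only chosen for this example.

```shell
out=$(echo "1) Amit Physics 80" | awk '
BEGIN { n = 3 }
{
    print $0              # the whole line
    print $1, $2          # comma: items are separated by a space
    print $n, $(n + 1)    # field numbers computed from a variable
    print $2 "-" $4       # no comma: the strings are joined directly
}')
echo "$out"
```

This prints the whole record, then "1) Amit", then "Physics 80" and finally "Amit-80".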
Built In Variables
ENVIRON
ENVIRON is an associative array holding
info about environment variables.
$ awk 'BEGIN { print ENVIRON["USER"] }'
amittal
$ awk 'BEGIN { print ENVIRON["PATH"] }'
/usr/local/bin:/usr/bin:/usr/local/sbin:
/usr/sbin:/sbin:/users/amittal/.local/bin:
/users/amittal/bin
Notice that since the awk program is small and
does not read a data file, we can write it
directly on the command line.
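The ENVIRON lookup can be checked with a variable of our own. This is a minimal sketch; GREETING is a hypothetical variable name that is set only for the one awk command.

```shell
# GREETING is a made-up variable, visible to awk through ENVIRON.
out=$(GREETING="hello" awk 'BEGIN { print ENVIRON["GREETING"] }')
echo "$out"
```

The command prints "hello", the value that was placed in the environment.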
FS
This is the field separator. By default
its value is a single space, which makes awk
split fields on runs of spaces and tabs,
but we can change that.
$ echo "first:second:third" | awk 'BEGIN { FS=":" } { print $1,$2,$3 }'
first second third
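The field separator can be set in more than one way. The sketch below shows three variations, assuming made-up input strings: setting FS in a BEGIN block, using the standard -F command line option, and using a separator that is longer than one character.

```shell
line="first:second:third"

# Setting FS in a BEGIN block.
a=$(echo "$line" | awk 'BEGIN { FS = ":" } { print $2 }')

# The -F option sets FS in the same way.
b=$(echo "$line" | awk -F: '{ print $2 }')

# FS can be longer than one character; here it is a comma and a space.
c=$(echo "red, green, blue" | awk 'BEGIN { FS = ", " } { print $3 }')

echo "$a $b $c"
```

The three results are "second", "second" and "blue".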
RS
RS is the record separator. Usually this is the new line but we can change that.
$ echo "first:second:third" | awk 'BEGIN { RS=":" } { print $1 }'
first
second
third
$ echo "1) Amit Physics 80:2) Rahul Maths 90" | awk 'BEGIN { RS=":" } { print $2,$3 }'
Amit Physics
Rahul Maths
In the above the record separator is the ":"
instead of the new line separator.
Exercise:
echo "line1a:line1b:line1c&line2a:line2b:line2c&" | awk -f f1.cmd
Write "f1.cmd" so that RS is "&" and
FS is ":" and the output is:
line1a:line1b:line1c
line2a:line2b:line2c
NR
The "NR" variable holds the number of the current record.
$ echo "first:second:third" | awk 'BEGIN { RS=":" } { print $1, NR }'
first 1
second 2
third 3
$ cat marks1.txt | awk '{ print $2, NR }'
Amit 1
Rahul 2
Shyam 3
Kedar 4
Hari 5
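NR can also be used as a condition in front of an action. The following is a small sketch, assuming a made-up input with one header line. It uses the standard "next" statement to skip a record, and shows that NR keeps counting skipped records and is still available in the END block.

```shell
# Hypothetical input: one header line followed by two data lines.
out=$(printf 'Name Marks\nAnn 70\nBob 90\n' | awk '
NR == 1 { next }                  # skip record number 1, the header
        { print NR ": " $1 }      # NR still counts the skipped line
END     { print "records read:", NR }')
echo "$out"
```

The output is "2: Ann", "3: Bob" and "records read: 3".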
Exercise
1) Modify the original example with:
BEGIN { sum=0 ; average=0; noPeople=0 } { sum += $3 ; noPeople++ }
END { print "Average marks:", sum/noPeople }
Take out "noPeople" and use NR instead.
awk -f nr.cmd marks.txt
File: "marks2.txt"
Id Name Grade
---------------------
1) John 80
2) Peter 90
3) David 47
4) James 25
5) Lisa 89
6) Kenny 56
7) Sam 95
8) Julia 74
9) Cassie 66
10) Marelena 45
BEGIN { sum=0 ; average=0; noPeople=0 }
{ sum += $3 ; noPeople++ }
END { print "Average marks:", sum/noPeople }
Add the condition ( NR > 2) to the
above command so that the first 2 lines
are skipped when doing the calculations.
Solution
1)
BEGIN { sum=0 ; average=0 } { sum += $3 }
END { print "Average marks:", sum/NR }
NF
The "NF" variable holds the number of fields
in a record. We can use this to grab the
last field from a record.
$ echo "first second third" | awk '{ print $NF }'
third
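Since NF is just a number, it can be used in expressions. A short sketch with two made-up input lines of different lengths:

```shell
# Print the field count, the last field and the field before it.
out=$(printf 'a b c\nd e\n' | awk '{ print NF, $NF, $(NF - 1) }')
echo "$out"
```

The first line gives "3 c b" and the second gives "2 e d".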
Exercise
1) Use NF and the condition ( NR > 2 )
to print just the grade from the previous example.
printf
The "printf" function lets us control the output with format specifiers. It is very powerful and has extra features that the "print" statement does not have.
File: p1.cmd
{ printf( "%10s%10s%10s\n", $1 , $2 , $3 ) }
$ cat marks.txt | awk -f p1.cmd
        1)      John        80
        2)     Peter        90
        3)     David        47
        4)     James        25
        5)      Lisa        89
        6)     Kenny        56
        7)       Sam        95
        8)     Julia        74
        9)    Cassie        66
       10)  Marelena        45
We specify a place holder in the first argument (the format string) by using the percent symbol, and supply the values after the first argument. In the above, $1 is used for the first "%10s", $2 for the second and $3 for the third. The "s" means the value is a string. We must supply the same number of values as there are place holders. The "10" means reserve 10 columns for the string; if the string is shorter, it is padded with spaces on the left. This helps in aligning the values. The format string does not have to contain any place holders.
File: p2.cmd
{ printf( "Id Name Marks\n" ) }
$ echo "" | awk -f p2.cmd
Id Name Marks
The "print" statement prints a new line by default but "printf" does not. We can use the usual backslash escape sequences: "\n" for a new line and "\t" for a tab.
Format Specifiers
%c    ASCII character
%d    Decimal integer
%e    Floating point number in scientific notation
%f    Floating point number
%g    The shorter of e or f
%o    Octal
%s    String
%x    Hexadecimal
%%    Literal %
We do not declare types in the awk language, but the value we pass should match the specifier in the "printf" string. If we state "%s" then we should supply a string. We saw how the statement:
{ printf( "%10s%10s%10s\n", $1 , $2 , $3 ) }
allocated a width of 10 for each string, with the padding spaces on the left. If we want the string on the left hand side with the padding on the right, we use the "-10" notation.
File: "p2.cmd"
{ printf( "%-10s%-10s\n", $1 , $2 ) }
$ awk -f p2.cmd marks.txt
1)        John
2)        Peter
3)        David
4)        James
5)        Lisa
6)        Kenny
7)        Sam
8)        Julia
9)        Cassie
10)       Marelena
We can also restrict the number of decimal places with a specifier such as ".2f".
$ echo "" | awk '{ printf("%.2f" , 3.41256) }'
3.41
In the above we are stating that the floating point value should be printed with 2 digits after the decimal point.
Exercise:
1) Write an awk command in file "pr4.cmd". Create a file "pr4.sh" that has the following line.
File: "pr4.sh"
cat marks.txt | awk -f pr4.cmd
Run the file "./pr4.sh" to produce the output:
$ ./pr4.sh
Id Name Marks
1)--John--80
2)--Peter--90
3)--David--47
4)--James--25
5)--Lisa--89
6)--Kenny--56
7)--Sam--95
8)--Julia--74
9)--Cassie--66
10)--Marelena--45
2) The int function keeps the whole number and throws away the fractional part. It can be used as int( 3.142 ). Use printf to change the following file.
File: "data1.txt"
1.5 3.1425
14.23 7.5678
3.7 8.6523
4.9 9.4567
to
1 3.14
14 7.57
3 8.65
4 9.46
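The width, alignment and precision notations described in this section can be compared side by side. The sketch below is only an illustration with made-up values; the square brackets are there to make the padding visible.

```shell
out=$(awk 'BEGIN {
    printf("[%5s]\n",  "ab")       # right aligned in 5 columns
    printf("[%-5s]\n", "ab")       # left aligned in 5 columns
    printf("[%d]\n",   3.99)       # %d drops the fractional part
    printf("[%.2f]\n", 3.14159)    # 2 digits after the decimal point
}')
echo "$out"
```

Because the program only has a BEGIN block, no input is needed to run it.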
Strings
Concatenation of strings. There is no explicit operator to join strings. All we have to do is write the strings next to each other.
File: s1.cmd
{
    str1="Ajay" "Mittal"
    print str1
    str1="Ajay"
    str1 = str1 " " "Ajay"
    print str1
}
$ echo "" | awk -f s1.cmd
AjayMittal
Ajay Ajay
Even though "s1.cmd" does not really need any input, the action only runs when awk reads a record, so we give it a blank line. The expression str1 " " "Ajay" joins 3 strings: the contents of the variable str1, a blank space and the string "Ajay".
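Concatenation also works with variables and with numbers. A minimal sketch, where the variable names first, last and full are chosen only for this example:

```shell
out=$(awk 'BEGIN {
    first = "Ajay"
    last  = "Mittal"
    full  = first " " last           # three strings joined into one
    print full
    print "Length: " length(full)    # the number is converted to a string
    print 1 2                        # numbers can be joined as well
}')
echo "$out"
```

The last statement prints "12", because the two numbers are joined as strings and not added.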
File: s2.cmd
{
    str1="table"
    str2 = ""
    l1 = length( str1 )
    for( i1=l1; i1 > 0 ; i1-- )
    {
        #print i1
        str2 = str2 substr( str1, i1, 1 )
        #print str2
    }
    print str2
}
$ echo "" | awk -f s2.cmd
elbat
The above code reverses the word in the variable "str1". The function substr takes 3 arguments. The first argument is the string. The second argument is the position at which the sub string starts (positions start at 1), and the third argument is the number of characters to grab. If "str1" contains the string "table" then some possible examples are:
substr( str1, 1, 1 )
Result is "t". Position is 1 and we grab 1 character.
substr( str1, 1, 3 )
Result is "tab". Position is 1 and we grab 3 characters.
substr( str1, 2 )
Result is "able". Position is 2 and we grab the rest of the characters in the string.
We are also using a "for" loop in this example:
for( i1=l1; i1 > 0 ; i1-- )
A loop repeats the statements inside its block. We have the initialization statement "i1=l1", then the check "i1 > 0". The loop executes the statements as long as the condition is true. Finally we have the update statement "i1--", which runs each time after the block of the loop has been executed.
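The same combination of length, substr and a for loop can be used to examine a string one character at a time. The sketch below is an illustration only; the word "banana" and the variable name count are made up for the example.

```shell
out=$(awk 'BEGIN {
    str1 = "banana"
    count = 0
    for (i1 = 1; i1 <= length(str1); i1++) {
        if (substr(str1, i1, 1) == "a")    # the single character at i1
            count++
    }
    print count, substr(str1, 3), substr(str1, 2, 3)
}')
echo "$out"
```

It counts 3 occurrences of "a" and prints the sub strings "nana" and "ana".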
File: s3.cmd
{
    str1="wood table"
    split ( str1 , arr1, " " )
    print arr1[1]
    print arr1[2]
}
$ echo "" | awk -f s3.cmd
wood
table
The "split" function splits the input string at the separator given in the third argument and places the pieces into an array that is indexed by numbers starting at 1.
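The split function also returns the number of pieces, which is convenient for looping over the result. A small sketch, assuming a made-up date string and a "-" separator:

```shell
out=$(awk 'BEGIN {
    n = split("2024-01-15", parts, "-")    # n is the number of pieces
    print n
    for (i1 = 1; i1 <= n; i1++)            # indexes start at 1
        print i1, parts[i1]
}')
echo "$out"
```

Here n is 3 and the pieces are "2024", "01" and "15".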
Comments
Comments in awk are preceded by the hash symbol "#".
Exercise
1) Add some comments after the BEGIN part but before the action part in any of the previous exercises.
Control Flow
If condition
Let us modify our "marks2.txt" slightly.
File: "marks2.txt"
1) John M 80
2) Peter M 90
3) David M 47
4) James M 25
5) Lisa F 89
6) Kenny M 56
7) Sam M 95
8) Julia F 74
9) Cassie F 66
10) Marelena F 45
and our awk command:
File: awk11
BEGIN { sum=0 ; average=0; noPeople=0 }
{
    if ( $3 == "F" )
    {
        print $0
        noPeople++
        sum += $4 ;
    }
}
END { print "Average marks:", sum/noPeople }
$ awk -f awk11 marks2.txt
5) Lisa F 89
8) Julia F 74
9) Cassie F 66
10) Marelena F 45
Average marks: 68.5
We can use the semicolon to separate statements. If a statement is on one line and the next statement is on another line then the semicolon is not necessary.
Using the "if" with "else if"
File: awk12
BEGIN {
    sum1=0 ; average1=0; noPeople1=0
    sum2=0 ; average2=0; noPeople2=0
}
{
    if ( $3 == "F" )
    {
        print $0
        noPeople1++
        sum1 += $4 ;
    }
    else if ( $3 == "M" )
    {
        print $0
        noPeople2++
        sum2 += $4 ;
    }
}
END {
    print "Average marks for F:", sum1/noPeople1
    print "Average marks for M:", sum2/noPeople2
}
$ awk -f awk12 marks2.txt
1) John M 80
2) Peter M 90
3) David M 47
4) James M 25
5) Lisa F 89
6) Kenny M 56
7) Sam M 95
8) Julia F 74
9) Cassie F 66
10) Marelena F 45
Average marks for F: 68.5
Average marks for M: 65.5
Exercise
Using the above data file determine the person with the highest marks and the person with the lowest marks. The output should be:
Sam has the highest mark of 95.
James has the lowest mark of 25.
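An "if" / "else if" chain can end with a plain "else" that catches everything not matched earlier. The following is a minimal sketch; the three input lines and the thresholds 75 and 40 are made up for the example.

```shell
out=$(printf 'Ann 80\nBob 47\nCid 25\n' | awk '{
    if ($2 >= 75)
        grade = "high"
    else if ($2 >= 40)
        grade = "pass"
    else
        grade = "fail"
    print $1, grade
}')
echo "$out"
```

The conditions are tested from top to bottom and only the first one that is true is used.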
Loops
The for loop has the structure:
for( Initialization ; Condition ; Post )
{
    # Body of the loop
}
The "Initialization" part is run once
and can be used to initialize variables.
The "Condition" part is tested and if it
is true the body of the loop is executed.
After the body has been executed
the "Post" statement is run, after
which the condition is tested
again, and so on until the condition
becomes false.
Ex:
File: loop1.cmd
{
    print "For loop"
    for( i1=0 ; i1<3 ; i1++)
        print i1
}
$ echo "" | awk -f loop1.cmd
For loop
0
1
2
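A for loop is often used to accumulate a value. The sketch below adds the numbers 1 to 5; it places the loop in a BEGIN block, which is one way to run a loop without piping in an empty line. The variable name total is chosen only for this example.

```shell
out=$(awk 'BEGIN {
    total = 0
    for (i1 = 1; i1 <= 5; i1++)
        total += i1
    print "Sum of 1 to 5:", total
}')
echo "$out"
```

The result is "Sum of 1 to 5: 15".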
Ex:
File: loop2.cmd
{
    ind1 = 2 ;
    ind2 = $0 - 1
    #print ind2 ;
    isPrime = 1 ;
    for ( ; ind1 <= ind2 ; ind1++ )
    {
        if ( $0 % ind1 == 0 )
            isPrime = 0 ;
    }
    if ( isPrime == 1 && length( $0 ) > 0 )
    {
        printf "%d is a prime number\n", $1
    }
}
File: data_prime.txt
20
21
23
17
7
8
9
$ awk -f loop2.cmd data_prime.txt
23 is a prime number
17 is a prime number
7 is a prime number
There is another notation for looping
through an array and that is:
for( i1 in array )
    do something
For each item in the array the
variable "i1" takes on the value of
the item's index, and we can access
the array value with the notation
"array[i1]". The order in which the
items are visited is not specified.
File: awk_states
BEGIN {
    state["Dublin"] = "California"
    state["Reno"] = "Nevada"
    state["San Jose"] = "California"
    state["Oakland"] = "California"
    state["Las Vegas"] = "Nevada"
    for( str1 in state )
        print str1 , state[str1]
}
$ awk -f awk_states
Reno Nevada
Dublin California
Las Vegas Nevada
San Jose California
Oakland California
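A common use of the "for ... in" notation is counting how often each value occurs. The sketch below is an illustration under the assumption of three made-up "city state" lines. Because the visiting order of "for ... in" is not guaranteed, the result is passed through sort to get a stable order.

```shell
out=$(printf 'Dublin California\nReno Nevada\nOakland California\n' | awk '
{ count[$2]++ }                   # the state name is the array index
END {
    for (name in count)           # visiting order is not guaranteed
        print name, count[name]
}' | sort)
echo "$out"
```

After sorting, the output is "California 2" followed by "Nevada 1".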
While loops
The structure of the while loop is
while ( condition )
{
Body
}
As long as the "condition" is true the body
of the loop is executed. It is similar to the
for loop.
File: while1.cmd
{
    print "While loop"
    i1=0
    while ( i1 < 3 )
    {
        print i1
        i1++
    }
}
$ echo "" | awk -f while1.cmd
While loop
0
1
2
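A while loop is a natural fit when the number of repetitions is not known in advance. The sketch below, with made-up variable names n and steps, counts how many times 100 can be halved (dropping the fraction with int) before it reaches 1.

```shell
out=$(awk 'BEGIN {
    n = 100
    steps = 0
    while (n > 1) {           # stop once n reaches 1
        n = int(n / 2)
        steps++
    }
    print steps
}')
echo "$out"
```

The values of n are 50, 25, 12, 6, 3 and 1, so the loop body runs 6 times.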
Exercise
1)
Modify the prime number example to use a "while" loop
instead of a "for" loop.
Arrays
Let's assume we have a file called "cities.txt".
File: "cities.txt"
1) "Dublin"
2) "Reno"
3) "San Jose"
4) "Oakland"
5) "Las Vegas"
File: "cities"
{ print $2 }
$ awk -f cities cities.txt
"Dublin"
"Reno"
"San
"Oakland"
"Las
We see that the output is not what we want, because the default field separator also splits the city names at the space. With the default field separator there isn't an easy way to fix this in awk. What we can do is change the field separator in the data with the sed command.
File: "convert_cities.sh"
cat cities.txt | sed -r 's/[ ]+/|/' > cities1.txt
File: cities1.txt
1)|"Dublin"
2)|"Reno"
3)|"San Jose"
4)|"Oakland"
5)|"Las Vegas"
We have replaced the first run of spaces on each line with the pipe character "|". Another way of getting around this problem is to use the quotation mark as the field separator character.
File: cities1
BEGIN { FS="|" }
{ print $2 }
In the BEGIN section we set our field separator to "|" with the command:
FS="|"
$ awk -f cities1 cities1.txt
"Dublin"
"Reno"
"San Jose"
"Oakland"
"Las Vegas"
We do not have to declare an array or its size. Arrays in awk are associative, which means a subscript can be a number or a string.
Exercise:
Write a command file that prints the cities using the quotation mark as the separator.
We can of course use awk arrays in the traditional sense. The Fibonacci series is of the form 1,1,2,3,5,8. We start out with 2 numbers ( 1 and 1 ) and each following number is the sum of the previous 2 numbers.
File: fib1
BEGIN {
    # holder[1] to holder[10] store the Fibonacci numbers
    holder[1] = 1
    holder[2] = 1
    for( i1=3 ; i1<=10 ; i1++ )
    {
        holder[ i1 ] = holder[i1-1] + holder[i1-2]
    }
    for( i1=1 ; i1<=10 ; i1++ )
    {
        print holder[i1]
    }
}
$ awk -f fib1
1
1
2
3
5
8
13
21
34
55
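Two further array operations are often useful alongside the ones above: testing whether an index exists with "in", and removing an item with "delete". The sketch below is a small illustration that reuses two names and marks from marks1.txt; the array name marks is chosen only for this example.

```shell
out=$(awk 'BEGIN {
    marks["Amit"] = 80
    marks["Rahul"] = 90
    if ("Amit" in marks)    print "Amit found"
    if (!("Hari" in marks)) print "Hari missing"
    delete marks["Amit"]           # remove one item
    n = 0
    for (k in marks) n++           # count the items that are left
    print n
}')
echo "$out"
```

Testing with "in" does not create the item, whereas merely referring to marks["Hari"] would create an empty one.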
Awk functions
Awk provides built-in functions, and we can also define our own functions if we wish.
File: awk13
BEGIN {
    arr[0] = "Three"
    arr[1] = "One"
    arr[2] = "Two"
    print "Array elements before sorting:"
    for (i1 in arr) {
        print arr[i1]
    }
    asort(arr)
    print "Array elements after sorting:"
    for (i1 in arr) {
        print arr[i1] , length( arr[i1] )
    }
}
$ awk -f awk13
Array elements before sorting:
Three
One
Two
Array elements after sorting:
One 3
Three 5
Two 3
We are using the "asort" function to sort the array values and the "length" function to obtain the length of each string. Note that "asort" is a GNU awk (gawk) extension and is not available in every awk. In the example below we have written a function that returns 1 if the number passed to it as the argument is a prime number.
File: awk14
function isPrimeNo( num1 )
{
    # Test the argument num1 rather than reading $0 directly
    ind1 = 2 ;
    ind2 = num1 - 1
    #print ind2 ;
    isPrime = 1 ;
    for ( ; ind1 <= ind2 ; ind1++ )
    {
        if ( num1 % ind1 == 0 )
            isPrime = 0 ;
    }
    if ( isPrime == 1 && length( num1 ) > 0 )
    {
        #print num1, " is a prime number."
        return 1
    }
    return 0
}
{
    if ( isPrimeNo( $0 ) == 1 )
        print $0 " is a prime number."
}
File: data14.txt
20
21
23
17
7
8
9
$ awk -f awk14 data14.txt
23 is a prime number.
17 is a prime number.
7 is a prime number.
Exercise
1) File: "data15.txt"
2 3
4 2
3 3
10 3
Use the file "power1.cmd" to fill in the function for power.
File: "power1.cmd"
function power( num1 , num2 )
{
    # TO DO
}
{
    value=power($1, $2)
    print $1, $2 , value
}
$ cat data15.txt | awk -f power1.cmd
2 3 8
4 2 16
3 3 27
10 3 1000
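Variables used inside an awk function are global unless they are listed as parameters. A common convention is to add extra parameters that the caller never passes, so that they act as local variables. The sketch below shows this with a hypothetical function named larger and two made-up input lines; it is an illustration of the convention and not a solution to the exercise above.

```shell
out=$(printf '3 9\n7 2\n' | awk '
# The extra parameter "result" is never passed; it acts as a local variable.
function larger(a, b,    result) {
    result = a
    if (b > a)
        result = b
    return result
}
{ print larger($1 + 0, $2 + 0) }    # adding 0 forces a numeric value
')
echo "$out"
```

For the two input lines the function returns 9 and 7.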