Contents

Using BEGIN and END
Fields
Built In Variables
printf
Strings
Control Flow
Loops
Arrays
Awk functions

Using BEGIN and END

We shall use the same technique of placing the awk commands in a file.




File: awk7

BEGIN { print "BEGIN" }  {  } END { print "END"  }

In the above, before awk processes any input
it runs the BEGIN block and prints the word "BEGIN".
The data is then processed one line at a time, but
the action that we have is empty so nothing is done.
After all the data has been processed the END block
runs and prints "END" to the console.

$ awk -f awk7 marks.txt
BEGIN
END




File: awk8

BEGIN { sum=0 ; average=0; noPeople=0 }  { sum += $3 ; noPeople++ }

END { print "Average marks:", sum/noPeople  }

$ awk -f awk8 marks.txt
Average marks: 66.7

We can use user defined variables in our awk
script. The variables are similar to the variables
in the shell: we do not have to declare a type.
The increment operator "++" increases the value by 1.
In the above we initialize the variables "sum", "average"
and "noPeople" to 0 in the BEGIN block ("average" is
not used until a later example). Then our action runs
for each line. It sums up the marks and counts the
people.
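
As a side note, awk treats a variable that has never been assigned as 0 in arithmetic, so the BEGIN initialization is optional. The following is a minimal sketch under assumptions: the three records are made up to stand in for marks.txt, and the counter name "n" is hypothetical.

```shell
# Three made-up records in the same layout as marks.txt: id, name, marks.
out=$(printf '1) John 80\n2) Peter 90\n3) David 46\n' |
  awk '{ sum += $3; n++ }     # sum and n start at 0 without a BEGIN block
       END { if (n > 0) print "Average marks:", sum / n }')
echo "$out"
```

The "n > 0" test is there only to avoid a division by zero when the input is empty.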

Exercise:
1) Modify the command file to print:

Number of people:   (Actual number of people)
Total: ( Actual total marks )
Average marks: 66.7


BEGIN { sum=0 ; average=0; noPeople=0 }
{ sum += $3 ; noPeople++ }
END { average =  sum/noPeople; print "Average marks:",  average }

We are going to change the above command slightly
by moving the "{" that follows "END" to the next line.

File: "awk9"
BEGIN { sum=0 ; average=0; noPeople=0 }
{ sum += $3 ; noPeople++ }
END
{
average =  sum/noPeople
print "Average marks:",  average }


$  awk -f awk9 marks.txt
awk: awk9:4: END blocks must have an action part

The opening "{" must be on the same line as BEGIN
or END. If we want the "{" on the next line we can
end the line with a backslash "\" to continue it.


File: awk10


BEGIN { sum=0 ; average=0; noPeople=0 }  { sum += $3 ; noPeople++ }
END \
 {
average =  sum/noPeople
print "Average marks:",  average }

$  awk -f awk10 marks.txt
Average marks: 66.7

Fields

Assume we have a data file called "marks1.txt"

File: "marks1.txt"

1) Amit     Physics    80
2) Rahul    Maths      90
3) Shyam    Biology    87
4) Kedar    English    85
5) Hari     History    89

The fields are referred to as $1, $2
and so on for the first field, the second
field and so on. "$0" represents
the whole line. The "print" command
without any arguments prints the whole line.

awk '{print}' marks1.txt

or the equivalent statement:

awk '{print $0}' marks1.txt

The field number does not have to be a constant.
It can come from a variable:

$ awk 'BEGIN {var1=3}  {print $var1}' marks1.txt
Physics
Maths
Biology
English
History


We can separate the items in a print statement
with a comma. The items are then printed with a
space between them. String literals must be quoted.

awk 'BEGIN {var1=3}  {print $var1, "  " , $(var1-1)}' marks1.txt
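
To make the effect of the comma visible, here is a small sketch that prints the same two fields with and without it. The single input line is taken from marks1.txt; nothing else is assumed.

```shell
out=$(echo '1) Amit     Physics    80' |
  awk '{ print $2, $3       # comma: joined by the output separator, a space by default
         print $2 $3        # no comma: the two strings are concatenated
         print $2 "-" $3 }')
echo "$out"
```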

Built In Variables

ENVIRON


ENVIRON is an associative array holding
the environment variables, indexed by name.

$ awk 'BEGIN { print ENVIRON["USER"] }'
amittal

$ awk 'BEGIN { print ENVIRON["PATH"] }'

/usr/local/bin:/usr/bin:/usr/local/sbin:
/usr/sbin:/sbin:/users/amittal/.local/bin:
/users/amittal/bin

Notice that since the awk command is small and
has only a BEGIN block, awk reads no data file
and we can type the whole program on the
command line.

FS

This is the field separator. By default
its value is a single space, which makes awk
split fields on any run of spaces or tabs,
but we can change that.

$ echo "first:second:third" | awk 'BEGIN { FS=":" } { print $1,$2,$3 }'
first second third
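
The separator can also be given on the command line with the "-F" option instead of a BEGIN block. A small sketch with made-up comma separated input:

```shell
# -F, sets FS to a comma before any input is read.
out=$(echo 'first,second,third' | awk -F, '{ print $2, NF }')
echo "$out"
```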

RS

RS is the record separator. Usually this is the newline character but we can change that.

$ echo "first:second:third" | awk 'BEGIN { RS=":" } { print $1 }'
first
second
third

$ echo "1) Amit     Physics    80:2) Rahul    Maths      90" | awk 'BEGIN { RS=":" } { print $2,$3 }'
Amit Physics
Rahul Maths

In the above, the record separator is ":"
instead of the newline.

Exercise:

echo "line1a:line1b:line1c&line2a:line2b:line2c&" | awk -f f1.cmd

Write "f1.cmd" to have the RS
as & and FS as : to print the output as:

line1a:line1b:line1c
line2a:line2b:line2c


NR

The "NR" variable holds the number of the
current record.

$  echo "first:second:third" | awk 'BEGIN { RS=":" } { print $1, NR }'
first 1
second 2
third 3

$ cat marks1.txt | awk '{ print $2, NR }'
Amit 1
Rahul 2
Shyam 3
Kedar 4
Hari 5
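
NR can also be used in the pattern in front of an action, and in the END block it still holds the number of the last record read. A short sketch with made-up three line input:

```shell
out=$(printf 'a\nb\nc\n' |
  awk 'NR == 2 { print "record", NR, "is", $0 }
       END     { print "total records:", NR }')
echo "$out"
```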

Exercise

1) Modify the original example with:

BEGIN { sum=0 ; average=0; noPeople=0 }  { sum += $3 ; noPeople++ }
END { print "Average marks:", sum/noPeople  }

Take out the "noPeople" and instead use NR .
awk -f nr.cmd marks.txt

2) File: "marks2.txt"

Id    Name        Grade
---------------------
1)    John        80
2)    Peter       90
3)    David       47
4)    James       25
5)    Lisa        89
6)    Kenny       56
7)    Sam         95
8)    Julia       74
9)    Cassie      66
10)   Marelena    45

BEGIN { sum=0 ; average=0; noPeople=0 }
{ sum += $3 ; noPeople++ }
END { print "Average marks:", sum/noPeople  }

Add the condition ( NR > 2) to the
above command so that the first 2 lines
are skipped when doing the calculations.


Solution
1)
BEGIN { sum=0 ; average=0 }  { sum += $3  }
END { print "Average marks:", sum/NR  }

NF

"NF" holds the number of fields
in the current record. We can use this to
grab the last field from a record.

$ echo "first second third" | awk '{ print $NF }'
third
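
Because "$" accepts any expression, NF can also be used in arithmetic to count fields from the end. A minimal sketch with two made-up lines of different lengths:

```shell
# Prints the field count, the last field and the field before it.
out=$(printf 'a b c\nd e\n' | awk '{ print NF, $NF, $(NF-1) }')
echo "$out"
```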

Exercise

1)Use NF and the condition ( NR > 2 )
to just print the grade from the previous example.

printf

The "printf" function allows us to specify format specifiers. The "printf" function is very powerful and has extra features that are not there in the "print" function.



File: p1.cmd


{ printf( "%10s%10s%10s\n", $1 , $2 , $3 ) }

$ cat marks.txt | awk -f p1.cmd
        1)      John       80
        2)     Peter       90
        3)     David       47
        4)     James       25
        5)      Lisa       89
        6)     Kenny       56
        7)       Sam       95
        8)     Julia       74
        9)    Cassie       66
       10)  Marelena       45

We can specify a place holder in the first
argument (the format string) by using the
percent symbol. The values to print follow
the format string, and each place holder takes
the next value in order. In the above the
first "%10s" is used for $1. The "s" means the
value is a string. We must have the same
number of values as place holders. The "10"
means a minimum width of 10 characters. If
the string is shorter then it is padded with
spaces on the left.
This can help in aligning the values.



The format string does not need to contain any place holders.


File: p2.cmd


{ printf( "Id Name Marks\n" ) }

$ echo ""  | awk -f p2.cmd
Id Name Marks

The function "print" adds a new line at the
end by default but "printf" does not do that.

We can use the usual backslash escape sequences:
"\n" to represent a new line and "\t"
to represent a tab.

Format Specifiers

%c ASCII Character
%d Decimal integer
%e Floating point number in scientific notation
%f Floating point number in fixed notation
%g The shorter of %e or %f
%o Octal
%s String
%x Hexadecimal
%% Literal %

Awk variables are not declared with a type.
A value is converted to match the specifier
in the "printf" string: "%s" prints the value
as a string and "%d" prints it as an integer.
A value that does not look like a number is
printed as 0 by "%d", so the specifier should
match the kind of data we supply.
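
The specifiers in the table can be tried out in a BEGIN block. This is only a sketch with made-up values; the "|" characters are there just to make the field boundaries visible.

```shell
out=$(awk 'BEGIN {
    printf "%d|%o|%x|%c|%s|%%\n", 255, 8, 255, "A", "text"
    printf "%f|%e|%g\n", 1234.5, 1234.5, 1234.5
    printf "%d\n", "abc"      # a non-numeric string converts to 0
}')
echo "$out"
```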

We saw how the statement:

{ printf( "%10s%10s%10s\n", $1 , $2 , $3 ) }

allocated a width of 10 for the string. The
spaces are padded on the left. If we want
the string to be on the left hand side with
the spaces padded on the right then we use
the "-10" notation.

File: "p2.cmd"

{ printf( "%-10s%-10s\n", $1 , $2  ) }
$  awk -f p2.cmd marks.txt
1)        John
2)        Peter
3)        David
4)        James
5)        Lisa
6)        Kenny
7)        Sam
8)        Julia
9)        Cassie
10)       Marelena

We can also restrict the number of decimal places with a specifier such as "%.2f".

$ echo "" | awk '{ printf("%.2f" , 3.41256) }'
3.41

In the above we are stating that the floating point
value should be printed with exactly 2 digits after
the decimal point. The value is rounded to fit.
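
Width, left alignment and precision can be combined in one format string. The sketch below uses made-up values and puts square brackets around each field so that the padding can be seen.

```shell
# %6s pads on the left, %-6s pads on the right, %6.2f is width 6 with 2 decimals.
out=$(awk 'BEGIN { printf "[%6s][%-6s][%6.2f]\n", "ab", "ab", 3.14159 }')
echo "$out"
```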

Exercise:

1) Write an awk command in file "pr4.cmd".
Create a file "pr4.sh" that will have
the following line.

File: "pr4.sh"
cat marks.txt  | awk -f pr4.cmd
Run the file "./pr4.sh" to produce the output:
$ ./pr4.sh

Id         Name          Marks

1)--John--80
2)--Peter--90
3)--David--47
4)--James--25
5)--Lisa--89
6)--Kenny--56
7)--Sam--95
8)--Julia--74
9)--Cassie--66
10)--Marelena--45

2)
The int function can be used to
keep the whole number and throw away
the fractional part. It can be
used as int( 3.142 ). Use
printf to change the following file.

File: "data1.txt"
1.5     3.1425
14.23    7.5678
3.7     8.6523
4.9     9.4567

to

1     3.14
14    7.57
3     8.65
4     9.46

Strings

Concatenation of strings.
There is no explicit operator to join strings. All we have to do is write the strings next to each other.



File: s1.cmd


 {
  str1="Ajay" "Mittal"
  print str1
  str1="Ajay"
  str1 = str1 " " "Ajay"
  print str1
}

$ echo "" | awk -f s1.cmd
AjayMittal
Ajay Ajay

Even though "s1.cmd" does not really
need any input, its action only runs when a
record is read, so we give the awk command
a single blank line.


The expression

str1 " " "Ajay"

joins 3 strings: the contents of
the string str1, a blank space
and the string "Ajay" .



File: s2.cmd

 {
  str1="table"
  str2 = ""
  l1 = length( str1 )
  for( i1=l1; i1 > 0 ; i1-- )
   {
     #print i1
     str2 = str2 substr( str1, i1, 1 )
     #print str2
   }
  print str2

 }

$  echo "" | awk -f s2.cmd
elbat

The above code reverses the word in the
variable "str1". The function substr
takes up to 3 arguments. The first argument is
the string. The second argument is
the position that we need to grab
the sub string from (positions start at 1)
and the third argument is the number
of characters we need to grab. If the
third argument is left out we get the
rest of the string.
If "str1" contains the string
"table" then some possible examples are:

substr( str1, 1, 1 )    Result is "t"
Position is 1 and we grab 1 character.

substr( str1, 1, 3 )    Result is "tab"
Position is 1 and we grab 3 characters.

substr( str1, 2 )       Result is "able"
Position is 2 and we grab rest of
the characters in the string.
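
The three calls above can be checked directly in a BEGIN block. This sketch only repeats the examples from the text and adds a call to "length" for comparison.

```shell
out=$(awk 'BEGIN {
    str1 = "table"
    print substr(str1, 1, 1)    # t
    print substr(str1, 1, 3)    # tab
    print substr(str1, 2)       # able
    print length(str1)          # 5
}')
echo "$out"
```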

We are also using a "for" loop in this
example:

  for( i1=l1; i1 > 0 ; i1-- )

A loop repeats the statements inside its
block. We have the initialization statement:

"i1=l1"

Then we have the check
"i1>0"

The loop executes the statements as long as
the condition is true and then we have
the update statement:

i1--

This runs each time after the block of the
loop has been executed.


File: s3.cmd


 {
  str1="wood table"
  split ( str1 , arr1, " " )
  print arr1[1]
  print arr1[2]
 }

$ echo "" | awk -f s3.cmd
wood
table

The "split" function splits the input
string at the separator given as the third
argument and places the pieces into an
array that is indexed by numbers
starting at 1.
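
"split" also returns the number of pieces, which is handy as the upper limit of a loop. A small sketch with a made-up colon separated string; the names "n" and "parts" are hypothetical.

```shell
out=$(awk 'BEGIN {
    n = split("red:green:blue", parts, ":")   # n is the number of pieces
    for (i = 1; i <= n; i++)
        print i, parts[i]
}')
echo "$out"
```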

Comments
Comments in awk start with the hash symbol "#" and run to the end of the line.

Exercise
1)

Add some comments after the BEGIN part
but before the action part in any of
the previous exercises.

Control Flow

If condition

Let us modify our "marks2.txt" slightly.

File: "marks2.txt"

1)    John      M  80
2)    Peter     M  90
3)    David     M  47
4)    James     M  25
5)    Lisa      F  89
6)    Kenny     M  56
7)    Sam       M  95
8)    Julia     F  74
9)    Cassie    F  66
10)   Marelena  F  45

and our awk command:


File: awk11

BEGIN { sum=0 ; average=0; noPeople=0 }  {
if ( $3 == "F" )
 {
   print $0
   noPeople++
   sum += $4 ;
 }

}
END { print "Average marks:", sum/noPeople  }

$ awk -f awk11 marks2.txt
5)    Lisa      F  89
8)    Julia     F  74
9)    Cassie    F  66
10)   Marelena  F  45
Average marks: 68.5


We can use the semicolon to separate
statements on the same line. If each
statement is on its own line then the
semicolon is not necessary.
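
The same kind of selection can also be written as a pattern in front of the action instead of an "if" inside it. This is a sketch of that alternative, not a replacement for the file above; the three input lines are a made-up subset of marks2.txt and the counter name "n" is hypothetical.

```shell
out=$(printf '5) Lisa F 89\n6) Kenny M 56\n8) Julia F 74\n' |
  awk '$3 == "F" { print $2; n++; sum += $4 }
       END { if (n > 0) print "Average marks:", sum / n }')
echo "$out"
```

The action runs only for records where the pattern is true, so no "if" is needed.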

Using the "if" with "else if"

File: awk12

BEGIN { sum1=0 ; average1=0; noPeople1=0
   sum2=0 ; average2=0; noPeople2=0
 }

{
  if ( $3 == "F" )
   {
   print $0
   noPeople1++
   sum1 += $4 ;
  }
 else if ( $3 == "M" )
  {
   print $0
   noPeople2++
   sum2 += $4 ;
  }

}

END { print "Average marks for F:", sum1/noPeople1
print "Average marks for M:", sum2/noPeople2
  }



$  awk -f awk12 marks2.txt
1)    John      M  80
2)    Peter     M  90
3)    David     M  47
4)    James     M  25
5)    Lisa      F  89
6)    Kenny     M  56
7)    Sam       M  95
8)    Julia     F  74
9)    Cassie    F  66
10)   Marelena  F  45
Average marks for F: 68.5
Average marks for M: 65.5

Exercise

Using the above data file determine the
person with the highest mark and the
person with the lowest mark.

Sam has the highest mark of 95.
James has the lowest mark of 25.

Loops

The for loop has the structure:

for(  Initial ; Condition ; Post )
    {
        # Body of the loop
    }

The "Initial" part is run once
and can be used to initialize variables.
The condition part is tested and if
true the body of the loop is executed.
After the body has been executed
the post statement is run. After
which the condition is tested
again and so on till the condition
becomes false.

Ex:


File: loop1.cmd

 {
   print "For loop"
   for( i1=0 ; i1<3 ; i1++)
     print i1
 }

$ echo "" | awk -f loop1.cmd
For loop
0
1
2


Ex:

File: loop2.cmd

{
  ind1 = 2 ;
  ind2 = $0 - 1
  #print ind2 ;
  isPrime = 1 ;
  for ( ; ind1 <= ind2 ; ind1++ )
   {
     if ( $0 % ind1 == 0 )
      isPrime = 0 ;
   }

  # Blank lines are skipped and numbers below 2 are not prime.
  if ( isPrime == 1 && length( $0 ) > 0 && $0 > 1 )
   {
     printf "%d is a prime number\n", $0
   }

}


File: data_prime.txt

20
21
23
17
7
8
9


$ awk -f loop2.cmd data_prime.txt
23 is a prime number
17 is a prime number
7 is a prime number

There is another notation for looping
through an array and that is:

for( i1 in array )
  do something

For each item in the array the
variable "i1" will take on the
value of the index of that item
and we can access the array
value with the notation "array[i1]" .
The order in which the items are
visited is not specified, as the
output below shows.


File: awk_states

BEGIN {
state["Dublin"] = "California";
state["Reno"] = "Nevada"
state["San Jose"] = "California"
state["Oakland"] = "California"
state["Las Vegas"] = "Nevada"

 for( str1 in state )
  print str1 , state[str1]

}

$ awk -f awk_states
Reno Nevada
Dublin California
Las Vegas Nevada
San Jose California
Oakland California
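
A common use of such arrays is counting how often a value occurs. The sketch below assumes made-up two column input (city, state) and a hypothetical array name "count". It prints fixed keys rather than using "for ... in" so that the output order is predictable.

```shell
out=$(printf 'Dublin California\nReno Nevada\nOakland California\n' |
  awk '{ count[$2]++ }          # one counter per distinct state
       END { print "California", count["California"]
             print "Nevada", count["Nevada"] }')
echo "$out"
```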

While loops

The structure of the while loop is

while (  condition )
{
   Body
}

As long as the "condition" is true the body
of the loop is executed. It is similar to the
for loop.



File: while1.cmd

 {
   print "While loop"
   i1=0
   while  ( i1 < 3 )
     {
        print i1
        i1++
     }
}


$ echo "" | awk -f while1.cmd
While loop
0
1
2

Exercise

1)
Modify the prime number example to use a "while" loop
instead of a "for" loop.

Arrays

Let's assume we have a file called "cities.txt".

File: "cities.txt"

1) "Dublin"
2) "Reno"
3) "San Jose"
4) "Oakland"
5) "Las Vegas"

File: "cities"

{ print $2 }

$ awk -f cities cities.txt
"Dublin"
"Reno"
"San
"Oakland"
"Las

We see that the output is not what we want: the
default field separator also splits "San Jose" and
"Las Vegas" at the space inside the name. There isn't
any easy way to fix this with the default separator.
What we can do is change the field separator in the
data with our sed command.

File: "convert_cities.sh"

cat cities.txt | sed -r 's/[ ]+/|/' > cities1.txt

File: cities1.txt

1)|"Dublin"
2)|"Reno"
3)|"San Jose"
4)|"Oakland"
5)|"Las Vegas"

We have replaced the first series of spaces on each
line with the pipe character "|" . Another way of
getting around this problem is to use the quotation
mark as the field separator character.

File: cities1

BEGIN { FS="|" } { print $2 }

In the BEGIN section we set our field separator
to "|" with the command:

FS="|"

$ awk -f cities1 cities1.txt
"Dublin"
"Reno"
"San Jose"
"Oakland"
"Las Vegas"

We do not have to declare an array or its size.
Arrays are associative, which means a subscript
can be a number or a string.

Exercise:

Write a command in a file that prints the cities
using the quotation mark as the separator.

We can of course use the awk arrays in the
traditional sense. The Fibonacci series is of
the form

1,1,2,3,5,8

We start out with 2 numbers ( 1 and 1 ) and the
next number is the sum of the previous 2 numbers.

File: fib1

BEGIN {
  #holder 1 to 10 for fibonacci number
  holder[1] = 1
  holder[2] = 1
  for( i1=3 ; i1<=10 ; i1++ )
   {
     holder[ i1 ] = holder[i1-1] + holder[i1-2]
   }
  for( i1=1 ; i1<=10 ; i1++ )
   {
     print holder[i1]
   }
}

$ awk -f fib1
1
1
2
3
5
8
13
21
34
55
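
Two more array operations are worth knowing alongside the examples in this section: the "in" operator tests whether a subscript exists, and "delete" removes an element. A minimal sketch, reusing the "state" array name from the earlier example with made-up contents:

```shell
out=$(awk 'BEGIN {
    state["Reno"] = "Nevada"
    if ("Reno" in state)     print "Reno is present"
    if (!("Paris" in state)) print "Paris is absent"
    delete state["Reno"]
    if (!("Reno" in state))  print "Reno was deleted"
}')
echo "$out"
```

Testing with "in" does not create the element, whereas merely referring to state["Paris"] would.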

Awk functions

We have awk built in functions that are provided to us and we can also define our own functions if we wish.



File: awk13


BEGIN {
   arr[0] = "Three"
   arr[1] = "One"
   arr[2] = "Two"
   print "Array elements before sorting:"
   for (i1 in arr) {
      print arr[i1]
   }
   asort(arr)
   print "Array elements after sorting:"

   for (i1 in arr) {
      print arr[i1] , length( arr[i1] )
   }

}

$ awk -f awk13
Array elements before sorting:
Three
One
Two
Array elements after sorting:
One 3
Three 5
Two 3

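We are using the "asort" function
to sort and the "length" function
to obtain the length of the string.
Note that "asort" is a gawk extension
and is not available in every awk.
After the call the sorted values are
stored under the indices 1 to 3.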
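
Where "asort" is not available, the sorting can be written by hand. The following is only a sketch of one possible approach (an insertion sort over indices 1 to n), not what "asort" does internally; the array is filled with "split" so that the indices start at 1.

```shell
out=$(awk 'BEGIN {
    n = split("Three One Two", arr, " ")   # arr[1], arr[2], arr[3]
    # Insertion sort on indices 1..n, comparing the values as strings.
    for (i = 2; i <= n; i++) {
        v = arr[i]
        j = i - 1
        while (j >= 1 && arr[j] > v) {
            arr[j + 1] = arr[j]            # shift larger values one place up
            j--
        }
        arr[j + 1] = v
    }
    # A counting loop visits the elements in index order.
    for (i = 1; i <= n; i++)
        print arr[i], length(arr[i])
}')
echo "$out"
```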

In the below example we have
written a function that returns
a 1 if the number passed to it
in the argument is a prime number.


File: awk14


# ind1, ind2 and isPrime are listed as extra parameters
# so that they are local to the function.
function isPrimeNo( num1,    ind1, ind2, isPrime )
{
  ind2 = num1 - 1
  isPrime = 1 ;
  for ( ind1 = 2 ; ind1 <= ind2 ; ind1++ )
   {
     if ( num1 % ind1 == 0 )
      isPrime = 0 ;
   }

  # Blank input and numbers below 2 are not prime.
  if ( isPrime == 1 && length( num1 ) > 0 && num1 + 0 > 1 )
     return 1
  return 0
}

{
  if (  isPrimeNo( $0 )  == 1 )
    print $0  " is a prime number."
}



File: data14.txt


20
21
23
17
7
8
9

$ awk -f awk14 data14.txt
23 is a prime number.
17 is a prime number.
7 is a prime number.
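
Variables inside an awk function are global unless they are listed as extra parameters. The sketch below uses a hypothetical function "sumTo" to show the difference: "i" and "total" are local, so the global "i" keeps its value.

```shell
out=$(awk '
function sumTo(n,    i, total) {    # i and total are local to the function
    for (i = 1; i <= n; i++)
        total += i
    return total
}
BEGIN {
    i = 99                          # a global with the same name
    print sumTo(4)
    print i                         # still 99
}')
echo "$out"
```

The extra spaces before the local names in the parameter list are only a convention to set them apart from the real arguments.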


Exercise

1) File: "data15.txt"

2 3
4 2
3 3
10 3

Use the file "power1.cmd" to fill in the function for power.

File: "power1.cmd"

function power( num1 , num2 )
{
  # TO DO
}

{
  value=power($1, $2)
  print  $1, $2 , value

}

$ cat data15.txt | awk -f power1.cmd
2 3 8
4 2 16
3 3 27
10 3 1000