Monday, November 22, 2010

A newbie tutorial on UNIX Awk utility


Sharing a few Awk one liners (very basic and random) which will help the Awk newbies to gain a quick confidence on one of the beautiful UNIX tools - Awk. Some of my recommendations for learning UNIX Awk are mentioned at the bottom of this post. Have a great time with Awk.

Meaning of some of the Awk Built-in Variables used below:

NF : Number of fields in current line/record
NR : Ordial number of current line/record
FS : Field Separator (Also -F can be used)
OFS : Output Field Separator (default=blank)
FILENAME : Name of current input file


All of following Awk one liners are based on the following input file 'test1.txt'

$ cat test1.txt
Continent:Val
AS:12000
AF:9800
AS:12300
NA:3400
OC:12000
AF:500
AS:1000

A)
#Print 'line number' NR and 'Number of fields' NF for each line

$ awk -F ":" '{print NR,NF}' test1.txt
1 2
2 2
3 2
4 2
5 2
6 2
7 2
8 2

B)
#Print first field, colon delimited

$ awk -F ":" '{print $1}' test1.txt
Continent
AS
AF
AS
NA
OC
AF
AS

C)
#Same as B, but excluding the 'first line' (NR!=1)

$ awk -F ":" 'NR!=1 {print $1}' test1.txt
AS
AF
AS
NA
OC
AF
AS

D)
#Same as B but only for line number 1 (NR==1)

$ awk -F ":" 'NR==1 {print $1}' test1.txt
Continent

E)
#Print first and second field

$ awk -F ":" 'NR!=1 {print $1,$2}' test1.txt
AS 12000
AF 9800
AS 12300
NA 3400
OC 12000
AF 500
AS 1000

F)
#Setting output field separator as pipe

$ awk -F ":" 'BEGIN{OFS="|"} NR!=1 {print $1,$2}' test1.txt
AS|12000
AF|9800
AS|12300
NA|3400
OC|12000
AF|500
AS|1000

G)
#FS and OFS can be included in BEGIN section

$ awk 'BEGIN{FS=":"; OFS="|"} NR!=1 {print $1,$2}' test1.txt
AS|12000
AF|9800
AS|12300
NA|3400
OC|12000
AF|500
AS|1000

H)
#Anything on BEGIN executes first

$ awk 'BEGIN{FS=":"; OFS="|"; print "Con|SomeVal"} NR!=1 {print $1,$2}' test1.txt
Con|SomeVal
AS|12000
AF|9800
AS|12300
NA|3400
OC|12000
AF|500
AS|1000

I)
#Printing FILENAME, will be printed for all the lines

$ awk -F ":" '{print FILENAME}' test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt
test1.txt

#Same as above but printing only last instance using END clause

$ awk -F ":" ' END {print FILENAME}' test1.txt
test1.txt

J)
#Revisiting NF, number of fields in each line

$ awk -F ":" '{print NF}' test1.txt
2
2
2
2
2
2
2
2

K)
#Printing the last field of the file, same as printing $2 as there are only 2 fields

$ awk -F ":" '{print $NF}' test1.txt
Val
12000
9800
12300
3400
12000
500
1000

L)
#Matching, printing lines begin with "AS"

$ awk -F ":" '/^AS/' test1.txt
AS:12000
AS:12300
AS:1000

#'Not condition' of the L

$ awk -F ":" '!/^AS/' test1.txt
Continent:Val
AF:9800
NA:3400
OC:12000
AF:500

#Direct matching, first field as "AS"

$ awk -F ":" '$1=="AS"' test1.txt
AS:12000
AS:12300
AS:1000

$ awk -F ":" '$1=="AS" {print $2}' test1.txt
12000
12300
1000

#$0 prints the full line, same as {print}

$ awk -F ":" '$1=="AS" {print $0}' test1.txt
AS:12000
AS:12300
AS:1000

$ awk -F ":" '$1=="AS" {print}' test1.txt
AS:12000
AS:12300
AS:1000

M)
# 'Or' condition

$ awk -F ":" '$1=="AS" || $1=="OC" {print}' test1.txt
AS:12000
AS:12300
OC:12000
AS:1000

# 'Or' and 'AND' together

$ awk -F ":" '($1=="AS" || $1=="OC") && $NF > 11000 {print}' test1.txt
AS:12000
AS:12300
OC:12000

$ awk -F ":" '($1=="AS" || $1=="OC") && $NF > 11000 {print $2-10000}' test1.txt
2000
2300
2000

$ awk -F ":" '($1=="AS" || $1=="OC") && $NF > 11000 {print ($2-($2%100))}' test1.txt
12000
12300
12000

N)
#Partial Matching

$ awk -F ":" '$1 ~ /AS/ {print}' test1.txt
AS:12000
AS:12300
AS:1000

$ awk -F ":" '$1 ~ /A/ {print}' test1.txt
AS:12000
AF:9800
AS:12300
NA:3400
AF:500
AS:1000

O)
#Reading from STDOUT

$ cat test1.txt | awk -F ":" '!/Continent/ {print $1}' | sort | uniq
AF
AS
NA
OC

P)
#Add value 1000 to the 2nd field, where first field is "AF" and then print the output file

$ awk -F ":" '$1=="AF" {$2+=1000} {print}' test1.txt
Continent:Val
AS:12000
AF 10800
AS:12300
NA:3400
OC:12000
AF 1500
AS:1000

#As no OFS is mentioned above, by default, OFS is blank, specifying it now

$ awk -F ":" 'BEGIN {OFS=":"} $1=="AF" {$2+=1000} {print}' test1.txt
Continent:Val
AS:12000
AF:10800
AS:12300
NA:3400
OC:12000
AF:1500
AS:1000

Q)
#Sum of 2nd fields, exclude first line

$ awk -F ":" 'NR!=1 {sum+=$NF} END {print sum}' test1.txt
51000

#If END is not mentioned ?

$ awk -F ":" 'NR!=1 {sum+=$NF} {print sum}' test1.txt
12000
21800
34100
37500
49500
50000
51000

#Average of 2nd field, as first field is excluded, (NR-1) instead of NR for total number of items

$ awk -F ":" 'NR!=1 {sum+=$NF} END {print sum/(NR-1)}' test1.txt
7285.71

$ awk -F ":" 'NR!=1 && $1=="AS" {sum+=$NF} END {print sum}' test1.txt
25300

R) 'Group by' and 'Count' using associative array in awk

$ awk ' BEGIN {FS=OFS=":"}
NR==1 {print "Continent:Count:Sum(val)"}
NR>1 {a[$1]++;b[$1]=b[$1]+$2} END{for (i in a) print i,a[i],b[i]}' test1.txt

Continent:Count:Sum(val)
OC:1:12000
NA:1:3400
AF:2:10300
AS:3:25300

#Count of each of the continents(1st field)

$ awk -F ":" '
NR==1 {next}
NR>1 {a[$1]++} END{for (i in a) print i,a[i]}' test1.txt
OC 1
NA 1
AF 2
AS 3

#An alternative

$ awk -F ":" 'NR!=1 {print $1}' test1.txt | sort | uniq -c
2 AF
3 AS
1 NA
1 OC

#Reading from STDOUT

$ cat test1.txt | awk 'BEGIN {OFS=":"} {print NR-1,$0}'
0:Continent:Val
1:AS:12000
2:AF:9800
3:AS:12300
4:NA:3400
5:OC:12000
6:AF:500
7:AS:1000

S)
#Accessing external variable in awk:

$ awk -F ":" '$1=="AS" {print $2}' test1.txt
12000
12300
1000

#Suppose value of filterval is "AS"

$ filterval="AS"
$ echo $filterval
AS

#Accessing value of variable filterval inside Awk

$ awk -F ":" -v con=$filterval '$1==con {print $2}' test1.txt
12000
12300
1000

#Set 2nd value as 0 where first field is "AS"

$ awk -F ":" 'BEGIN {OFS=":"} $1=="AS" {$2=0} {print}' test1.txt
Continent:Val
AS:0
AF:9800
AS:0
NA:3400
OC:12000
AF:500
AS:0

My recommendation for AWK:
- A complete tutorial on AWK from vectorsite
- Awk One line page
- Peteris Krumins has done a great job by explaining each and every one lines in details, a must read : part1 part2

Related posts from my blog:
- A post on awk substr function
- Difference between awk NR and FNR variables

Wednesday, November 17, 2010

Sort file based on last field - UNIX

From SORT(1) man page:

-n, --numeric-sort
compare according to string numerical value

-k, --key=POS1[,POS2]
start a key at POS1, end it at POS2 (origin 1)

-t, --field-separator=SEP
use SEP instead of non-blank to blank transition

1) File with same number of fields in each line


$ cat file.txt
IN,90,453
US,12,120
NZ,89,200
WI,20,500
TS,12,124


$ sort -t, -n -k3 file.txt
US,12,120
TS,12,124
NZ,89,200
IN,90,453
WI,20,500

2) File with uneven number of fields in each line

$ cat file.txt
IN,90,453
US,12,1,120
NZ,89,200
WI,500
TS,12,124

Solution : The idea is join the last field(column) of each line at the beginning of each of the lines (with a different delimiter, in this case I am using pipe "|")

$ awk 'BEGIN {FS=","; OFS="|"}{print $NF,$0}' file.txt
453|IN,90,453
120|US,12,1,120
200|NZ,89,200
500|WI,500
124|TS,12,124

Then sort on 1st field with pipe ("|") delimited.

$ awk 'BEGIN {FS=","; OFS="|"}{print $NF,$0}' file.txt | sort -n -t"|" -k1
120|US,12,1,120
124|TS,12,124
200|NZ,89,200
453|IN,90,453
500|WI,500

Then discard the first field i.e. print the 2nd field with pipe ("|") delimited.

$ awk 'BEGIN {FS=","; OFS="|"}{print $NF,$0}' file.txt | sort -n -t"|" -k1 | awk -F "|" '{print $NF}'
US,12,1,120
TS,12,124
NZ,89,200
IN,90,453
WI,500

Which is same as:

$ awk 'BEGIN {FS=","; OFS="|"}{print $NF,$0}' file.txt | sort -n -t"|" -k1 | awk -F "|" '{print $2}'

or

$ awk 'BEGIN {FS=","; OFS="|"}{print $NF,$0}' file.txt | sort -n -t"|" -k1 | cut -d"|" -f2


A question to unstableme readers:
- Is there a better way to directly mention the last field number (as counting field number incase line has huge number of fields is never going to be easy) ? Thanks in advance, much appreciated.

Related posts:
- UNIX sort strings by string length using awk in Bash
- Changing temporary directory for UNIX sort command
- Sort date in ddmmyyyy format using awk

Monday, November 15, 2010

Vi open file at required line number

Sometime we are required to open a file in vim and then need to go to a particular line number in order to check something or to edit something. This is how we can open a file in vim and directly put the cursor on the required line of our interest.

Way1:
Using vim -c option:

From VIM(1) man page:

-c {command}
{command} will be executed after the first file has been read. {command} is interpreted as an
Ex command. If the {command} contains spaces it must be enclosed in double quotes (this
depends on the shell that is used). Example: Vim "+set si" main.c

Following command will open file 'run_parallel_qs.py' in vim with the cursor on line number 23.

$ vim -c 23 run_parallel_qs.py


Way2:
One can use the + to specify the cursor starting point.

$ vim run_parallel_qs.py +23

which is same as

$ vim +23 run_parallel_qs.py

Without line number, i.e. the following command will open the file in vim and will put the cursor on last line

$ vim run_parallel_queries.py +

And as we know just opening a file in vi/vim, the cursor will be on line number one.

Related UNIX vim editor tips:
- Open file under cursor in Vi Editor
- A short tutorial on using tab window in Vi editor
- Pull word under cursor in Vim
- Start a new shell from Vi editor
- How to edit remote file using ssh in Vi editor

© Jadu Saikia http://unstableme.blogspot.com