Monday, November 30, 2009

Awk - extract negative numbers from file


Posting this entry so that awk newbies can see the awk NF and NR variables in use, along with the basic awk 'for' and 'if' constructs.

Input file:

$ cat file.txt
-5232,-92338,84545,34
-2233,25644,23233,2
6211,-1212,4343,43
-2434,621171,9121,-33

Required: Extract the numbers that start with '-'

Solution using tr (a command to translate or delete characters) and grep:

$ tr ',' '\n' < file.txt | grep ^-
Output:
-5232
-92338
-2233
-1212
-2434
-33

Using awk:

$ awk -F "," '
{for(i=1; i<=NF;i++)
if($i ~ /^-/) {printf "%s\n",$i}
}
' file.txt

Extending the above one liner to print the line number and field number where the negative value is present.

$ awk -F "," '
{for(i=1; i<=NF;i++)
if($i ~ /^-/) {
printf "Line # %s,Field # %s,Value = %s\n",NR,i,$i
}
}' file.txt

Output:
Line # 1,Field # 1,Value = -5232
Line # 1,Field # 2,Value = -92338
Line # 2,Field # 1,Value = -2233
Line # 3,Field # 2,Value = -1212
Line # 4,Field # 1,Value = -2434
Line # 4,Field # 4,Value = -33
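
The regex test `$i ~ /^-/` matches any field that begins with a minus sign, including non-numeric text such as `-n/a`. If the fields are known to be numeric, a numeric comparison is one alternative. The following is only a minimal sketch under that assumption; the function name `negative_fields` is made up for illustration.

```shell
# negative_fields: print every comma-separated field whose numeric value is below zero.
# Hypothetical helper; it reads the data on standard input.
negative_fields() {
    awk -F ',' '{
        for (i = 1; i <= NF; i++)
            if ($i + 0 < 0)      # "+ 0" forces a numeric comparison
                print $i
    }'
}

# Example with two of the sample lines
printf '%s\n' '-5232,-92338,84545,34' '6211,-1212,4343,43' | negative_fields
```

Usage with the sample file would be `negative_fields < file.txt`.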

Related posts:

- Awk for loop
- Awk if else
- Awk variables

Saturday, November 28, 2009

Few important Bash Shell shortcuts

Sharing a few common bash shell shortcuts:

Ctrl + a - Jump to the start of the line
Ctrl + e - Jump to the end of the line

Ctrl + l - Clear the screen
Ctrl + r - Search the history backwards

Ctrl + b - Move back a char
Ctrl + f - Move forward a char

Ctrl + k - Delete to the end of the line (it's actually a 'cut'; you can paste using Ctrl + y)
Ctrl + u - Delete backward to the beginning of the line (it's actually a 'cut'; you can paste using Ctrl + y)

Alt + b - Move backward by a word
Alt + f - Move forward by a word

Alt + d - Delete the word forward from the cursor; repeat to keep deleting words forward
Ctrl + w - Delete the word backward from the cursor; repeat to keep deleting words backward

Related post:

- Flip last two characters in bash shell

Friday, November 27, 2009

Awk - replace blank spaces with single space

Input file 'file.txt' contains fields separated with uneven spaces (or tabs)

$ cat file.txt
6767 1212 9090 12
657676 1212 21212 21232
76767 12121 909090 121212
12 9090 1212 21


Required: Replace each run of one or more spaces with a single space, a single tab or a comma.

The solutions using awk:

$ awk -v OFS="," '{$1=$1} 1' file.txt
Output:
6767,1212,9090,12
657676,1212,21212,21232
76767,12121,909090,121212
12,9090,1212,21


$ awk -v OFS="\t" '{$1=$1} 1' file.txt
Output:
6767 1212 9090 12
657676 1212 21212 21232
76767 12121 909090 121212
12 9090 1212 21


$ awk -v OFS=" " '{$1=$1} 1' file.txt
Output:
6767 1212 9090 12
657676 1212 21212 21232
76767 12121 909090 121212
12 9090 1212 21
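
Why this works: assigning to any field, even with its own value, makes awk rebuild $0 by joining the fields with OFS, and the default field splitting has already discarded the runs of spaces and tabs. Here is a hedged sketch of the same idea, written with an explicit action plus the always-true pattern `1`, because a bare `$1=$1` used as a pattern skips records whose first field is 0 or empty. The function name `squeeze_fields` is an assumption for illustration.

```shell
# squeeze_fields SEP: rejoin whitespace-separated fields with SEP (hypothetical helper).
squeeze_fields() {
    # The assignment rebuilds $0 using OFS; the pattern "1" prints every record.
    awk -v OFS="$1" '{ $1 = $1 } 1'
}

# The second record starts with 0 and is still printed
printf '%s\n' '6767   1212  9090 12' '0    1212 21212   21232' | squeeze_fields ','
```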


The solution using sed is here

Tuesday, November 24, 2009

Expand entries in file using awk - bash

Input file:

$ cat data.txt
Manager3|sw5
Manager2|sw engg9,sw12
Manager1|sw1,sw4,sw2,sw engg0

Output required:

Manager3|sw5
Manager2|sw engg9
Manager2|sw12
Manager1|sw1
Manager1|sw4
Manager1|sw2
Manager1|sw engg0

I have already posted (using awk and another using python) the reverse solution of the above, i.e. converting the above expected output to the input file format.

Awk solution:

$ awk -F "|" '{
n=split($2,Arr,",")
for(i=1; i<=n;i++){printf "%s|%s\n",$1,Arr[i]}}
' data.txt
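
The same expansion can also be written without an explicit loop, by replacing each comma with a newline followed by the manager name. This is only a sketch: it assumes the manager name contains no `&` or backslash (both are special in a gsub replacement), and the function name `expand_entries` is hypothetical.

```shell
# expand_entries: print one "manager|item" line per comma-separated item.
# Hypothetical helper; it reads the data on standard input.
expand_entries() {
    awk -F '|' '{
        # Turn every comma into "newline + manager name + |"
        gsub(/,/, "\n" $1 "|", $2)
        print $1 "|" $2
    }'
}

printf '%s\n' 'Manager2|sw engg9,sw12' | expand_entries
```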

Thursday, November 19, 2009

Convert fixed length file to csv - awk

Input file:

$ cat data.txt
k12582927001611USNA
k12582990001497INAS
k12583053001161LNEU

Required output:

k,1258292700,1611,US,NA
k,1258299000,1497,IN,AS
k,1258305300,1161,LN,EU

Awk solutions:

Using GAWK(1)

From GAWK(1) man page:

If the FIELDWIDTHS variable is set to a space separated list of numbers, each field is expected to have fixed width, and gawk splits up the record using the specified widths. The value of FS is ignored.
Assigning a new value to FS overrides the use of FIELDWIDTHS, and restores the default behavior.

$ gawk -v FIELDWIDTHS='1 10 4 2 2' -v OFS=',' '
{ $1=$1 ""; print }
' data.txt

Another alternative using awk substr function

$ awk '{
one=substr($0,1,1)
two=substr($0,2,10)
three=substr($0,12,4)
four=substr($0,16,2)
rest=substr($0,18)
printf ("%s,%s,%s,%s,%s\n", one, two, three, four, rest)
}' data.txt
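
For a layout this small, a sed substitution with one capture group per column is another possibility. A minimal sketch, assuming a sed that accepts -E for extended regular expressions (GNU and BSD sed do) and records that are exactly 19 characters long; records of any other length pass through unchanged. The function name `fixed_to_csv` is made up for illustration.

```shell
# fixed_to_csv: split each 19-character record into fields of width 1, 10, 4, 2 and 2.
# Hypothetical helper; it reads the data on standard input.
fixed_to_csv() {
    sed -E 's/^(.)(.{10})(.{4})(.{2})(.{2})$/\1,\2,\3,\4,\5/'
}

printf '%s\n' 'k12582927001611USNA' | fixed_to_csv
```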

Sunday, November 15, 2009

Diff command ignoring certain lines

Let's try some of the important command-line options available with the UNIX/Linux diff command.

Contents of 'file1.txt' and 'file2.txt' are:

$ cat file1.txt
k,99,32332,10.24
p,676,211,121.44
#Some comment1
n,908,121,12121.54
#Some comment1
l,na,90.23,23
l,na,20.28,23


$ cat file2.txt
k,99,32332,10.24

p,676,211,121.44
#Some commentr3
n,908,121,12121.54
#Some comment1
l,na,90.23,23
#Some more comment
l,na,20.28,23

A normal diff command output:

$ diff file1.txt file2.txt
1a2
>
3c4
< #Some comment1
---
> #Some commentr3
6a8
> #Some more comment

Some of DIFF(1) important command line options:
-q --brief
Output only whether files differ.

$ diff -q file1.txt file2.txt
Files file1.txt and file2.txt differ

-B --ignore-blank-lines
Ignore changes whose lines are all blank.

$ diff -B file1.txt file2.txt
3c4
< #Some comment1
---
> #Some commentr3
6a8
> #Some more comment

-I RE --ignore-matching-lines=RE
Ignore changes whose lines all match RE.
So in this case let's ignore all lines which start with '#'

$ diff -B -I '^#' file1.txt file2.txt
$ echo $?
0

So no difference.

A small and simple bash function for the same:

#!/bin/bash

# Compare two files, ignoring blank lines and lines starting with '#'
_Diff() {
    local file1=$1
    local file2=$2
    echo "Diff-ing '$file1' and '$file2'" \
         "ignoring 'blank' lines and lines" \
         "starting with '#'"
    if diff -q -B -I '^#' "$file1" "$file2" > /dev/null; then
        echo "Passed"
    else
        echo "Failed"
    fi
}

_Diff /tmp/newsch.txt /tmp/oldsch.txt
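
Note that -I suppresses a hunk only when every changed line in it matches the expression, so a comment change sitting right next to a real change is still reported. An alternative is to strip the comment and blank lines first and compare what is left. The following is a rough sketch of that idea, not a replacement for the function above; the name `same_ignoring_comments` and the use of mktemp for scratch files are my own assumptions.

```shell
# same_ignoring_comments FILE1 FILE2 (hypothetical helper):
# succeeds when the files are identical after dropping blank lines
# and lines starting with '#'.
same_ignoring_comments() {
    tmp1=$(mktemp) || return 2
    tmp2=$(mktemp) || { rm -f "$tmp1"; return 2; }
    grep -Ev '^(#|[[:space:]]*$)' "$1" > "$tmp1"
    grep -Ev '^(#|[[:space:]]*$)' "$2" > "$tmp2"
    diff "$tmp1" "$tmp2" > /dev/null
    status=$?
    rm -f "$tmp1" "$tmp2"
    return "$status"
}

# Usage: same_ignoring_comments file1.txt file2.txt && echo "Passed" || echo "Failed"
```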

Another good option available with DIFF(1)

-i, --ignore-case
Ignore case differences in file contents.

Related post:

- Diff remote files using ssh in Linux

Thursday, November 12, 2009

Find last modified directory in UNIX

Question: In my current working directory how can I find the last modified directory ?

Way1:

$ ls -lrt | awk '/^d/ { f=$NF }; END{ print f }'

Way2:

$ ls -d --sort=time */ | head -n 1
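
Parsing ls output breaks on directory names that contain spaces (Way1 prints only the last word of such a name). Below is a sketch that avoids ls altogether by comparing modification times with the shell's -nt test. The function name `newest_dir` is hypothetical, -nt is assumed to be available (bash, ksh and dash provide it), hidden directories are skipped, and symlinks to directories are counted as directories.

```shell
# newest_dir [PARENT]: print the most recently modified subdirectory of PARENT (default: .).
# Hypothetical helper.
newest_dir() {
    newest=
    for d in "${1:-.}"/*/; do
        [ -d "$d" ] || continue        # skip the unexpanded pattern when nothing matches
        if [ -z "$newest" ] || [ "$d" -nt "$newest" ]; then
            newest=$d
        fi
    done
    # Print the winner without its trailing slash, if there was one
    [ -n "$newest" ] && printf '%s\n' "${newest%/}"
}

# Usage: newest_dir            (current directory)
#        newest_dir /var/log   (some other directory)
```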

Related post:

- Find the latest file in a directory in UNIX

Any better way to find this? please suggest; really appreciated. Thanks.

Wednesday, November 11, 2009

Printing single quote in awk - bash

Input file:

$ cat /tmp/file.txt
Computer programming:Zia:78
discrete mathematics:Nil:82
Quantum physics:Leni:91
biomedical engineering:Qureg:82
computer architecture:Anu:90

Required output:

Top in 'Computer programming' : 'Zia'
Top in 'discrete mathematics' : 'Nil'
Top in 'Quantum physics' : 'Leni'
Top in 'biomedical engineering' : 'Qureg'
Top in 'computer architecture' : 'Anu'

i.e.

Top in '1st field' : '2nd field'

Using awk variable assignment technique, i.e. assigning the value 'single quote' to the variable x below:

$ awk -F: -v x="'" '
{print "Top in",x$1x,":",x$2x}
' /tmp/file.txt

And to use "double-quote":

$ awk -F: -v x="\"" '
{print "Top in",x$1x,":",x$2x}
' /tmp/file.txt

Top in "Computer programming" : "Zia"
Top in "discrete mathematics" : "Nil"
Top in "Quantum physics" : "Leni"
Top in "biomedical engineering" : "Qureg"
Top in "computer architecture" : "Anu"

Another solution is to use the ASCII code (hex 27) for the single quote:

$ awk -F: '
{print "Top in","\x27"$1"\x27",":","\x27"$2"\x27"}
' /tmp/file.txt
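
The \x escape is an extension that POSIX awk does not define, whereas octal escapes are part of the standard, so \047 is the more portable spelling of the single quote. A minimal sketch with printf; the function name `top_in` is made up for illustration.

```shell
# top_in: print "Top in 'field1' : 'field2'" for every colon-separated record.
# Hypothetical helper; \047 is the octal escape for the single quote.
top_in() {
    awk -F: '{ printf "Top in \047%s\047 : \047%s\047\n", $1, $2 }'
}

printf '%s\n' 'Computer programming:Zia:78' | top_in
```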

Since number of fields in the input file is very few, we can try this using 'sed'; something like:

$ sed "
s_\(.*\):\(.*\):\(.*\)_Top in '\1' : '\2'_g
" /tmp/file.txt

Above, as you can see, I am using the underscore as the sed delimiter in place of the usual slash "/" (just to avoid confusion).

Related posts:

- Print a column using UNIX sed
- How to access external variable in awk and sed

Tuesday, November 10, 2009

Sum of numbers in file - UNIX alternatives

Input file:

$ cat /tmp/file.txt
286
255564800
609
146
671290

Required: Add (Sum) all the numbers present in the above file.

Way#1: This is probably the most common way of adding up the numbers present in a particular field of a file.

$ awk '{s+=$0} END {print s}' /tmp/file.txt
256237131
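
One thing to keep in mind: print formats non-integer values with OFMT (default "%.6g"), so a sum that has a fractional part and more than six significant digits comes out rounded or in exponential notation. An explicit printf avoids that. A hedged sketch, where the function name `sum_lines` and its optional "number of decimals" argument are assumptions for illustration:

```shell
# sum_lines [DECIMALS]: add up the first field of every input line and print
# the total with DECIMALS digits after the decimal point (default 0).
# Hypothetical helper; it reads the data on standard input.
sum_lines() {
    awk -v d="${1:-0}" '{ s += $1 } END { printf "%." d "f\n", s }'
}

printf '%s\n' 286 255564800 609 146 671290 | sum_lines
printf '%s\n' 1234567.25 0.5 | sum_lines 2
```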

Way#2: Using UNIX/Linux 'paste' command and 'bc'

$ paste -sd+ /tmp/file.txt
286+255564800+609+146+671290

$ paste -sd+ /tmp/file.txt | bc
256237131

Way#3: Using UNIX/Linux 'tr' command and 'bc'

$ tr -s '\n' '+' < /tmp/file.txt
286+255564800+609+146+671290+

$ echo $(tr -s '\n' '+' < /tmp/file.txt)
286+255564800+609+146+671290+

#Since there's an extra '+' at end of above output, echo an additional '0' like this
$ echo $(tr -s '\n' '+' < /tmp/file.txt)0
286+255564800+609+146+671290+0

$ echo $(tr -s '\n' '+' < /tmp/file.txt)0 | bc
256237131

Way#4: Same as above but doing the arithmetic without using 'bc'

$ printf "%d\n" $(( $(tr -s '\n' '+' < /tmp/file.txt) 0 ))
256237131

Way#5: Using sed and 'bc'

$ sed 's/$/+/' /tmp/file.txt
286+
255564800+
609+
146+
671290+

$ echo $(sed 's/$/+/' /tmp/file.txt) 0
286+ 255564800+ 609+ 146+ 671290+ 0

$ echo $(sed 's/$/+/' /tmp/file.txt) 0 | bc
256237131

Way#6 : or a basic bash script using for loop

sum=0
for num in $(cat /tmp/file.txt)
do
((sum+=num))
done
echo $sum

Way#7: Using python

>>> total = 0
>>> lines = open("/tmp/file.txt", "r").readlines()
>>> lines
['286\n', '255564800\n', '609\n', '146\n', '671290\n']
>>> for line in lines:
...     total += int(line)
...
>>> total
256237131

Related posts:

- 'Sum of' and 'group by' using awk
- Sum using awk substr function in bash
- Bash 'while loop' sum issue explained
- 'Exponential' value in awk sum output
- Python - adding numbers in a list

Saturday, November 7, 2009

Construct range from numbers - awk

Required:
With the numbers between 100 and 139 whose last digit is in between 0-3, construct the following output:

100-103
110-113
120-123
130-133

Step by step solution:
1) Numbers between 100 and 139 whose last digit is between 0-3

$ seq 100 139 | grep '[0-3]$'

Output:

100
101
102
103
110
111
112
113
120
121
122
123
130
131
132
133

2) Join them into a single comma-separated line

$ seq 100 139 | grep '[0-3]$' | paste -sd,

Output:

100,101,102,103,110,111,112,113,120,121,122,123,130,131,132,133

3) Split the above into multiple sub-lines with each line containing 4 numbers

$ seq 100 139 | grep '[0-3]$' | paste -sd, | awk -F, '
{ for(i=1;i<=NF;i++)
{printf("%s%s",$i,i%4?",":"\n")}
}'

Output:

100,101,102,103
110,111,112,113
120,121,122,123
130,131,132,133

4) Print the first and last field

$ seq 100 139 | grep '[0-3]$' | paste -sd, | awk -F, '
{ for(i=1;i<=NF;i++)
{printf("%s%s",$i,i%4?",":"\n")}
}' | awk -F, '{print $1"-"$NF}'

Output:

100-103
110-113
120-123
130-133
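
One shorter possibility is to let a single awk remember where each range opens and print when it closes. This is only a sketch: the function name `ranges` and its four arguments are my own, and it assumes the sequence reaches a number ending in the low digit before one ending in the high digit.

```shell
# ranges FIRST LAST LOW HIGH (hypothetical helper): for each ten in FIRST..LAST,
# print "<number ending in LOW>-<number ending in HIGH>".
ranges() {
    seq "$1" "$2" | awk -v lo="$3" -v hi="$4" '
        $1 % 10 == lo { start = $1 }           # remember where the range opens
        $1 % 10 == hi { print start "-" $1 }   # close the range and print it
    '
}

ranges 100 139 0 3
```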

I am sure there must be better ways to achieve this, please comment.

Related post:

- Break a line into multiple lines using awk and sed

Monday, November 2, 2009

Bash - numbering lines in file using awk

Input file 'file.txt' contains the names of a few students.

$ cat file.txt
Sam G
Ashok Niak
Rosy M
Peter K
Sid Thom
Rasi Yad
Papu S
Niaraj J
Aloh N K
Nipu H
Quam L

Required output:

For the entries of the above file,
- add a serial number to each line
- also add a 'House' number such that all the students are grouped into a total of 4 houses in the following fashion:

Sl No,Name,House
1,Sam G,House1
2,Ashok Niak,House2
3,Rosy M,House3
4,Peter K,House4
5,Sid Thom,House1
6,Rasi Yad,House2
7,Papu S,House3
8,Niaraj J,House4
9,Aloh N K,House1
10,Nipu H,House2
11,Quam L,House3

The awk solution using awk NR variable:

$ awk '
BEGIN {OFS=","; print "Sl No,Name,House"}
{print NR,$0,"House"((NR-1)%4)+1}
' file.txt
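
The expression ((NR-1)%4)+1 maps the line numbers 1, 2, 3, 4, 5, ... to 1, 2, 3, 4, 1, ...; and because awk's "+" binds tighter than string concatenation, the sum is computed before it is appended to "House". A sketch that makes the number of houses a parameter (the awk variable `n` and the function name `assign_houses` are assumptions for illustration):

```shell
# assign_houses [N]: number the input lines and deal them round-robin into N houses
# (default 4). Hypothetical helper; it reads the names on standard input.
assign_houses() {
    awk -v n="${1:-4}" '
        BEGIN { OFS = ","; print "Sl No,Name,House" }
        { print NR, $0, "House" ((NR - 1) % n + 1) }
    '
}

# Three students, two houses
printf '%s\n' 'Sam G' 'Ashok Niak' 'Rosy M' | assign_houses 2
```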

Let's format the output for a better look:

$ awk '
BEGIN {
FORMAT="%-8s%-18s%s\n" ;
{printf FORMAT,"Sl No","Name","House"}
}
{printf FORMAT,NR,$0,"House"((NR-1)%4)+1}
' file.txt

Output:

Sl No   Name              House
1       Sam G             House1
2       Ashok Niak        House2
3       Rosy M            House3
4       Peter K           House4
5       Sid Thom          House1
6       Rasi Yad          House2
7       Papu S            House3
8       Niaraj J          House4
9       Aloh N K          House1
10      Nipu H            House2
11      Quam L            House3

Read about text alignment using awk printf function here

A Bash script for the same will be something like this:

#!/bin/bash
i=0
while read -r
do
    echo "$((i+1)),$REPLY,House$((i++ % 4 + 1))"
done < file.txt

Output:

$ bash numbering.sh
1,Sam G,House1
2,Ashok Niak,House2
3,Rosy M,House3
4,Peter K,House4
5,Sid Thom,House1
6,Rasi Yad,House2
7,Papu S,House3
8,Niaraj J,House4
9,Aloh N K,House1
10,Nipu H,House2
11,Quam L,House3

Now a question:
What is that '$REPLY' in the above script ?

Answer: REPLY is the default variable in which bash's read builtin stores the line when no variable name is supplied to it.

So the above script is the same as:

#!/bin/bash
i=0
while IFS= read -r line
do
    echo "$((i+1)),$line,House$((i++ % 4 + 1))"
done < file.txt


In general, the lines of a file can be numbered in several ways, for example:

Using UNIX/Linux nl(1) command - number lines of files

$ nl file.txt
1 Sam G
2 Ashok Niak
3 Rosy M
4 Peter K
5 Sid Thom
6 Rasi Yad
7 Papu S
8 Niaraj J
9 Aloh N K
10 Nipu H
11 Quam L

Using awk NR:

$ awk '{print "\t"NR"\t"$0}' file.txt
1 Sam G
2 Ashok Niak
3 Rosy M
4 Peter K
5 Sid Thom
6 Rasi Yad
7 Papu S
8 Niaraj J
9 Aloh N K
10 Nipu H
11 Quam L

Using sed syntax:

$ sed = file.txt | sed 'N;s/\n/\t/'
1 Sam G
2 Ashok Niak
3 Rosy M
4 Peter K
5 Sid Thom
6 Rasi Yad
7 Papu S
8 Niaraj J
9 Aloh N K
10 Nipu H
11 Quam L

© Jadu Saikia http://unstableme.blogspot.com