Saturday, October 31, 2009

Extract range of lines using sed awk bash


Below are a few different ways to print or extract a section of a file based on line numbers.

Let's extract lines 27 through 99 of the input file 'file.txt'.

Using sed editor:

$ sed -n '27,99 p' file.txt > /tmp/file1

Which is the same as:

$ sed '27,99 !d' file.txt > /tmp/file2

Awk alternative: you can use awk's NR variable (the number of the current input line)

$ awk 'NR >= 27 && NR <= 99' file.txt > /tmp/file3

Using Linux/UNIX 'head' and 'tail' command:

$ head -n 99 file.txt | tail -n 73 > /tmp/file4

Which is basically:

$ head -n 99 file.txt | tail -n $(((99-27)+1)) > /tmp/file5

In vi editor, we can use the following command in ex mode (open the main file 'file.txt' in vi):

:27,99 w! /tmp/file6

i.e. Write lines between line number 27 and line number 99 of main file 'file.txt' to file '/tmp/file6'

Perl alternative would be:

$ perl -ne 'print if 27..99' file.txt > /tmp/file7

And the solution using python:

$ python
Python 2.5.2 (r252:60911, Jul 22 2009, 15:35:03)
[GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2

>>> fp = open("/tmp/file8","w")
>>> for i,line in enumerate(open("file.txt")):
...     # enumerate() counts from 0, so lines 27..99 are indexes 26..98
...     if i >= 26 and i < 99 :
...         fp.write(line)
...
>>> fp.close()

So the contents of all the output files produced (i.e /tmp/file[1-8]) will be the same (i.e. line number 27 to line number 99 of 'file.txt')
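
The sed form above can also be wrapped in a small shell function. This is only a sketch: the name extract_range is made up for illustration, and the extra 'q' command makes sed stop reading once the last wanted line has been printed, which can help on large files.

```shell
# extract_range START END FILE  (hypothetical helper name)
# Print lines START..END of FILE, then quit so sed does not scan the rest.
extract_range() {
    sed -n "${1},${2}p;${2}q" "$3"
}

# Example, equivalent to: sed -n '27,99 p' file.txt
# extract_range 27 99 file.txt
```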

Friday, October 30, 2009

Bash while loop sum issue explained

In one of my directories I had a lot of log files, and I had to find the total number of lines that start with 's' (i.e. ^s).
My first approach was:

$ ls | xargs -i grep -c ^s {} | awk '{sum+=$0} END {print sum}'
190978

And I got my result. Then I thought of doing the same with a bash script, first using a for loop and then a while loop. This is what I tried:

#!/bin/bash

sum=0
DIR=~/original
for file in $(ls $DIR)
do
Slines=$(grep -c ^s $DIR/$file)
((sum+=Slines))
#You can also use
#sum=$(expr $sum + $Slines)
#sum=`expr $sum + $Slines`
done
echo $sum

Executing it:

$ ./usingfor.sh
190978

Cool, correct result.

And then I modified the above script to use a bash while loop:

#!/bin/bash
sum=0
DIR=~/original
ls $DIR | while read file
do
Slines=$(grep -c ^s $DIR/$file)
((sum+=Slines))
done
echo $sum

Executing it:

$ ./usingwhile.sh
0


Oops!!! What went wrong?

In bash, each command in a pipeline runs in its own subshell, and that includes a while loop on the receiving end of a pipe.
So in the above example the 'sum' variable that the loop updates belongs to the subshell, and the modified value is lost when the loop ends. Back in the main script, sum is still 0, the value we initialized it to at the beginning.

The fix for this variable scoping problem with while and a direct pipe:

Remove the pipe and feed the list of file names under the '~/original' directory to the while loop on stdin through redirection, as shown below (basically, create a temp file with the file names of the directory '~/original').

#!/bin/bash
sum=0
DIR=~/original
ls $DIR > /tmp/filelist

while read file
do
Slines=$(grep -c ^s $DIR/$file)
((sum+=Slines))
done < /tmp/filelist
echo $sum

Executing it:

$ ./usingwhile_1.sh
190978

And the result is correct.
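
The temporary file can be avoided by feeding the loop from a here-document, which also keeps the loop in the current shell. A minimal sketch, assuming file names without embedded newlines; the function name count_s_lines is hypothetical and not part of the scripts above.

```shell
# count_s_lines DIR  (hypothetical helper name)
# Sum the number of lines starting with 's' over the regular files in DIR.
count_s_lines() {
    dir=$1
    sum=0
    while IFS= read -r file
    do
        [ -f "$dir/$file" ] || continue
        # grep -c still prints 0 when nothing matches, but exits non-zero
        n=$(grep -c '^s' "$dir/$file" || true)
        sum=$((sum + n))
    done <<EOF
$(ls "$dir")
EOF
    # The loop ran in the current shell, so sum keeps its value here.
    echo "$sum"
}
```

In bash specifically, process substitution (done < <(ls "$DIR")) has the same effect of keeping the loop out of a subshell.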

Wednesday, October 28, 2009

Grep and print control characters in file - unix

One of my input files had some control characters (^B, i.e. hex \x02).

On my Ubuntu 8.04.3 box, with this version of GNU grep:

$ grep --version
GNU grep 2.5.3

I can grep for any control characters like this:

$ grep '[[:cntrl:]]' /tmp/file.txt
$ grep '[[:cntrl:]]' /tmp/file.txt | less




Also, if you know what to grep for, say the control character ^B (hex \x02) from the example above, you can grep for it directly like this:

$ grep ^B /tmp/file.txt

* Type ^B as Ctrl-V followed by Ctrl-B
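
If typing the control character is inconvenient, the shell can generate it instead. A small sketch, assuming a throwaway sample file created with printf; the variable names are illustrative only.

```shell
sample=$(mktemp)
# Two lines; the second one contains a literal ^B (octal 002).
printf 'clean line\nbad\002line\n' > "$sample"

ctrl_b=$(printf '\002')                  # the ^B character itself
matches=$(grep -c "$ctrl_b" "$sample")   # count the lines containing ^B
echo "$matches"
rm -f "$sample"
```

In bash, the same character can also be written inline as $'\x02'.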

And to match any non printable characters, here is another way using grep

$ grep '[^[:print:]]' /tmp/file.txt


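To display non-printable characters, here is a way using the GNU cat command (my cat version: cat GNU coreutils 6.10). Here -v shows control characters in ^ notation, -e marks each line end with $, and -t shows tabs as ^I: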

$ cat -v -e -t /tmp/s
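
A self-contained sketch of what these flags do, using a sample line made up with printf rather than the file from this post:

```shell
# One line containing a ^B (octal 002) and a tab.
shown=$(printf 'a\002b\tc\n' | cat -v -e -t)
echo "$shown"    # prints: a^Bb^Ic$
```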


Monday, October 26, 2009

Find n-th occurrence of pattern - vim tip

Question: In vi editor, how can I find or locate the nth occurrence of a particular search pattern ?

Answer: In vim, once you have searched for a pattern, say /queryname, typing 4n in normal (command) mode jumps forward four matches from the current cursor position.

So to land on the 10th occurrence of a pattern in the file, go to the top of the file (:1) and search for the pattern (/pattern), which puts you on the first match after the cursor; then type 9n. Vim also accepts a count in front of the search itself, so 10/pattern from the top of the file reaches the same place.

Saturday, October 24, 2009

Linux shuf command - generate random permutations

shuf - generate random permutations

Let's discuss the command-line options available with the Linux/UNIX 'shuf' command.

From SHUF(1) man page:
1) -e, --echo
treat each ARG as an input line

$ shuf -e 3 5 6 7
7
6
5
3

$ shuf -e 3 5 6 7
7
5
6
3

$ shuf -e 3 5 6 7
3
6
5
7


2) -i, --input-range=LO-HI
treat each number LO through HI as an input line

To shuffle the numbers between 100 and 200

$ shuf -i 100-200

Also, 'shuf' can be combined with the UNIX/Linux 'seq' or 'jot' command to do the same as the "-i" option.

$ shuf -e $(seq 100 200)
$ shuf -e $(jot - 100 200)

3) -n, --head-lines=LINES
output at most LINES lines

$ shuf -i 100-200 -n 3
118
133
117

$ shuf -i 100-200 -n 3
193
188
145

To print a random word in Linux/UNIX

$ shuf -n 1 /usr/share/dict/words
disrupted

$ shuf -n 1 /usr/share/dict/words
festered

Note: /usr/share/dict/words is a standard file on UNIX-like operating systems; it is a newline-delimited list of dictionary words.

4) -o, --output=FILE
write result to FILE instead of standard output

$ shuf -n 3 /usr/share/dict/words -o /tmp/dict.txt
$ cat /tmp/dict.txt
heartlands
temple
unsatisfied

Also you can use UNIX/Linux redirection for the same

$ shuf -n 3 /usr/share/dict/words > /tmp/dict.txt

You can shuffle the lines of file and print the output to standard output like this

$ shuf < /tmp/file.txt
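
A quick sanity check of two properties used above: shuf only reorders its input, and -n picks lines without repeating any of them. This is a minimal sketch with made-up variable names.

```shell
# Shuffle 1..10, then sort the result back: nothing is lost or duplicated.
shuffled=$(shuf -i 1-10)
restored=$(printf '%s\n' "$shuffled" | sort -n)
expected=$(seq 1 10)

# Pick 3 numbers and count the distinct values among them.
picked=$(shuf -i 1-10 -n 3 | sort -u | wc -l | tr -d ' ')

[ "$restored" = "$expected" ] && echo "same lines, different order"
echo "distinct picks: $picked"
```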

Related post:

- Generate random words in Linux in bash

Monday, October 19, 2009

Exponential value in awk sum output

On one of my Debian boxes with mawk 1.3.3 (mawk is an interpreter for the AWK programming language), if I try to add up the 2nd fields of the following file using awk:

$ cat data.txt
a:99540232
b:89795683
a:08160808
c:0971544
d:99500728
a:12212539898
d:98065599
e:92640031
a:3129013
c:4085555

The output:

$ awk -F ":" '{sum+=$NF} END {print sum}' data.txt
1.27084e+10

So awk prints the sum in exponential notation, as seen above.

To get the sum printed as an integer, here is a way:

$ awk -F ":" '{sum+=$NF} END { printf ("%0.0f\n", sum)} ' data.txt
12708429091
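
Another option is to change awk's default output format for numbers (OFMT) instead of calling printf. This is a sketch under the assumption that your awk falls back to OFMT when printing such a large value, as the mawk output above suggests; the function name sum_last_field is made up for illustration.

```shell
# sum_last_field FILE  (hypothetical helper name)
# Add up the last ':'-separated field of every line and print the total
# without exponent notation.
sum_last_field() {
    awk -F ":" 'BEGIN { OFMT = "%.0f" } { sum += $NF } END { print sum }' "$1"
}

# Example: sum_last_field data.txt
```

Keep in mind that OFMT applies to every non-integer number printed by the script, so printf remains the better choice when only one value needs this treatment.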

But on my Ubuntu 8.04.3 with awk version:

$ awk --version | head -1
GNU Awk 3.1.6

$ awk -F ":" '{sum+=$NF} END {print sum}' data.txt
12708429091

$ awk -F ":" '{sum+=$NF} END { printf ("%d\n", sum)} ' data.txt
12708429091

$ awk -F ":" '{sum+=$NF} END { printf ("%0.0f\n", sum)} ' data.txt
12708429091

Thursday, October 15, 2009

Awk - split file vertically on columns

I have already posted on how to split a file into multiple sub-files based on different conditions (that was basically a horizontal split of the file); let's see how we can split a file vertically.

Input file 'file.txt' is a csv file:

$ cat file.txt
A,B,C,D,E,F,G,H,I
1,2,3,4,5,6,7,8,9
I,II,III,IV,V,VII,VIII,IX
a,b,c,d,e,f,g,h,i

Required:

Split the above file into two sub-files such that 1st 3 columns are written to sub-file1 and rest of the columns to sub-file2.
i.e.

sub-file1 content will be

A,B,C
1,2,3
I,II,III
a,b,c

And sub-file2 content will be

D,E,F,G,H,I
4,5,6,7,8,9
IV,V,VII,VIII,IX
d,e,f,g,h,i

Well, this is a pretty simple task using Linux/UNIX cut command

#Printing first 3 columns of 'file.txt'
$ cut -d"," -f1-3 file.txt
or
$ cut -d"," -f-3 file.txt

and

#Printing from 4th column till end
$ cut -d"," -f4-9 file.txt
or
$ cut -d"," -f4- file.txt

Awk solution:

$ awk -F "," '
{
for(i=1;i<=NF;i++) {
if(i <= 3) {
printf "%s,", $i >> "sub-file1"
if(i==3){
printf "\n" >> "sub-file1"
}
} else {
printf "%s,", $i >> "sub-file2"
if(i==NF){
printf "\n" >> "sub-file2"
}
}
}
}' file.txt

Sub-files generated after running the above awk script:

$ cat sub-file1
A,B,C,
1,2,3,
I,II,III,
a,b,c,

$ cat sub-file2
D,E,F,G,H,I,
4,5,6,7,8,9,
IV,V,VII,VIII,IX,
d,e,f,g,h,i,
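
The awk script above leaves a trailing comma on every line and appends to the sub-files on each run. A variant that builds each half in a variable first avoids both. This is a sketch only; the function name split_columns and its argument order are assumptions made for illustration.

```shell
# split_columns INPUT LEFT_OUT RIGHT_OUT  (hypothetical helper name)
# Write the first 3 comma-separated columns to LEFT_OUT, the rest to RIGHT_OUT.
split_columns() {
    awk -F "," -v OFS="," -v left="$2" -v right="$3" '
    {
        l = ""; r = ""
        for (i = 1; i <= NF; i++) {
            if (i <= 3) l = (i == 1 ? $i : l OFS $i)
            else        r = (i == 4 ? $i : r OFS $i)
        }
        # ">" truncates each file once per run, then keeps writing to it
        print l > left
        print r > right
    }' "$1"
}

# Example: split_columns file.txt sub-file1 sub-file2
```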

Wednesday, October 14, 2009

Insert lines from files using Linux paste command

I have already posted on some good uses of the Linux/UNIX 'paste' command; let's check another practical one.

Input files:

$ cat contestant.txt
Christopher
Williams
Darwin
Ajay
Brain
Amay
Jiten
Lila

$ cat leader.txt
Mr B
Mrs C
Mrs A

Output required:

For every single line of 'leader.txt'; insert 3 lines from file 'contestant.txt'; so that the output looks like this:

Mr B
Christopher
Williams
Darwin
Mrs C
Ajay
Brain
Amay
Mrs A
Jiten
Lila


The step by step solution using Linux/UNIX paste command

$ cat contestant.txt | paste - - -
Output:
Christopher Williams Darwin
Ajay Brain Amay
Jiten Lila

$ cat contestant.txt | paste - - - | paste leader.txt -
Output:
Mr B Christopher Williams Darwin
Mrs C Ajay Brain Amay
Mrs A Jiten Lila

$ cat contestant.txt | paste - - - |paste leader.txt - |tr "\t" "\n"
Output:
Mr B
Christopher
Williams
Darwin
Mrs C
Ajay
Brain
Amay
Mrs A
Jiten
Lila

Another similar one liner for the same:

$ < contestant.txt paste - - - | paste leader.txt - | tr "\t" "\n"

Friday, October 9, 2009

Grouping files using awk in Bash shell

My directory contains a set of log files whose names follow this pattern:
debug.vendor-name.some-serial-number.epoch-time-stamp.device-class.log

where:
epoch-time-stamp
is the UNIX time stamp when the log file was generated.

device-class
the first 4 characters of this field represent the service name of the device and the next 6 characters the device class name

$ ls -1
debug.cisco.0001.1254059837.svc1class2.log
debug.cisco.0001.1255058827.svc1class3.log
debug.cisco.0001.1255058827.svc2class3.log
debug.cisco.0001.1255058837.svc1class2.log
debug.cisco.0001.1255059834.svc2class3.log
debug.cisco.0002.1255059819.svc1grade2.log
debug.cisco.0002.1255059849.svc1class1.log
debug.cisco.0002.1255059849.svc2class1.log
debug.juniper.0001.1255059831.svc1class2.log

Let's group similar files (under different conditions) and count the number of files in each group.

One: Group based on vendor-name(2nd field)

$ ls | awk -F "." '{count[$2]++}END{for(j in count) print j,"["count[j]"]"}'

Output:
cisco [8]
juniper [1]

Two: Group based on vendor-name(2nd field) and serial-number(3rd field)

$ ls | awk -F "." '{count[$2" "$3]++}END{for(j in count) print j,"["count[j]"]"}'

Output:
cisco 0002 [3]
juniper 0001 [1]
cisco 0001 [5]
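
Note that the groups above come out in no particular order: awk does not guarantee any ordering for a 'for (j in count)' loop. Piping the result through sort makes the output reproducible. A minimal sketch, fed with a few of the file names listed earlier instead of a real directory; the function name is made up.

```shell
# count_by_vendor_serial  (hypothetical helper name)
# Read file names on stdin, group by 2nd and 3rd dot-separated fields.
count_by_vendor_serial() {
    awk -F "." '{ count[$2" "$3]++ }
        END { for (j in count) print j, "[" count[j] "]" }' | sort
}

grouped=$(printf '%s\n' \
    debug.cisco.0001.1254059837.svc1class2.log \
    debug.cisco.0001.1255058827.svc1class3.log \
    debug.cisco.0002.1255059819.svc1grade2.log \
    debug.juniper.0001.1255059831.svc1class2.log | count_by_vendor_serial)
echo "$grouped"
```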

Three: Group based on vendor-name(2nd field) , serial-number(3rd field) and UNIX-time-stamp(4th field) in hour bucketing*

$ ls | awk -F "." '{count[$2" "$3" "$4-($4%3600)]++}
END{for(j in count) print j,"["count[j]"]"}'

Output:
juniper 0001 1255057200 [1]
cisco 0001 1254056400 [1]
cisco 0002 1255057200 [3]
cisco 0001 1255057200 [4]

*hour bucketing :
e.g: 'Fri Oct 9 09:51:55 UTC 2009' and 'Fri Oct 9 09:01:55 UTC 2009' will fall to the same bucket of Fri Oct 9 09:00:00 UTC 2009
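
The bucketing is plain modular arithmetic: subtracting the remainder of a division by the bucket size rounds the time stamp down to the start of its bucket. A small sketch in shell arithmetic, using one of the time stamps from the listing; the variable names are illustrative.

```shell
ts=1255058827

# Round down to the start of the hour (3600 seconds) ...
hour_bucket=$((ts - ts % 3600))
# ... or to the start of the UTC day (86400 seconds).
day_bucket=$((ts - ts % 86400))

echo "$hour_bucket $day_bucket"    # prints: 1255057200 1255046400
```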

Four: Group based on vendor-name(2nd field) and first 4 characters of device-class (5th field)

$ ls | awk -F "." '
{ $5 = substr($5, 1, 4) }
{count[$2" "$5]++}
END{for(j in count) print j,"["count[j]"]"}'

Output:
juniper svc1 [1]
cisco svc1 [5]
cisco svc2 [3]

Five: Group based on
vendor-name(2nd field),
serial-number(3rd field) ,
UNIX-time-stamp(4th field) in day bucketing (86400 seconds)
and first 4 characters of device-class (5th field)

$ ls | awk -F "." '
{ $5 = substr($5, 1, 4) }
{count[$2" "$3" "$4-($4%86400)" "$5]++}
END {for(j in count) print j,"["count[j]"]"}'

Output:
cisco 0002 1255046400 svc1 [2]
cisco 0002 1255046400 svc2 [1]
juniper 0001 1255046400 svc1 [1]
cisco 0001 1255046400 svc1 [2]
cisco 0001 1255046400 svc2 [2]
cisco 0001 1254009600 svc1 [1]


Hope you find it useful.

Related post:

- SQL Sum of and group by using awk
- Group by Clause functionality using awk
- Associative array in awk

Wednesday, October 7, 2009

Extract sub-string from variable in bash

Suppose:

$ mypath=/dir1/dir2/dir3/dir4

$ echo $mypath
/dir1/dir2/dir3/dir4

Now, if you need to print the parent path from the above path (i.e. print '/dir1/dir2/dir3')

$ dirname $mypath
/dir1/dir2/dir3

$ parentpath=$(dirname $mypath)

$ echo $parentpath
/dir1/dir2/dir3

Using substring removal in the Bash shell

${string%substring}
It deletes the shortest match of $substring from the 'back' of $string.

$ echo ${mypath%/*}
/dir1/dir2/dir3

or

$ printf '%s\n' "${mypath%/*}"
/dir1/dir2/dir3

If you need to print the last directory name from the above mypath, here are a few ways:

Using substring removal in the Bash shell
${string##substring}
It deletes the longest match of $substring from the 'front' of $string.

$ echo ${mypath##*/}
dir4

Another way using awk:

$ echo $mypath | awk '{print $NF}' FS=\/
dir4
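
To see how the shortest-match and longest-match forms differ, here are all four operators applied to the same path. This is a small sketch; the variable names other than mypath are made up for illustration.

```shell
mypath=/dir1/dir2/dir3/dir4

parent=${mypath%/*}       # shortest '/*' from the back  -> /dir1/dir2/dir3
first_cut=${mypath%%/*}   # longest '/*' from the back   -> empty string
leaf=${mypath##*/}        # longest '*/' from the front  -> dir4
rest=${mypath#*/}         # shortest '*/' from the front -> dir1/dir2/dir3/dir4

printf '%s\n' "$parent" "[$first_cut]" "$leaf" "$rest"
```

For the last component, the standard basename command (basename "$mypath") gives the same result as ${mypath##*/} here.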

Similar post:

- Truncate string using bash script

Friday, October 2, 2009

Directory size excluding sub-directories - Linux

Directory '/home/user/work/demo/' contains a few regular files and two directories say "part2"(size=41236 KB) and "libs"(size=20620 KB).

$ du ~/work/demo/
41236 /home/user/work/demo/part2
20620 /home/user/work/demo/libs
87640 /home/user/work/demo/

From Linux/UNIX DU(1) command man page:

-s, --summarize
display only a total for each argument.

So, the following command is going to display the total size of the directory '/home/user/work/demo/'

$ du -s ~/work/demo/
87640 /home/user/work/demo/

Now, if you need to find the size of the '/home/user/work/demo/' directory excluding the size of the sub-directories, there is a command line option with DU(1):

-S, --separate-dirs
do not include size of sub-directories

So

$ du -S ~/work/demo/
41236 /home/user/work/demo/part2
20620 /home/user/work/demo/libs
25784 /home/user/work/demo/

Now,

$ du -S --max-depth=0 ~/work/demo/
25784 /home/user/work/demo/

or

$ du -S ~/work/demo/ | awk 'END {print}'
25784 /home/user/work/demo/

© Jadu Saikia http://unstableme.blogspot.com