Q: For a sample two-column data below (cont_bd.txt), how to sum the second column and group by the first column:
Input file:
$ cat cont_bd.txt
continent:mval
SA:2345
AF:123
SA:89
OC:890
EU:24
AF:90
NA:5678
AF:345
OC:90
OC:23
SA:1234
EU:90
AF:12
SA:909
Awk solution for group by clause implementation:
$ awk 'BEGIN{FS=":"; print "continent count total avg"} NR!=1 {a[$1]++;b[$1]=b[$1]+$2}END{for (i in a) printf("%s %10.0f %10.0f %10.2f\n", i, a[i], b[i], b[i]/a[i])} ' cont_bd.txt
Output:
continent count total avg
NA 1 5678 5678.00
OC 3 1003 334.33
AF 4 570 142.50
SA 4 4577 1144.25
EU 2 114 57.00

18 comments:
Thanks mate, this is gold! I used your ideas in this funky script. The hardest part was grokking how awk parses through a file in one hit!
@ropata, I just saw your script, great work. Thanks for mentioning me. keep in touch.
Thanks for this fantastic little script - I've used it for some Twitter data crunching here.
@The Hat, Welcome :-)
That's really a good analysis. Thanks for mentioning me. Keep in touch.
Hi Jadu Saikia...
your little script..looks good.. i need some enhancement for my requirement. I have failed to that using ur script.. could you help me to do..
I have 3 columns and i need to group up ...
SA 0 2345
AF 0 123
SA 1 89
OC 0 890
EU 1 24
AF 1 90
NA 0 5678
AF 1 345
OC 1 90
OC 0 23
SA 1 1234
EU 0 90
AF 0 12
SA 0 909
I want the results like this....
continent count0 count1 total0 total1
NA 1 0 5678 0
OC 2 1 913 90
AF 2 2 135 435
SA
EU
@manimaran, I have created a post of your query, please check today's post, hope this helps. Thanks for your comment. Keep in touch.
Hi Jadu Saikia,
Man... u r gr8.... thanks a ton.. u saved my life :)... its working...
Hi Jadu Saika,
I need to get the min and max of the available values.. help me if you can.
Expected output:
Continent count0 count1 sum0 sum1 min max
NA 1 0 5678 0 89 5678
OC 2 1 913 90 23 890
AF 2 2 135 435 12 345
SA 2 2 3254 1323 89 2345
EU 1 1 90 24 24 90
@manimaran, please check the comment on this post:
http://unstableme.blogspot.com/2010/03/awk-count-and-sum-on-multiple-columns.html
Hi Jadu Saikia,
is there any way to find the standard deviation of sum0 and sum1 for each row entry ?
Thanks,
-Manimaran
Tasty AWKness, Jadu! Nice work..saved me a bunch of time.
'sodo
Hi,
I had a similar problem to solve:
I would like to count occurences of the string in a log file, and sort them groupped per hour or minute.
As a total newbie to the linux scripting, I had written something like:
grep 'string' file.log | cut -c 1-23
which returned the result similar to this:
2011-08-06@00:13:43,147
2011-08-06@07:47:30,056
2011-08-06@08:48:15,559
2011-08-06@10:19:07,221
2011-08-06@10:21:45,268
Now I would like to group the results by 2011-08-06@?? and count the lines. The result should look like below:
hour count
2011-08-06@07 1
2011-08-06@17 3
2011-08-06@08 1
2011-08-06@18 1
2011-08-06@10 4
I had modified the given script:
awk 'BEGIN{FS=":"; print " hour count"} NR!=1 {a[$1]++;b[$1]=b[$1]+$2}END{for (i in a) printf("%s %10.0f \n", i, a[i])} ' list.txt
but it omitted the first line starting with
2011-08-06@00
After that I would like to pipe the results of the grep command to the oneliner which does the needed grouping
(I would like to avoid putting scripts into the system, it's rather preferred to copy-paste commands)
Please help :)
John
@John, thanks for the question.
The first line is omitted in your case as in the awk command, you had the statement
NR!=1
NR is nothing but the line number.
e.g.
$ cat file.txt
2011-08-06@00:13:43,147
2011-08-06@07:47:30,056
2011-08-06@08:48:15,559
2011-08-06@10:19:07,221
2011-08-06@10:21:45,268
$ awk '{print NR,$0}' file.txt
1 2011-08-06@00:13:43,147
2 2011-08-06@07:47:30,056
3 2011-08-06@08:48:15,559
4 2011-08-06@10:19:07,221
5 2011-08-06@10:21:45,268
$ cat file.txt | awk -F ":" '{a[$1]++} END {for (i in a) print i,a[i]}'
Output:
2011-08-06@07 1
2011-08-06@08 1
2011-08-06@00 1
2011-08-06@10 2
Hope this helps.
Regards,
Jadu
Great Hint.
Hi Jadu,
I have a nearly same kind of problem which need to be solved by Unix shell-scripting(awk scripting is also allowed).
Input File:
===========
1,B.Sc,Calcutta
1,M.Sc,Stanford
2,M.A.,Pune
3,M.Tech,IIT Guwahati
4,B.Tech,Shibpur
4,M.Tech,Jadavpur
5,B.Lib.,Calcutta
6,B.Sc.,Bangalore
The Output File would be like this:-
==============
2,1,B.Sc,Calcutta,M.Sc,Stanford
1,2,M.A.,Pune
1,3,M.Tech,IIT Guwahati
2,4,B.Tech,Shibpur,M.Tech,Jadavpur
1,5,B.Lib.,Calcutta
1,6,B.Sc.,Bangalore
i.e. ,, in a single line for each student.
For e.g.,Student No 1 has 2 qualifications, B.Sc from Calcutta
and M.Sc from Stanford.
Please suggest a time-efficient(as this needs to be simulated for crores of records) & brilliant way of doing this.I will highly appreciate any help from you.
Waiting for your kind & prompt help...
@Saurav, Thanks for asking the question. Hope this wiil help. Thanks.
$ cat in.txt
1,B.Sc,Calcutta
1,M.Sc,Stanford
2,M.A.,Pune
3,M.Tech,IIT Guwahati
4,B.Tech,Shibpur
4,M.Tech,Jadavpur
5,B.Lib.,Calcutta
6,B.Sc.,Bangalore
$ awk '
BEGIN {FS=OFS=","} {cnt[$1]++}
!A[$1] {A[$1] = $0; next}
{A[$1] = A[$1] "," $2 "," $3}
END {for(i in A) {print cnt[i],A[i]}
}' in.txt
2,1,B.Sc,Calcutta,M.Sc,Stanford
1,2,M.A.,Pune
1,3,M.Tech,IIT Guwahati
2,4,B.Tech,Shibpur,M.Tech,Jadavpur
1,5,B.Lib.,Calcutta
1,6,B.Sc.,Bangalore
Dear Jadu,My heartiest thanks for this magnificient piece of script.
@Saurav Bhattacharyya, my pleasure.
Post a Comment