
Awk sum columns from 2 different matching fields

I have a dataset with 2 different row identifiers. I would like to compute the ratio between 2 separate columns, grouped by those 2 identifiers, and write the result to a separate file.

For example:

Input

 Avpr1a CG  1 30
 Avpr1a CHG 2 15
 Avpr1a CHH 1 15
 Avpr1a CG  2 25
 Avpr1a CHG 5 15
 Avpr1a CHH 8 15
 BDNF   CG  1 15
 BDNF   CHG 2 15
 BDNF   CHH 3 10
 BDNF   CG  8 20

What I want is, for each combination of $1 and $2, the ratio of the sum of $3 to the sum of $4 (for example, Avpr1a CG: 3/55 = 0.05), giving the following:

Output

 Avpr1a CG  0.05
 Avpr1a CHG 0.233
 Avpr1a CHH 0.3
 BDNF   CG  0.xxx
 BDNF   CHG 0.xxx
 BDNF   CHH 0.xx

You get the idea.

I am currently doing it in a clumsy way: summing each column separately, merging the results, and then dividing:

awk '{a[$1,$2]+=$3}END{for(i in a){print i, a[i]}}'
awk '{a[$1,$2]+=$4}END{for(i in a){print i, a[i]}}'
merge
awk and print $3/$4 from intermediate files

Is it possible to achieve what I want to do in a single awk command?

Thank you!

Yes, it is even fairly easy:

awk '{s1[$1,$2] = $1; s2[$1,$2] = $2; s3[$1,$2] += $3; s4[$1,$2] += $4}
     END { for (i in s3) print s1[i], s2[i], s3[i]/s4[i] }' data

Output:

Avpr1a CG 0.0545455
BDNF CHG 0.133333
BDNF CHH 0.3
Avpr1a CHG 0.233333
BDNF CG 0.257143
Avpr1a CHH 0.3
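
The order of the lines above is whatever `for (i in s3)` happens to produce, which awk does not specify, and the ratios use awk's default number formatting. A minimal sketch of one way to get a stable order and three decimal places, assuming the sample input is saved in a file named data and that groups whose $4 sum is zero should simply be skipped (both are assumptions, as is the output file name ratios_rounded.txt):

```shell
# Recreate the sample input from the question (file name "data" as in the answer).
cat > data <<'EOF'
Avpr1a CG  1 30
Avpr1a CHG 2 15
Avpr1a CHH 1 15
Avpr1a CG  2 25
Avpr1a CHG 5 15
Avpr1a CHH 8 15
BDNF   CG  1 15
BDNF   CHG 2 15
BDNF   CHH 3 10
BDNF   CG  8 20
EOF

# Same accumulation as the answer; printf fixes the precision,
# the s4[i] != 0 test avoids a division-by-zero error (assumed policy: skip),
# and sort gives a reproducible order.
awk '{ s1[$1,$2] = $1; s2[$1,$2] = $2; s3[$1,$2] += $3; s4[$1,$2] += $4 }
     END {
       for (i in s3)
         if (s4[i] != 0)
           printf "%s %s %.3f\n", s1[i], s2[i], s3[i] / s4[i]
     }' data | LC_ALL=C sort > ratios_rounded.txt

cat ratios_rounded.txt
```

Redirecting to a file also covers the "output into a separate file" part of the question.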

If you don't capture the separate items in s1 and s2 but print i instead, you get output with the \034 character (the default value of awk's SUBSEP) separating the two name fields. You can fix that, with tr for example, but it is simpler not to need to do so.
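
Another option, not part of the original answer, is to keep only the two sum arrays and split the combined key on SUBSEP in the END block. A minimal sketch, assuming the sample input is saved in a file named data; the array names num, den and f and the output file ratios.txt are chosen here for illustration:

```shell
# Recreate the sample input from the question.
cat > data <<'EOF'
Avpr1a CG  1 30
Avpr1a CHG 2 15
Avpr1a CHH 1 15
Avpr1a CG  2 25
Avpr1a CHG 5 15
Avpr1a CHH 8 15
BDNF   CG  1 15
BDNF   CHG 2 15
BDNF   CHH 3 10
BDNF   CG  8 20
EOF

# num and den are keyed on ($1,$2); awk joins the two subscripts with SUBSEP.
# In END, split() on SUBSEP recovers the original fields into f[1] and f[2].
awk '{ num[$1,$2] += $3; den[$1,$2] += $4 }
     END {
       for (k in num) {
         split(k, f, SUBSEP)
         print f[1], f[2], num[k] / den[k]
       }
     }' data | LC_ALL=C sort > ratios.txt

cat ratios.txt
```

This saves two arrays at the cost of one split() call per group. It relies on neither field containing the SUBSEP character itself, which holds for ordinary text input.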
