Computing statistics with awk

Question

This is a follow-up to my previous question, which @fedorgui answered successfully.

I have a table:

pac1 xxx 
pac1 yyy
pac1 zzz
pac2 xxx
pac2 uuu
pac3 zzz
pac3 uuu
pac4 zzz

I need to produce output like this:

pac1 xxx 2/4
pac1 yyy 1/4
pac1 zzz 3/4
pac2 xxx 2/4
pac2 uuu 2/4
pac3 zzz 3/4
pac3 uuu 2/4
pac4 zzz 3/4

The first number is how many times the column-two value occurs, and the second is the number of unique values in column one. Here xxx occurs 2 times in column two and column one has 4 unique values, so the result is 2/4.

A working awk solution:

$ awk 'FNR==NR {col1[$1]++; col2[$2]++; next} {print $0, col2[$2] "/" length(col1)}' file file

But my input may contain duplicated rows, like:

pac1 xxx
pac1 xxx 
pac1 xxx  
pac1 yyy
pac1 zzz
pac2 xxx
pac2 xxx
pac2 xxx
pac2 uuu
pac3 zzz
pac3 uuu
pac4 zzz
pac4 zzz

I need to do the same computation, but counting only unique rows, and then append the statistic to every row, duplicates included:

pac1 xxx 2/4
pac1 xxx 2/4
pac1 xxx 2/4
pac1 yyy 1/4
pac1 zzz 3/4
pac2 xxx 2/4
pac2 xxx 2/4
pac2 xxx 2/4
pac2 uuu 2/4
pac3 zzz 3/4
pac3 uuu 2/4
pac4 zzz 3/4
pac4 zzz 3/4
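
To make the problem concrete, here is a small sketch of what the original command prints on the duplicated input. The function name naive_stats and the temporary file are only illustrative, trailing spaces are left out, and it assumes an awk where length() accepts an array (gawk, for example).

```shell
# Hypothetical helper: the original two-pass command, wrapped in a function.
naive_stats() {
    awk 'FNR==NR {col1[$1]++; col2[$2]++; next}
         {print $0, col2[$2] "/" length(col1)}' "$1" "$1"
}

# The duplicated input from above, written to a temporary file.
dup_input=$(mktemp)
printf '%s\n' 'pac1 xxx' 'pac1 xxx' 'pac1 xxx' 'pac1 yyy' 'pac1 zzz' \
    'pac2 xxx' 'pac2 xxx' 'pac2 xxx' 'pac2 uuu' \
    'pac3 zzz' 'pac3 uuu' 'pac4 zzz' 'pac4 zzz' > "$dup_input"

# Every row is counted, so the six xxx rows inflate the numerator.
naive_stats "$dup_input" | head -n 1   # pac1 xxx 6/4
```

The denominator is still correct because col1 is keyed by value, but the numerator counts rows rather than distinct rows.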

This is more complicated, and I have thousands of rows. Thank you for any ideas.

Answer 1

Just check whether the line is unique when adding to the second array.

awk 'FNR==NR{a[$1];b[$2]+=!c[$1,$2]++;next}{print $0, b[$2] "/" length(a)}' test{,}

pac1 xxx 2/4
pac1 xxx  2/4
pac1 xxx   2/4
pac1 yyy 1/4
pac1 zzz 3/4
pac2 xxx 2/4
pac2 xxx 2/4
pac2 xxx 2/4
pac2 uuu 2/4
pac3 zzz 3/4
pac3 uuu 2/4
pac4 zzz 3/4
pac4 zzz 3/4
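
If the one-liner is hard to read, the same logic can be spelled out roughly like this. It is only a sketch: the function name uniq_stats and the array names names, seen and count are made up for readability, and it assumes an awk where length() accepts an array (gawk, for example).

```shell
# Hypothetical expanded form of the one-liner; reads the same file twice.
uniq_stats() {
    awk '
    FNR == NR {                  # first pass over the file
        names[$1]                # referencing a key creates it: set of column-one values
        if (!seen[$1, $2]++)     # true only the first time this pair appears
            count[$2]++          # so each distinct row is counted once
        next
    }
    # second pass: annotate every row, duplicates included
    { print $0, count[$2] "/" length(names) }
    ' "$1" "$1"
}

sample=$(mktemp)
printf '%s\n' 'pac1 xxx' 'pac1 xxx' 'pac1 yyy' 'pac2 xxx' 'pac3 zzz' > "$sample"
uniq_stats "$sample"
```

On this small sample it prints 2/3 for every xxx row and 1/3 for the yyy and zzz rows, since the repeated pac1 xxx row is counted once.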

Or, if your lines don't have stray trailing spaces like the ones in your example, you could just use $0 instead of $1,$2 as the key.
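
A quick sketch of that variant, and of why trailing spaces matter for it. The function name uniq_stats_whole_line and the sample file are illustrative only, and the same assumption about length() on arrays applies.

```shell
# Hypothetical variant: key the duplicate check on the whole line ($0).
uniq_stats_whole_line() {
    awk 'FNR==NR {a[$1]; b[$2] += !c[$0]++; next}
         {print $0, b[$2] "/" length(a)}' "$1" "$1"
}

sample=$(mktemp)
# The second row has a trailing space, so its $0 differs from the first row.
printf '%s\n' 'pac1 xxx' 'pac1 xxx ' 'pac2 xxx' > "$sample"
uniq_stats_whole_line "$sample" | head -n 1   # pac1 xxx 3/2
```

With $1,$2 as the key the first two rows would count as one and the result would be 2/2, so $0 is only safe when the whitespace is consistent.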

Solution 1
5 Accepted 2017-06-20 16:20:28
