
awk creating column holding number of duplicates based on specific columns data

In the following data.txt file, the values in the 2nd and 3rd columns repeat across several rows (although the rows themselves are not identical):

$ cat data.txt
Julie   Andrews jand    109
Julie   Andrews jand    119
John    Thomas  jd      301
Alex    Tremble atrem   415
Alex    Tremble atrem   3415
Alan    Tremble atrem   215
John    Tomas   jd      302
John    Tomas   jd      3302
John    Tomas   jd      3402
John    Tomas   jd      33302
Alex    Trebe   atrem   416

How do I add a 5th column to each row designating the maximal number of repetitions, based on the content of columns 2 & 3, so that the desired output looks like this:

$ cat desired.output.txt
Julie   Andrews jand    109     2
Julie   Andrews jand    119     2
John    Thomas  jd      301     1
Alex    Tremble atrem   415     3
Alex    Tremble atrem   3415    3
Alan    Tremble atrem   215     3
John    Tomas   jd      302     4
John    Tomas   jd      3302    4
John    Tomas   jd      3402    4
John    Tomas   jd      33302   4
Alex    Trebe   atrem   416     1

Currently I have the following command, which appends a simple running counter per replica (however, this is not the desired output):

awk -F "\t" '{OFS="\t"}{print $0,++cnt[$2,$3]}' data.txt
Julie   Andrews jand    109     1
Julie   Andrews jand    119     2
John    Thomas  jd      301     1
Alex    Tremble atrem   415     1
Alex    Tremble atrem   3415    2
Alan    Tremble atrem   215     3
John    Tomas   jd      302     1
John    Tomas   jd      3302    2
John    Tomas   jd      3402    3
John    Tomas   jd      33302   4
Alex    Trebe   atrem   416     1

The easiest approach, which also works for unsorted files, is to scan the input file twice:

$ awk -v OFS='\t' 'NR==FNR {count[$2,$3]++; next} 
                           {print $0, count[$2,$3]}' file{,}

Julie   Andrews jand    109     2
Julie   Andrews jand    119     2
John    Thomas  jd      301     1
Alex    Tremble atrem   415     3
Alex    Tremble atrem   3415    3
Alan    Tremble atrem   215     3
John    Tomas   jd      302     4
John    Tomas   jd      3302    4
John    Tomas   jd      3402    4
John    Tomas   jd      33302   4
Alex    Trebe   atrem   416     1

If your file is sorted, or is too big to read twice, you can collect the entries of each group and print them with their count when the key changes, as sketched below.
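A minimal sketch of that idea (not part of the original answer), assuming the input is first sorted on columns 2 and 3 so that each key group is contiguous:

$ sort -t$'\t' -k2,3 data.txt |
  awk -v OFS='\t' '
      # flush(): print every buffered line of the current group with the group size, then reset
      function flush(  i) { for (i = 1; i <= n; i++) print buf[i], n; n = 0 }
      { key = $2 SUBSEP $3 }
      key != prev { flush(); prev = key }   # key changed: emit the previous group
      { buf[++n] = $0 }                     # buffer the current line
      END { flush() }                       # emit the last group
  '

Note that the output then follows the sorted order rather than the original line order, and only one group is held in memory at a time.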

P.S. Note that file{,} is a bash shorthand for file file, used here to process the same file twice.
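For example, the shell performs the brace expansion before awk ever sees its arguments:

$ echo data.txt{,}
data.txt data.txt

so awk receives data.txt twice, and NR==FNR is true only while the first copy is being read.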
