
awk: creating a column holding the number of duplicates based on specific columns' data

In the following data.txt file, the values in the 2nd and 3rd columns repeat across several rows (although the rows themselves are not identical):

$ cat data.txt
Julie   Andrews jand    109
Julie   Andrews jand    119
John    Thomas  jd      301
Alex    Tremble atrem   415
Alex    Tremble atrem   3415
Alan    Tremble atrem   215
John    Tomas   jd      302
John    Tomas   jd      3302
John    Tomas   jd      3402
John    Tomas   jd      33302
Alex    Trebe   atrem   416

How can I add a 5th column to each row giving the total number of rows that share that row's columns 2 & 3, so that the desired output looks like this:

$ cat desired.output.txt
Julie   Andrews jand    109     2
Julie   Andrews jand    119     2
John    Thomas  jd      301     1
Alex    Tremble atrem   415     3
Alex    Tremble atrem   3415    3
Alan    Tremble atrem   215     3
John    Tomas   jd      302     4
John    Tomas   jd      3302    4
John    Tomas   jd      3402    4
John    Tomas   jd      33302   4
Alex    Trebe   atrem   416     1

Currently I have the following command, which prints a running counter for each duplicate (however, this is not the desired output):

awk -F "\t" '{OFS="\t"}{print $0,++cnt[$2,$3]}' data.txt
Julie   Andrews jand    109     1
Julie   Andrews jand    119     2
John    Thomas  jd      301     1
Alex    Tremble atrem   415     1
Alex    Tremble atrem   3415    2
Alan    Tremble atrem   215     3
John    Tomas   jd      302     1
John    Tomas   jd      3302    2
John    Tomas   jd      3402    3
John    Tomas   jd      33302   4
Alex    Trebe   atrem   416     1

The easiest approach, and one that also works for unsorted files, is to scan the input file twice: the first pass counts each key, the second pass prints each line with its final count.

$ awk -v OFS='\t' 'NR==FNR {count[$2,$3]++; next} 
                           {print $0, count[$2,$3]}' file{,}

Julie   Andrews jand    109     2
Julie   Andrews jand    119     2
John    Thomas  jd      301     1
Alex    Tremble atrem   415     3
Alex    Tremble atrem   3415    3
Alan    Tremble atrem   215     3
John    Tomas   jd      302     4
John    Tomas   jd      3302    4
John    Tomas   jd      3402    4
John    Tomas   jd      33302   4
Alex    Trebe   atrem   416     1

If your file is sorted on columns 2 and 3 (or is too big to scan twice), you can instead buffer the entries of each group and print them with the count when the key changes.
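A sketch of that one-pass, group-flush variant (not from the original answer): it assumes rows sharing the same columns 2 and 3 are adjacent, and uses a small hypothetical sample file `sorted.txt` in the same layout as data.txt:

```shell
# Hypothetical pre-sorted sample (rows with equal $2/$3 are adjacent)
printf 'Alex\tTremble\tatrem\t415\nAlex\tTremble\tatrem\t3415\nJohn\tTomas\tjd\t302\n' > sorted.txt

awk -F '\t' -v OFS='\t' '
    $2 SUBSEP $3 != key {                # key changed: flush the buffered group
        for (i = 1; i <= n; i++) print buf[i], n
        n = 0
        key = $2 SUBSEP $3
    }
    { buf[++n] = $0 }                    # buffer the current group
    END {                                # flush the last group
        for (i = 1; i <= n; i++) print buf[i], n
    }
' sorted.txt
```

Only one group is held in memory at a time, so this scales to files that are too large to buffer whole.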

P.S. Note that `file{,}` is a bash brace-expansion shorthand for `file file`, which makes awk process the same file twice.
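If reading the file twice is undesirable and the whole file fits in memory, a one-pass alternative (a sketch, not part of the original answer) is to buffer every line and its key, then print everything with the final counts in the END block:

```shell
# One-pass, in-memory sketch for unsorted input: the whole file is buffered,
# so input order is preserved without a second read.
awk -F '\t' -v OFS='\t' '
    {
        line[NR] = $0                  # remember each line in input order
        key[NR]  = $2 SUBSEP $3        # and its composite key
        count[$2 SUBSEP $3]++          # tally duplicates of that key
    }
    END {
        for (i = 1; i <= NR; i++) print line[i], count[key[i]]
    }
' data.txt
```

This trades memory (proportional to the file size) for a single read, whereas the `file{,}` version trades a second read for near-constant memory.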
