I have a multi-field text file. I'd like a command that combines the behavior of sort -n -u -k and uniq -c: that is, sort the file on a certain key field and then provide the number of duplicates, prepended or appended to the original line. At the moment, I can either sort on the key and obtain the first of the duplicated lines, without the number of duplicates, using sort -n -u -k, or count the number of duplicates with uniq -c after extracting the key field.
Can you suggest a command that implements both behaviors?
An example of the file (the key can be any of the columns shown):
4549 1 22656489 63452157 3235 1116 612 532275 6009800 534075 6012488 477375 5995844 533175 6011144 8388615 236
4549 2 22656489 63452158 3214 1116 613 532275 6009825 534075 6012488 477375 5995831 533175 6011157 8388615 236
4549 3 22656489 63452159 3193 1116 614 532275 6009850 534075 6012488 477375 5995819 533175 6011169 8388615 236
4549 4 22656489 63452160 3173 1116 615 532275 6009875 534075 6012488 477375 5995806 533175 6011182 8388615 235
4549 5 22656489 63452161 3152 1116 616 532275 6009900 534075 6012488 477375 5995794 533175 6011194 8388615 235
4549 6 22656489 63452162 3131 1116 617 532275 6009925 534075 6012488 477375 5995781 533175 6011207 8388615 235
4549 7 22656489 63452163 3111 1116 618 532275 6009950 534075 6012488 477375 5995769 533175 6011219 8388615 235
4549 8 22656489 63452164 3091 1116 619 532275 6009975 534075 6012488 477375 5995756 533175 6011232 8388615 234
4549 9 22656489 63452165 3070 1116 620 532275 6010000 534075 6012488 477375 5995744 533175 6011244 8388615 234
4549 10 22656489 63452166 3050 1116 621 532275 6010025 534075 6012488 477375 5995731 533175 6011257 8388615 234
4549 11 22656489 63452167 3030 1116 622 532275 6010050 534075 6012488 477375 5995719 533175 6011269 8388615 234
As I currently understand it, you want to specify one or more columns to use as a key and obtain a result with each output line showing the multiplicity for that key. In that case, suppose your data is in a file called "data" and we want column 17 as the key:
$ awk '{print $17}' data | sort -n | uniq -c
4 234
4 235
3 236
Thus, the value of 236 appears in column 17 a total of 3 times in your test data. Or, suppose you wanted columns 6, 8, 1, and 3 as the key (and in that order):
$ awk '{print $6,$8,$1,$3}' data | sort -n | uniq -c
11 1116 532275 4549 22656489
For this key, all 11 lines are dups.
This approach has three steps. First, awk selects the columns you want in the order you want. Second, sort -n sorts them numerically on the keys. Lastly, uniq -c counts the duplicates.
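The sort step matters because uniq only compares adjacent lines. A minimal sketch on made-up lines (not the data above, and with column 3 standing in for the key) shows the difference:

```shell
# Made-up sample: three lines whose third column is the key.
lines='a 1 236
b 2 234
c 3 236'

# Without sorting, the two 236 keys are not adjacent, so uniq sees three runs.
unsorted_groups=$(printf '%s\n' "$lines" | awk '{print $3}' | uniq -c | awk 'END {print NR}')

# After sort -n, equal keys sit next to each other and collapse into two groups.
sorted_groups=$(printf '%s\n' "$lines" | awk '{print $3}' | sort -n | uniq -c | awk 'END {print NR}')

echo "unsorted: $unsorted_groups groups, sorted: $sorted_groups groups"
```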
UPDATE: Suppose, as above, we want to use columns 6, 8, 1, and 3 as the key but, as per your comment, we want to keep one of the original lines. In this case we have awk put the original 17 columns before the key, tell sort to sort on the key (columns 18 onward), and then tell uniq to ignore the first 17 columns:
awk '{print $0,$6,$8,$1,$3}' data | sort -k18 -n | uniq -f 17 -c
For your sample data, this results in:
11 4549 10 22656489 63452166 3050 1116 621 532275 6010025 534075 6012488 477375 5995731 533175 6011257 8388615 234 1116 532275 4549 22656489
If you only want the original 17 columns printed, then we can use perl to show just the first 17 columns and crop off the key:
awk '{print $0,$6,$8,$1,$3}' data | sort -k18 -n | uniq -f 17 -c | perl -nle '@a=split;print join " ", @a[0..17]'
which results in:
11 4549 10 22656489 63452166 3050 1116 621 532275 6010025 534075 6012488 477375 5995731 533175 6011257 8388615 234
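As an alternative to the sort and uniq pipeline (not part of the answer above), the same "count plus one original line" result can probably be had in a single awk pass using associative arrays. This is only a sketch on made-up 3-column data with column 3 as the key; the array names first, count, and order are illustrative choices:

```shell
result=$(printf '%s\n' 'a 1 236' 'b 2 234' 'c 3 236' 'd 4 235' 'e 5 234' 'f 6 236' |
awk '
    {
        key = $3                      # for several columns: key = $6 FS $8 FS $1 FS $3
        if (!(key in first)) {        # first time this key is seen
            first[key] = $0           # remember the original line
            order[++n] = key          # remember first-seen order
        }
        count[key]++
    }
    END {
        # Print one line per key: the count, then the first line seen for it.
        for (i = 1; i <= n; i++)
            print count[order[i]], first[order[i]]
    }
')
printf '%s\n' "$result"
```

Here the line kept for each key is the first one in the file, and groups come out in first-seen order rather than key order. Piping the result through sort -n -k4,4 would order it by the key (column 3 becomes column 4 once the count is prepended).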
Using decorate-sort-undecorate, you can append to the data the fields you want to base your processing on, do the processing, and then remove the extra fields. E.g., to sort on fields 17 and 5:
awk '{print $0 OFS $17 OFS $5}' test_s | sort -n -k18,18 -k19,19 | uniq -c -f17 | awk '{NF=18;print}'
You first append the key fields, then sort and uniq on them, and finally keep only the count added by uniq and the original fields.
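The column numbers 18, 19, and 17 above are tied to a 17-column file. A rough sketch of the same decorate-sort-undecorate idea that works the positions out from the data, assuming every line has the same number of fields and using a made-up 3-column sample with column 3 as the key (the variable names nf and key_col are illustrative):

```shell
# Made-up 3-column sample; column 3 plays the role of the key.
file=$(mktemp)
printf '%s\n' 'a 1 236' 'b 2 234' 'c 3 236' 'd 4 235' 'e 5 234' 'f 6 236' > "$file"

# Assumption: every line has as many fields as the first one.
nf=$(awk 'NR == 1 { print NF; exit }' "$file")
key_col=$((nf + 1))   # position of the appended key

result=$(
    awk '{ print $0, $3 }' "$file" |            # decorate: append the key
    sort -n -k"$key_col,$key_col" |             # sort on the appended key only
    uniq -c -f "$nf" |                          # skip the original fields, count the rest
    awk -v n="$key_col" '{ NF = n; print }'     # undecorate: keep count + original fields
)
printf '%s\n' "$result"
rm -f "$file"
```

Assigning to NF to drop trailing fields works in gawk and mawk, but older awk implementations may not rebuild the line, so treat that last step as an assumption about your awk.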