
Remove duplicates ignoring specific columns

I want to remove all duplicates from a file but ignoring the first 2 columns, I mean don't comparing those columns.我想从文件中删除所有重复项但忽略前两列,我的意思是不要比较这些列。

This is my example input:

111  06:22  apples, bananas and pears
112  06:28  bananas
113  07:07  apples, bananas and pears
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears
117  09:22  apples, bananas and pears
118  12:23  apples and bananas

I want this output:

111  06:22  apples, bananas and pears
112  06:28  bananas
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears

I've tried the command below, but it only compares the third column and ignores the rest of the line:

awk '!seen[$3]++' sample.txt
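
For example, printing just the field that this command compares shows the problem (using the same sample.txt):

awk '{ print $3 }' sample.txt   # the keys are only "apples,", "bananas", "apples", "pears", ...

Line 115 would then be dropped because its third field, bananas, was already seen on line 112, even though the full lines differ.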

Store $0 in a temporary variable, set $1 and $2 to empty strings, then use the newly rebuilt $0 as the key:

awk '{ t = $0; $1 = $2 = "" } !seen[$0]++ { print t }' sample.txt
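The same one-liner written out with comments, as an equivalent sketch of what it does:

awk '{
    t = $0             # remember the original line for printing
    $1 = $2 = ""       # blank the first two fields; awk rebuilds $0 from the remaining ones
}
!seen[$0]++ {          # the rebuilt $0 (columns 3 onward) is the deduplication key
    print t            # print the untouched original line the first time its key appears
}' sample.txt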

You can use the substr string function to get the desired part of each line for comparison. Let file.txt contain:

111  06:22  apples, bananas and pears
112  06:28  bananas
113  07:07  apples, bananas and pears
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears
117  09:22  apples, bananas and pears
118  12:23  apples and bananas

then

awk '!arr[substr($0,11)]++' file.txt

gives this output:

111  06:22  apples, bananas and pears
112  06:28  bananas
114  07:23  apples and bananas
115  08:01  bananas and pears
116  08:23  pears

Explanation: keep only the lines whose substring of the whole line ($0), starting at the 11th character (i.e. everything after the first two fixed-width columns), has not been seen before.

(tested in GNU Awk 5.0.1)
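
If the first two columns are not always a fixed width, a variant of the same idea (a sketch, not part of the original answer, assuming every line has at least three whitespace-separated fields) computes the offset with match() instead of hard-coding 11; match() sets RLENGTH to the length of the matched prefix:

awk '{ match($0, /^[^[:space:]]+[[:space:]]+[^[:space:]]+[[:space:]]+/) }   # length of columns 1-2 plus the spaces after them
     !seen[substr($0, RLENGTH + 1)]++' file.txt

On this input it gives the same output as above, but it keeps working if the line numbers or times grow wider than 10 characters.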
