Remove duplicates ignoring specific columns
I want to remove all duplicates from a file while ignoring the first two columns, i.e. without comparing those columns.
This is my example input:
111 06:22 apples, bananas and pears
112 06:28 bananas
113 07:07 apples, bananas and pears
114 07:23 apples and bananas
115 08:01 bananas and pears
116 08:23 pears
117 09:22 apples, bananas and pears
118 12:23 apples and bananas
I want this output:
111 06:22 apples, bananas and pears
112 06:28 bananas
114 07:23 apples and bananas
115 08:01 bananas and pears
116 08:23 pears
I've tried the command below, but it only compares the third column and ignores the rest of the line:
awk '!seen[$3]++' sample.txt
Store `$0` in a temporary variable, set `$1` and `$2` to empty, then use the newly composed `$0` as the key:
awk '{ t = $0; $1 = $2 = "" } !seen[$0]++ { print t }' sample.txt
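As a side note, assigning to `$1` and `$2` makes awk rebuild `$0` using `OFS`, which is exactly why the original line is saved in `t` first. A minimal way to try this from a shell, recreating the question's `sample.txt`:

```shell
# Recreate the sample input from the question.
cat > sample.txt <<'EOF'
111 06:22 apples, bananas and pears
112 06:28 bananas
113 07:07 apples, bananas and pears
114 07:23 apples and bananas
115 08:01 bananas and pears
116 08:23 pears
117 09:22 apples, bananas and pears
118 12:23 apples and bananas
EOF

# Save the full line in t, blank the first two fields (which
# rebuilds $0), dedup on the rebuilt $0, and print the saved line.
awk '{ t = $0; $1 = $2 = "" } !seen[$0]++ { print t }' sample.txt
```

This prints the five unique lines (111, 112, 114, 115, 116), dropping 113, 117 and 118 as duplicates.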
You might use the `substr` string function to get the desired part of the line for the comparison. Let the content of `file.txt` be
111 06:22 apples, bananas and pears
112 06:28 bananas
113 07:07 apples, bananas and pears
114 07:23 apples and bananas
115 08:01 bananas and pears
116 08:23 pears
117 09:22 apples, bananas and pears
118 12:23 apples and bananas
then
awk '!arr[substr($0,11)]++' file.txt
gives this output:
111 06:22 apples, bananas and pears
112 06:28 bananas
114 07:23 apples and bananas
115 08:01 bananas and pears
116 08:23 pears
Explanation: keep only the lines that are unique when compared by the substring of the whole line (`$0`) starting at the 11th character.
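Note that `substr($0,11)` relies on the first two columns always occupying exactly ten characters (a 3-character ID, a space, a 5-character time, and a space). If the column widths can vary, a field-based key is safer; one hedged alternative (not from the original answers) is to build the key from field 3 onward:

```shell
# Recreate the sample data from the answer.
cat > file.txt <<'EOF'
111 06:22 apples, bananas and pears
112 06:28 bananas
113 07:07 apples, bananas and pears
114 07:23 apples and bananas
115 08:01 bananas and pears
116 08:23 pears
117 09:22 apples, bananas and pears
118 12:23 apples and bananas
EOF

# Concatenate fields 3..NF into key k, so the comparison is
# independent of how wide the first two columns are.
awk '{ k = ""; for (i = 3; i <= NF; i++) k = k OFS $i } !seen[k]++' file.txt
```

The default action prints the whole original line, so the output matches the `substr` version on this input.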
(tested in GNU Awk 5.0.1)