
Basic grep/sed/awk script to find duplicates

I'm starting out with regular expressions and grep, and I want to find out how to do this. I have this list:

1. 12493 6530
2. 12475 5462
3. 12441 5450
4. 12413 5258
5. 12478 4454
6. 12416 3859
7. 12480 3761
8. 12390 3746
9. 12487 3741
10. 12476 3557
...

And I want to get the contents of the middle column only (so NF==2 in awk?). The delimiter here is a space.

I then want to find which numbers appear more than once (duplicates). How would I go about doing that? Thank you, I'm a beginner.

Using awk:

awk '{count[$2]++}END{for (a in count) {if (count[a] > 1 ) {print a}}}' file

Note that the sample shown has no duplicate numbers in the 2nd column, so this prints nothing for it.

  • the second column in awk is $2 (NF is the number of fields on the line, not a column selector)
  • count[$2]++ increments an array entry, using the number being processed as the key
  • the END block is executed after the last line has been read; there we test each array value and print the keys whose count is greater than 1
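
Since the list in the question shows no repeated values, here is a minimal sketch of the same command run on made-up input. The data and the variable names sample and dups are assumptions for illustration only, not part of the original question.

```shell
# Hypothetical sample input: in column 2, 12493 appears twice and 12475 three times.
sample=$(mktemp)
cat > "$sample" <<'EOF'
1. 12493 6530
2. 12475 5462
3. 12493 5450
4. 12413 5258
5. 12475 4454
6. 12475 3859
EOF

# Count each column-2 value, then print the values seen more than once.
# "for (a in count)" visits keys in an unspecified order, so sort for stable output.
dups=$(awk '{count[$2]++} END {for (a in count) if (count[a] > 1) print a}' "$sample" | sort)
printf '%s\n' "$dups"
```

With this input the command should print 12475 and 12493, one per line; 12413 is left out because it occurs only once.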

A more concise version (credit to jthill):

awk '++count[$2]==2{print $2}' file
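
To see why this one-liner reports each duplicate only once, it may help to run it on input where a value occurs three times. The sketch below uses made-up data; the names sample2 and out are hypothetical.

```shell
# Hypothetical sample input: 12493 occurs twice, 12475 three times in column 2.
sample2=$(mktemp)
printf '%s\n' \
  '1. 12493 6530' \
  '2. 12475 5462' \
  '3. 12493 5450' \
  '4. 12475 4454' \
  '5. 12475 3859' > "$sample2"

# The pattern is true only when the pre-incremented count reaches exactly 2,
# i.e. on the second occurrence; third and later occurrences print nothing.
out=$(awk '++count[$2]==2{print $2}' "$sample2")
printf '%s\n' "$out"
```

Here the output order follows the input: 12493 is printed when line 3 is read and 12475 when line 4 is read. No END block or sorting is needed, and the result is available as soon as a duplicate is seen.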

Using perl:

perl -anE '$h{$F[1]}++; END{ say for grep $h{$_} > 1, keys %h }'

Iterate over the lines and build a hash ( %h / $h{...} ) holding the count ( ++ ) of each second-column value ( $F[1] ); after that ( END{ ... } ), say every hash key whose count ( $h{$_} ) is > 1 .

With the data stored in a file named test, using a combination of the awk, sort, uniq and sed commands:

 awk '{print $2}' test | sort | uniq -c | sed '/^ *1 /d' | awk '{print $2}'

Explanation:

awk '{print $2}'

selects the 2nd column

sort

puts identical values on adjacent lines, which uniq requires

uniq -c

prefixes each distinct value with the number of times it appears

sed '/^ *1 /d'

deletes all entries that appear only once; the " *" allows for the leading spaces with which common uniq implementations (GNU, BSD) pad the count

awk '{print $2}'

strips the count again, leaving only the number
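
As a possible shortcut, not part of the original answer: uniq has a -d option that prints one copy of each repeated line, which would make the counting and filtering steps unnecessary. A minimal sketch, assuming made-up input in a temporary file (the names sample3 and dups3 are hypothetical):

```shell
# Hypothetical sample input with two repeated column-2 values.
sample3=$(mktemp)
printf '%s\n' \
  '1. 12493 6530' \
  '2. 12475 5462' \
  '3. 12493 5450' \
  '4. 12413 5258' \
  '5. 12475 4454' > "$sample3"

# Select column 2, sort so equal values are adjacent, keep only repeated values.
dups3=$(awk '{print $2}' "$sample3" | sort | uniq -d)
printf '%s\n' "$dups3"
```

The output is in sorted order (12475, then 12493) rather than in order of appearance, because uniq only compares neighbouring lines.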
