
Find duplicates of a file by name in a directory recursively - Linux

I have a folder that contains subfolders, which in turn contain more files.

The files are named in the following way:

abc.DEF.xxxxxx.dat

I'm trying to find duplicate files by matching only the 'xxxxxx' part of the above pattern and ignoring the rest. The extension .dat doesn't change, but the lengths of abc and DEF might. The order of the period-separated parts doesn't change either.

I'm guessing I need to use find in the following way:

find -regextype posix-extended -regex '\w+\.\w+\.\w+\.dat'

I need help coming up with the regular expression. Thanks.

Example: For a file named 'epg.ktt.crwqdd.dat', I need to find duplicate files containing 'crwqdd'.

You can use awk for that:

find /path -type f -name '*.dat' | awk -F. 'a[$(NF-1)]++'

Explanation:

Suppose find gives the following output:

./abd.DdF.TTDFDF.dat
./cdd.DxdsdF.xxxxxx.dat
./abc.DEF.xxxxxx.dat
./abd.DdF.xxxxxx.dat
./abd.DEF.xxxxxx.dat

Basically, in computer terms, you want to count the occurrences of the pattern between the last two dots (the part just before .dat) and print those lines where the pattern appears for the second time or later.

To achieve this we split the file names on the dot (.), which gives us 5(!) fields:

 echo ./abd.DEF.xxxxxx.dat | awk -F. '{print $1 " " $2 " " $3 " " $4  " " $5}'
  /abd DEF xxxxxx dat

Note the first, empty field. The pattern of interest is $4.

To count the occurrences of a pattern in $4 we use an associative array a and increment its value on each occurrence. Since a[$4]++ is a post-increment, it evaluates to the count before the current line, so it is 0 on the first occurrence and greater than 0 afterwards. Unoptimized, the awk command looks like this:

... | awk -F. '{if(a[$4]++ > 0){print}}'

However, you can write an awk program in the form:

CONDITION { ACTION }

Which gives us:

... | awk -F. 'a[$4]++ > 0 {print}'

print is the default action in awk. It prints the whole current line. As it is the default action, it can be omitted. The > 0 check can also be omitted because awk treats any non-zero number as true. This gives us the final command:

... | awk -F. 'a[$4]++' 
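
To see why the bare condition is enough, it can help to print the value the expression evaluates to on each line. The following is only a small sketch; the keys x and y are made up and stand in for the 'xxxxxx' part of the file names:

```shell
# Print each key together with the value that a[$1]++ evaluates to.
# The post-increment yields the count *before* the current line is added.
counts=$(printf 'x\ny\nx\nx\n' | awk '{ print $1, a[$1]++ }')
echo "$counts"
```

The first time a key is seen the value is 0, which awk treats as false, so that line would not be printed by the one-liner. Every later occurrence yields a non-zero value and is printed.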

To generalize the command, note that the pattern of interest isn't really the 4th field, it is the next-to-last field. This matters as soon as the path lacks the leading ./ or contains additional dots, because the field numbers then shift. It can be expressed with awk's built-in variable NF, which holds the number of fields:

... | awk -F. 'a[$(NF-1)]++'

Output:

./abc.DEF.xxxxxx.dat
./abd.DdF.xxxxxx.dat
./abd.DEF.xxxxxx.dat
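
To try this out without touching real data, the command can be run against a throwaway directory tree. This is a sketch under a few assumptions: the sub/ directory and the choice of four sample files are made up for illustration, mktemp -d is assumed to be available, and the sort step is added only to make the result reproducible, since find does not guarantee any particular order.

```shell
# Build a temporary tree with some of the sample names; sub/ is hypothetical.
dir=$(mktemp -d)
mkdir -p "$dir/sub"
touch "$dir/abd.DdF.TTDFDF.dat" "$dir/cdd.DxdsdF.xxxxxx.dat" \
      "$dir/abc.DEF.xxxxxx.dat" "$dir/sub/abd.DdF.xxxxxx.dat"

# Sort the paths for a stable order, then print every path whose
# next-to-last dot-separated field has been seen before.
dupes=$(find "$dir" -type f -name '*.dat' | LC_ALL=C sort | awk -F. 'a[$(NF-1)]++')

# Strip the temporary prefix so only the relative names are shown.
result=$(printf '%s\n' "$dupes" | sed "s|^$dir/||")
echo "$result"

rm -rf "$dir"
```

With these four files the sketch should print cdd.DxdsdF.xxxxxx.dat and sub/abd.DdF.xxxxxx.dat. The first file of each group (here abc.DEF.xxxxxx.dat) is not printed, and which file counts as "first" depends purely on the input order. The temporary directory name may itself contain a dot, which is harmless with $(NF-1) because the field is counted from the end of the path.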
