
Using Perl to cleanup a filesystem with one or more duplicates

I have two disks: one is an ad-hoc backup disk, which is a mess with duplicates everywhere, and the other is the disk in my laptop, which is an equally bad mess. I need to back up the unique files and delete the duplicates. So, I need to do the following:

  • Find all non-zero size files
  • Calculate the MD5 digest of all files
  • Find files with duplicate file names
  • Separate unique files from master and other copies.

With the output of this script I will:

  • Back up the unique and master files
  • Delete the other copies

Unique file = no other copies

Master copy = first instance, where other copies exist, possibly matching preferential path

Other copies = not master copies

I've created the appended script, which seems to make sense to me, but:

total files != unique files + master copies + other copies

I have two questions:

  1. Where's the error in my logic?
  2. Is there a more efficient way of doing this?

I chose on-disk hashes so that I don't run out of memory when processing enormous file lists.

#!/usr/bin/perl

use strict;
use warnings;
use DB_File;
use File::Spec;
use Digest::MD5;

my $path_pref = '/usr/local/bin';
my $base = '/var/backup/test';

my $find = "$base/find.txt";
my $files = "$base/files.txt";

my $db_duplicate_file = "$base/duplicate.db";
my $db_duplicate_count_file = "$base/duplicate_count.db";
my $db_unique_file = "$base/unique.db";
my $db_master_copy_file = "$base/master_copy.db";
my $db_other_copy_file = "$base/other_copy.db";

open (FIND, "< $find");
open (FILES, "> $files");

print "Extracting non-zero files from:\n\t$find\n";
my $total_files = 0;
while (my $path = <FIND>) {
  chomp($path);
  next if ($path =~ /^\s*$/);
  if (-f $path && -s $path) {
    print FILES "$path\n";
    $total_files++;
    printf "\r$total_files";
  }
}

close(FIND);
close(FILES);
open (FILES, "< $files");

sub compare {
  my ($key1, $key2) = @_;
  $key1 cmp $key2;
}

$DB_BTREE->{'compare'} = \&compare;

my %duplicate_count = ();

tie %duplicate_count, "DB_File", $db_duplicate_count_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_duplicate_count_file: $!\n";

my %unique = ();

tie %unique, "DB_File", $db_unique_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_unique_file: $!\n";

my %master_copy = ();

tie %master_copy, "DB_File", $db_master_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_master_copy_file: $!\n";

my %other_copy = ();

tie %other_copy, "DB_File", $db_other_copy_file, O_RDWR|O_CREAT, 0666, $DB_BTREE
     or die "Cannot open $db_other_copy_file: $!\n";

print "\nFinding duplicate filenames and calculating their MD5 digests\n";

my $file_counter = 0;
my $percent_complete = 0;

while (my $path = <FILES>) {

  $file_counter++;

  # remove trailing whitespace
  chomp($path);

  # extract filename from path
  my ($vol,$dir,$filename) = File::Spec->splitpath($path);

  # calculate the file's MD5 digest
  open(FILE, $path) or die "Can't open $path: $!";
  binmode(FILE);
  my $md5digest = Digest::MD5->new->addfile(*FILE)->hexdigest;
  close(FILE);

  # filename not stored as duplicate
  if (!exists($duplicate_count{$filename})) {
    # assume unique
    $unique{$md5digest} = $path;
    # which implies 0 duplicates
    $duplicate_count{$filename} = 0;
  }
  # filename already found
  else {
    # delete unique record
    delete($unique{$md5digest});
    # second duplicate
    if ($duplicate_count{$filename}) {
      $duplicate_count{$filename}++;
    }
    # first duplicate
    else {
      $duplicate_count{$filename} = 1;
    }
    # the master copy is already assigned
    if (exists($master_copy{$md5digest})) {
      # the current path matches $path_pref, so becomes our new master copy
      if ($path =~ qq|^$path_pref|) {
        $master_copy{$md5digest} = $path;
      }
      else {
        # this one is a secondary copy
        $other_copy{$path} = $md5digest;
        # store with path as key, as there are duplicate digests
      }
    }
    # assume this is the master copy
    else {
      $master_copy{$md5digest} = $path;
    }
  }
  $percent_complete = int(($file_counter/$total_files)*100);
  printf("\rProgress: $percent_complete %%");
}

close(FILES);    

# Write out data to text files for debugging

open (UNIQUE, "> $base/unique.txt");
open (UNIQUE_MD5, "> $base/unique_md5.txt");

print "\n\nUnique files: ",scalar keys %unique,"\n";

foreach my $key (keys %unique) {
  print UNIQUE "$key\t", $unique{$key}, "\n";
  print UNIQUE_MD5 "$key\n";
}

close UNIQUE;
close UNIQUE_MD5;

open (MASTER, "> $base/master_copy.txt");
open (MASTER_MD5, "> $base/master_copy_md5.txt");

print "Master copies: ",scalar keys %master_copy,"\n";

foreach my $key (keys %master_copy) {
  print MASTER "$key\t", $master_copy{$key}, "\n";
  print MASTER_MD5 "$key\n";
}

close MASTER;
close MASTER_MD5;

open (OTHER, "> $base/other_copy.txt");
open (OTHER_MD5, "> $base/other_copy_md5.txt");

print "Other copies: ",scalar keys %other_copy,"\n";

foreach my $key (keys %other_copy) {
  print OTHER $other_copy{$key}, "\t$key\n";
  print OTHER_MD5 "$other_copy{$key}\n";
}

close OTHER;
close OTHER_MD5;

print "\n";

untie %duplicate_count;
untie %unique;
untie %master_copy;
untie %other_copy;

print "\n";

Looking at the algorithm, I think I see why you are leaking files. The first time you encounter a file copy, you label it "unique":

if (!exists($duplicate_count{$filename})) {
   # assume unique
   $unique{$md5digest} = $path;
   # which implies 0 duplicates
   $duplicate_count{$filename} = 0;
}

The next time, you delete that unique record, without storing the path:

 # delete unique record
delete($unique{$md5digest});

So whatever filepath was at $unique{$md5digest}, you've lost it, and it won't be included in unique + other + master.

You'll need something like:

if (my $original_path = delete $unique{$md5digest}) {
    # Where should this one go?
}
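
One way the recovered path might be handled, reusing the %master_copy and %other_copy hashes the script already has. This is only a sketch of the idea, not a drop-in fix; the existing master/other classification of the current $path would still run afterwards:

if (my $original_path = delete $unique{$md5digest}) {
    # The path that was parked in %unique still has to be counted somewhere:
    # make it the master if no master exists for this digest yet, otherwise
    # record it as another copy.
    if (exists $master_copy{$md5digest}) {
        $other_copy{$original_path} = $md5digest;
    }
    else {
        $master_copy{$md5digest} = $original_path;
    }
}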

Also, as I mentioned in a comment above, IO::File would really clean up this code.
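
For instance, the digest step might look roughly like this (a sketch only; IO::File ships with Perl, and the lexical handle closes itself when it goes out of scope):

use IO::File;

my $fh = IO::File->new($path, 'r')
    or die "Can't open $path: $!";
$fh->binmode;                      # MD5 should see raw bytes
my $md5digest = Digest::MD5->new->addfile($fh)->hexdigest;
$fh->close;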

This isn't really a response to the larger logic of the program, but you should be checking for errors in open every time (and while we're at it, why not use the more modern form of open with lexical filehandles and three arguments):

open my $unique, '>', "$base/unique.txt"
  or die "Can't open $base/unique.txt for writing: $!";

If you don't want to explicitly ask each time, you could also check out the autodie module.
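
A sketch of what that looks like (autodie is bundled with modern Perls; the handle name $unique_fh is just illustrative):

use strict;
use warnings;
use autodie;    # open, close, etc. now throw an exception on failure

my $base = '/var/backup/test';
open my $unique_fh, '>', "$base/unique.txt";   # no explicit "or die" needed
print {$unique_fh} "md5digest\tpath\n";        # placeholder line for illustration
close $unique_fh;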

One apparent optimization is to use file size as an initial comparison basis, and only compute MD5 for files below a certain size or if you have a collision of two files with the same size. The larger a given file is on disk, the more costly the MD5 computation, but also the less likely its exact size will conflict with another file on the system. You can probably save yourself a lot of runtime that way.
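
A sketch of that idea, reading from the script's existing FILES handle (the %paths_by_size hash is illustrative and could be tied to DB_File like the others):

# Group paths by file size first; only size collisions are worth an MD5 digest.
my %paths_by_size;
while (my $path = <FILES>) {
    chomp $path;
    my $size = -s $path;
    next unless defined $size;                 # file may have vanished
    push @{ $paths_by_size{$size} }, $path;
}

for my $size (keys %paths_by_size) {
    my @candidates = @{ $paths_by_size{$size} };
    next if @candidates < 2;                   # unique size => unique file, skip hashing
    for my $path (@candidates) {
        # ... compute the MD5 digest and classify master/other copies as before ...
    }
}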

You also might want to consider changing your approach for certain kinds of files that contain embedded metadata that might change without changing the underlying data, so you can find additional dupes even if the MD5s don't match. I'm speaking of course of MP3 or other music files that have metadata tags that might be updated by classifiers or player programs, but which otherwise contain the same audio bits.
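
For MP3s specifically, a crude way to do that is to skip a leading ID3v2 tag before hashing. The audio_digest helper below is only a sketch of the idea (it ignores ID3v1 trailers, APE tags and ID3v2 footers); a CPAN module such as MP3::Tag would be more robust in practice:

sub audio_digest {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open $path: $!";
    binmode $fh;

    my $header = '';
    read($fh, $header, 10);
    if (length($header) == 10 && substr($header, 0, 3) eq 'ID3') {
        # bytes 6..9 hold a 28-bit "syncsafe" tag size, 7 bits per byte
        my @s = unpack 'C4', substr($header, 6, 4);
        my $tag_size = ($s[0] << 21) | ($s[1] << 14) | ($s[2] << 7) | $s[3];
        seek($fh, 10 + $tag_size, 0);          # start hashing after the tag
    }
    else {
        seek($fh, 0, 0);                       # no ID3v2 tag: hash the whole file
    }
    return Digest::MD5->new->addfile($fh)->hexdigest;
}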

See here for related discussion of solutions in the abstract:

https://stackoverflow.com/questions/405628/what-is-the-best-method-to-remove-duplicate-image-files-from-your-computer

IMPORTANT note: as much as we'd like to believe that two files with the same MD5 are the same file, that is not necessarily true. If your data means anything to you, once you've broken it down to a list of candidates that MD5 tells you are the same file, you need to run through every bit of those files linearly to check that they are in fact the same.

Put this way: given a hash function (which MD5 is) of size 1 bit, there are only 2 possible combinations:

0 1

If your hash function told you that 2 files both returned a "1", you would not assume they are the same file.

Given a hash of 2 bits, there are only 4 possible combinations:

 00  01 10 11 

You would not assume that 2 files returning the same value are the same file.

Given a hash of 3 bits, there are only 8 possible combinations:

 000 001 010 011 
 100 101 110 111

You would not assume that 2 files returning the same value are the same file.

This pattern goes on in ever-increasing amounts, to the point that, for some bizarre reason, people start putting "chance" into the equation. Even at 128 bits (MD5), 2 files sharing the same hash does not mean they are in fact the same file. The only way to know is by comparing every bit.

There is a minor optimization if you read them from start to end, because you can stop reading as soon as you find a differing bit, but to confirm they are identical you need to read every bit.
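
In Perl, that final check can be as simple as the core File::Compare module, which reads both files in chunks and stops at the first difference ($master_path and $candidate_path are illustrative names):

use File::Compare;

# compare() returns 0 if the byte streams are identical, 1 if they differ, -1 on error.
if (compare($master_path, $candidate_path) == 0) {
    # genuinely identical: safe to treat $candidate_path as an "other copy"
}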
