
Checking for document duplicates and similar documents in a document management application

Update: I have now written a PHP extension called php_ssdeep for the ssdeep C API to facilitate fuzzy hashing and hash comparisons in PHP natively. More information can be found over at my blog. I hope this is helpful to people.

I am involved in writing a custom document management application in PHP on a Linux box that will store various file formats (potentially thousands of files), and we need to be able to check whether a text document has been uploaded before, to prevent duplication in the database.

Essentially, when a user uploads a new file we would like to be able to present them with a list of files that are either duplicates or contain similar content. This would then allow them to choose one of the pre-existing documents or continue uploading their own.

Similar documents would be determined by looking through their content for similar sentences and perhaps a dynamically generated list of keywords. We could then display a percentage match to the user to help them find the duplicates.

Can you recommend any packages for this process, and do you have any ideas about how you might have done this in the past?

I think the direct duplicate check can be done by getting all the text content and:

  • Stripping whitespace
  • Removing punctuation
  • Converting to lower or upper case

then form an MD5 hash to compare with any new documents. Stripping those items out should help prevent duplicates going undetected if, for example, the user edits a document to add extra paragraph breaks. Any thoughts?
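
As a rough illustration of that idea, here is a minimal PHP sketch of the normalize-then-hash step. The function name, the exact normalization rules and the table/column names are my own assumptions for the example, not an existing implementation:

    <?php
    // Normalize extracted text so documents that differ only in whitespace,
    // punctuation or case collapse to the same fingerprint.
    function normalized_md5(string $text): string
    {
        $text = mb_strtolower($text, 'UTF-8');               // case-fold
        $text = preg_replace('/[[:punct:]]+/u', '', $text);  // drop punctuation
        $text = preg_replace('/\s+/u', '', $text);           // drop whitespace
        return md5($text);
    }

    // On upload: an existing row with the same fingerprint is an exact duplicate.
    // $pdo is assumed to be a PDO connection to the document database, and
    // $extractedText the plain text pulled out of the uploaded file.
    $fingerprint = normalized_md5($extractedText);
    $stmt = $pdo->prepare('SELECT id, filename FROM documents WHERE content_md5 = ?');
    $stmt->execute([$fingerprint]);
    $duplicates = $stmt->fetchAll();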

This process could also potentially run as a nightly job, and we could notify the user of any duplicates when they next log in if the computational requirement is too great to run in real time. Real time would be preferred, however.

I have found a program that does what its creator, Jesse Kornblum, calls "Fuzzy Hashing". Very basically, it makes hashes of a file that can be used to detect similar files or identical matches.

The theory behind it is documented here: Identifying almost identical files using context triggered piecewise hashing

ssdeep is the name of the program and it can be run on Windows or Linux. It was intended for use in forensic computing, but it seems suited enough to our purposes. I have done a short test on an old Pentium 4 machine and it takes about 3 seconds to go through a hash file of 23MB (hashes for just under 135,000 files) looking for matches against two files. That time includes creating hashes for the two files I was searching against as well.
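
To give a feel for how this could be wired up in PHP, here is a rough sketch using the fuzzy-hashing functions exposed by the PECL ssdeep extension mentioned in the update (ssdeep_fuzzy_hash_filename and ssdeep_fuzzy_compare); the table layout, column names and similarity threshold are assumptions for illustration:

    <?php
    // Hash the newly uploaded file and compare it against previously stored
    // fuzzy hashes. ssdeep_fuzzy_compare() returns a 0-100 match score,
    // where 100 means the content is effectively identical.
    $newHash = ssdeep_fuzzy_hash_filename($uploadedPath);

    $matches = [];
    foreach ($pdo->query('SELECT id, filename, fuzzy_hash FROM documents') as $row) {
        $score = ssdeep_fuzzy_compare($newHash, $row['fuzzy_hash']);
        if ($score >= 50) {                 // threshold chosen arbitrarily
            $matches[$row['id']] = $score;  // candidate duplicate to show the user
        }
    }
    arsort($matches);                       // best matches first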

I'm working on a similar problem in web2project and, after asking around and digging, I came to the conclusion that "the user doesn't care". Having duplicate documents doesn't matter to the user, as long as they can find their own document by its own name.

That being said, here's the approach I'm taking:

  • Allow a user to upload a document, associating it with whichever Projects/Tasks they want;
  • The file should be renamed to prevent someone getting at it via HTTP, or better, stored outside the web root (a rough sketch follows this list). The user will still see their filename in the system, and if they download it, you can set the headers with the "proper" filename;
  • At some point in the future, process the document to see if there are duplicates... at this point, though, we are not modifying the document. After all, there could be important reasons the whitespace or capitalization is changed;
  • If there are dupes, delete the new file and then link to the old one;
  • If there aren't dupes, do nothing;
  • Index the file for search terms - depending on the file format, there are lots of options, even for Word docs;
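
Here is a sketch of the rename-and-store-outside-the-web-root step mentioned above; the storage path, table names and random-name scheme are assumptions, not web2project's actual code:

    <?php
    // Upload side: store the file under an opaque name outside the web root,
    // keeping the user's original filename only in the database.
    $storageDir = '/var/docstore';                    // not reachable via HTTP
    $storedName = bin2hex(random_bytes(16));          // opaque on-disk name
    move_uploaded_file($_FILES['doc']['tmp_name'], "$storageDir/$storedName");

    $stmt = $pdo->prepare('INSERT INTO documents (stored_name, original_name) VALUES (?, ?)');
    $stmt->execute([$storedName, $_FILES['doc']['name']]);

    // Download side: look up both names and send the "proper" filename in the headers.
    $stmt = $pdo->prepare('SELECT stored_name, original_name FROM documents WHERE id = ?');
    $stmt->execute([$_GET['id']]);
    $doc = $stmt->fetch();

    header('Content-Type: application/octet-stream');
    header('Content-Disposition: attachment; filename="' . basename($doc['original_name']) . '"');
    readfile($storageDir . '/' . $doc['stored_name']);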

Throughout all of this, we don't tell the user it was a duplicate... they don't care. It's us (developers, DB admins, etc.) that care.

And yes, this works even if they upload a new version of the file later. First, you delete the reference to the file, then - just like in garbage collection - you only delete the old file if there are zero references to it.
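
A small sketch of that reference-counting cleanup (table and column names are again assumptions):

    <?php
    // Drop one reference to a document; physically delete the stored file only
    // when no references remain, garbage-collection style.
    function remove_document_reference(PDO $pdo, int $refId): void
    {
        $stmt = $pdo->prepare('SELECT document_id FROM document_refs WHERE id = ?');
        $stmt->execute([$refId]);
        $docId = $stmt->fetchColumn();

        $pdo->prepare('DELETE FROM document_refs WHERE id = ?')->execute([$refId]);

        $stmt = $pdo->prepare('SELECT COUNT(*) FROM document_refs WHERE document_id = ?');
        $stmt->execute([$docId]);

        if ((int) $stmt->fetchColumn() === 0) {
            // Zero references left: remove the file on disk and its row.
            $stmt = $pdo->prepare('SELECT stored_name FROM documents WHERE id = ?');
            $stmt->execute([$docId]);
            @unlink('/var/docstore/' . $stmt->fetchColumn());
            $pdo->prepare('DELETE FROM documents WHERE id = ?')->execute([$docId]);
        }
    }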
