
Data Deduplication In Cloud With Java

I am trying to implement a data deduplication program in the cloud using Java.

I'm not sure how to proceed with the implementation.

First, I wanted to do a simple comparison of file size, date, and name. However, this is ineffective, since two files might have the same content but different names.

I have decided on a simple pipeline: file upload -> file chunking -> Rabin-Karp hashing -> determine whether the file needs to be uploaded.
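For reference, here is a minimal sketch of how a Rabin-Karp style rolling hash can drive content-defined chunking: a chunk boundary is declared wherever the low bits of the rolling hash are zero, which yields variable-size chunks with a predictable average size. The class name, window size, base, and mask below are illustrative choices, not taken from any particular library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

/**
 * Minimal sketch of content-defined chunking with a Rabin-Karp style
 * rolling hash. Names and constants are illustrative.
 */
public class RollingChunker {

    private static final int WINDOW = 48;            // bytes in the rolling window
    private static final long BASE = 257;            // polynomial base
    private static final long MOD = (1L << 31) - 1;  // large prime modulus
    private static final long MASK = (1 << 13) - 1;  // ~8 KiB average chunk size

    /** Returns the byte offsets where chunks end. */
    public static List<Integer> chunkBoundaries(byte[] data) {
        List<Integer> boundaries = new ArrayList<>();
        long hash = 0;
        long pow = 1;                                 // BASE^(WINDOW-1) % MOD
        for (int i = 0; i < WINDOW - 1; i++) {
            pow = (pow * BASE) % MOD;
        }

        for (int i = 0; i < data.length; i++) {
            int in = data[i] & 0xFF;
            if (i >= WINDOW) {
                int out = data[i - WINDOW] & 0xFF;
                hash = (hash - (out * pow) % MOD + MOD) % MOD; // drop oldest byte
            }
            hash = (hash * BASE + in) % MOD;          // add newest byte

            // Declare a boundary when the low bits of the hash are all zero.
            if (i >= WINDOW - 1 && (hash & MASK) == 0) {
                boundaries.add(i + 1);
            }
        }
        if (boundaries.isEmpty() || boundaries.get(boundaries.size() - 1) != data.length) {
            boundaries.add(data.length);              // final partial chunk
        }
        return boundaries;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java RollingChunker <file>
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("Chunk boundaries: " + chunkBoundaries(data));
    }
}
```

Each chunk would then be hashed with a cryptographic digest (e.g. SHA-256) and looked up in an index of already-stored chunks to decide whether it actually needs to be uploaded.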

Will this work, or are there any improvements I should make?

Where would I be able to find out more information on this? I have tried looking around the Internet, but I can't find anything. Most of what I found is just broken down into specific implementations, without any explanation or details on file chunking or Rabin-Karp hashing.

I would also like to know which Java libraries I should look into for this program.

It would be easier if you stated your problem constraints. Assuming the following:

  • The smallest indivisible unit of data is a file
  • Files are small enough to fit in memory for computing hashes
  • Your files are in some cloud bucket or similar store where you can list them all; that also eliminates identical filenames.

You can probably narrow down your problem.

  1. Iterate through all the files using some fast hashing algorithm, like a basic CRC checksum, and build a map. (This can be easily parallelized.)
  2. Filter out all the files whose hashes collide. You can safely leave out the rest of the files, which for all practical purposes should be a pretty reasonable chunk of the data.
  3. Run through this remaining subset of files with a cryptographic hash (or, worst case, compare the entire files) and identify matches. A sketch of these steps follows below.
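As a rough illustration of steps 1-3 (assuming local files that fit in memory, Java 17+ for HexFormat, and illustrative class and method names), the sketch below groups files by a cheap CRC32 checksum, keeps only the groups that collide, and then confirms real duplicates with SHA-256:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HexFormat;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.zip.CRC32;

/**
 * Sketch of the three-step approach: group files by a cheap CRC32 checksum,
 * keep only the groups with collisions, then confirm duplicates with a
 * cryptographic hash (SHA-256). Paths are assumed to be local; for a cloud
 * bucket you would list and stream objects instead.
 */
public class DuplicateFinder {

    /** Step 1: cheap checksum of the whole file (parallelizable). */
    static long crc32Of(Path file) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(Files.readAllBytes(file)); // assumes files fit in memory
        return crc.getValue();
    }

    /** Step 3: strong hash, used only on the candidate subset. */
    static String sha256Of(Path file) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return HexFormat.of().formatHex(md.digest(Files.readAllBytes(file)));
    }

    /** Returns groups of paths whose contents are identical. */
    static Map<String, List<Path>> findDuplicates(List<Path> files) {
        // Step 1: build a CRC32 -> files map; parallelStream spreads out I/O and hashing.
        Map<Long, List<Path>> byCrc = files.parallelStream()
                .collect(Collectors.groupingBy(p -> {
                    try {
                        return crc32Of(p);
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                }));

        // Step 2: only checksum collisions can be duplicates; drop the rest.
        // Step 3: confirm with SHA-256 and group by the strong hash.
        return byCrc.values().stream()
                .filter(group -> group.size() > 1)
                .flatMap(List::stream)
                .collect(Collectors.groupingBy(p -> {
                    try {
                        return sha256Of(p);
                    } catch (Exception e) {
                        throw new RuntimeException(e);
                    }
                }))
                .entrySet().stream()
                .filter(e -> e.getValue().size() > 1)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));
    }
}
```

Both hashing passes are embarrassingly parallel, which is why the first pass uses parallelStream; the second pass touches only the small subset of collision candidates, so the expensive cryptographic hash is paid for as few files as possible.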

This can be refined depending on the underlying data.

In any case, this is how I would approach the problem; given its structure, it can easily be partitioned and solved in parallel. Feel free to elaborate more so that we can reach a good solution.
