

Removing duplicates in MongoDB 4 from millions/billions of records

I'm currently building a database which will have millions or even billions of records. The issue is that the files I'm using are usually around 30GB each, and when you combine them there are duplicate records. I only have 64GB of RAM, so removing the duplicates by loading all the lines into memory is not possible. I've tried a unique index, but inserting gets really slow after a while. Is there any way to remove the duplicates efficiently?

Record example:

{
    "_id": {
        "$oid": "5fabbb10364524e054d629b4"
    },
    "hash": "599e7b7fb49c772d93b7fc96020d9a13",
    "cleartext": "starocean40"
}

You don't have to keep the whole dataset in memory to find duplicates; instead, you can just store a set of record hashes.

MD5, for instance, uses 128-bit hashes. Assuming 1,000,000 records, this amounts to about 16MB plus some overhead. Mind you, you would still need to compare the records whose hashes match, since it's possible for two differing records to have the same hash.
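To put that estimate in perspective, here is a minimal Python check (the figures are just the ones mentioned above):

import hashlib

digest = hashlib.md5(b"starocean40").digest()
print(len(digest))                  # 16 bytes per raw MD5 digest
print(16 * 1_000_000 / 1024 ** 2)   # ~15.3 MB of digest data for a million records
# Note: a Python set adds per-entry overhead on top of the raw 16 bytes.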

So, when importing the files, you would compute a hash of each record and check it against a Python set of previously-seen hashes.

If a matching hash is found, you would scan the whole DB to double-check that a matching record actually exists.

If a matching hash is not found, you can be 100% sure this record has not yet been imported, so you can import it and store its hash in the in-memory set of hashes.
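A minimal sketch of that import loop using pymongo (the database, collection, and file names are hypothetical, and the duplicate key is assumed to be the hash/cleartext pair):

import hashlib
import json
from pymongo import MongoClient

coll = MongoClient()["mydb"]["records"]   # hypothetical database/collection names

seen = set()                              # raw 16-byte MD5 digests of imported records

def record_key(rec):
    # Hash only the fields that define a duplicate, in a stable order.
    return hashlib.md5(f"{rec['hash']}:{rec['cleartext']}".encode()).digest()

def import_record(rec):
    key = record_key(rec)
    if key in seen:
        # Possible duplicate: confirm against the DB before skipping, since two
        # different records could in principle share the same MD5 digest.
        if coll.count_documents({"hash": rec["hash"], "cleartext": rec["cleartext"]}, limit=1):
            return  # genuine duplicate, skip it
    # Insert only the payload fields and let MongoDB assign a fresh _id.
    coll.insert_one({"hash": rec["hash"], "cleartext": rec["cleartext"]})
    seen.add(key)

with open("dump.jsonl") as f:             # hypothetical input: one JSON record per line
    for line in f:
        import_record(json.loads(line))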

Alternatively, you can use Mongo's hashed indexes for a similar effect.
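For reference, a hashed index can be created from pymongo as below; note that hashed indexes speed up equality lookups on the field but cannot be declared unique, so an application-side duplicate check is still needed:

from pymongo import MongoClient, HASHED

coll = MongoClient()["mydb"]["records"]   # hypothetical names, as above
# A hashed index on "hash" keeps equality lookups fast during the duplicate check.
coll.create_index([("hash", HASHED)])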
