Best way to compare changes in millions of mongoDB records

I am working on a project where I store DNS records of millions of websites, and I need to monitor and update changes in this data periodically. The data is stored in MongoDB as follows:

{
  "domain": "www.google.com",
  "IP": [
    {
      "value": "216.58.198.78",
      "first_seen": "2020-02-01 00:00:00",
      "last_seen": "2020-02-10 00:00:00"
    },
    {
      "value": "216.58.198.75",
      "first_seen": "2020-02-11 00:00:00",
      "last_seen": "2020-02-25 00:00:00"
    },
    ...
  ]
}

I run periodic scans to get new domains and fresh DNS records, and I would like to know the best way to compare them with the data stored in the DB and update it.

What I am thinking of doing is the following:

  1. Retrieve all records from the DB (I do not think this is good at all)
  2. Store the retrieved data in a Python dictionary keyed by domain
  3. Loop through the fresh records
  4. If the domain exists in the dictionary, compare changes and apply the necessary updates to the dictionary
  5. If the domain does not exist, add it to the dictionary
  6. Drop the collection?
  7. Perform a bulk write operation to store the new values

This sounds terrible in terms of performance and memory consumption (we would be holding millions of records in memory), but I am not sure whether the alternative (query then update) would do any better, because we would need to perform millions of individual transactions.
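For concreteness, here is a rough pymongo sketch of what I imagine the batched "query then update" variant looking like, using bulk upserts so that each scan record costs a few write operations instead of a full collection reload. The connection string, the database/collection names and the shape of fresh_scan_results are assumptions on my part, and the sketch does not handle an IP that disappears and later comes back as a new interval:

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
col = client["dns"]["records"]                       # assumed database/collection names

# Assumed shape of one scan result; in practice this would stream from the scanner.
fresh_scan_results = [
    {"domain": "www.google.com", "ip": "216.58.198.75", "seen": "2020-02-26 00:00:00"},
]

BATCH = 1000   # operations per bulk_write round trip
ops = []
for rec in fresh_scan_results:
    # 1) Make sure a document for the domain exists (no-op if it already does).
    ops.append(UpdateOne({"domain": rec["domain"]},
                         {"$setOnInsert": {"IP": []}},
                         upsert=True))
    # 2) Append the IP entry only if this value is not stored yet.
    ops.append(UpdateOne({"domain": rec["domain"], "IP.value": {"$ne": rec["ip"]}},
                         {"$push": {"IP": {"value": rec["ip"],
                                           "first_seen": rec["seen"],
                                           "last_seen": rec["seen"]}}}))
    # 3) Refresh last_seen on the matching IP entry.
    ops.append(UpdateOne({"domain": rec["domain"], "IP.value": rec["ip"]},
                         {"$set": {"IP.$.last_seen": rec["seen"]}}))
    if len(ops) >= BATCH:
        col.bulk_write(ops)   # ordered=True (default) keeps the three steps in sequence
        ops = []
if ops:
    col.bulk_write(ops)

With a batch size of around 1000 this keeps memory usage flat and avoids one round trip per record, but I am not sure it is the right approach at this scale.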

I would appreciate it if you could provide some insight into the best way to achieve this, or point me to areas of research that might help.

Thanks

The normal practice is to add a data field (e.g. "NeedUpdate") to the database table.

On creating a new record, "NeedUpdate" will be set to "ON" for that record.

Upon updating an existing record, "NeedUpdate" will be set to "ON" as well.

After that, you can run a cron job (or any periodic scan) to process the records with "NeedUpdate" = "ON" (and after processing, set "NeedUpdate" back to "").

In that case the system only needs to process the records that actually require an update.
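A minimal pymongo sketch of this flag-based flow, using the field name "NeedUpdate" from above (the connection string, the database/collection names and the example domain/IP are placeholders, and the actual comparison logic is omitted):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
col = client["dns"]["records"]                       # assumed database/collection names

# On creating or updating a record, switch the flag on in the same write.
col.update_one(
    {"domain": "www.example.com"},                    # example domain
    {"$set": {"NeedUpdate": "ON"},
     "$push": {"IP": {"value": "93.184.216.34",
                      "first_seen": "2020-03-01 00:00:00",
                      "last_seen": "2020-03-01 00:00:00"}}},
    upsert=True,
)

# Cron job / periodic scan: handle only the flagged records, then clear the flag.
for doc in col.find({"NeedUpdate": "ON"}):
    # ... compare the fresh DNS data against doc and apply changes here ...
    col.update_one({"_id": doc["_id"]}, {"$set": {"NeedUpdate": ""}})

An index on the "NeedUpdate" field would keep the periodic scan from touching the unflagged documents at all.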
