Best way to compare changes in millions of mongoDB records

I am working on a project where I store DNS records of millions of websites, and I need to monitor and update changes in this data periodically. The data is stored in MongoDB as follows:

{
  "domain": "www.google.com",
  "IP": [
    {
      "value": "216.58.198.78",
      "first_seen": "2020-02-01 00:00:00",
      "last_seen": "2020-02-10 00:00:00"
    },
    {
      "value": "216.58.198.75",
      "first_seen": "2020-02-11 00:00:00",
      "last_seen": "2020-02-25 00:00:00"
    },
    ...
  ]
}

I run periodic scans to get new domains and fresh DNS records, and I would like to know the best way to compare them with the data stored in the DB and update it.

What I am thinking of doing is the following:

  1. Retrieve all records from the DB (I do not think this is good at all)
  2. Store the retrieved data in a Python dictionary keyed by domain
  3. Loop through the fresh records
  4. If the domain exists in the dictionary, compare changes and apply the necessary updates to the dictionary
  5. If the domain does not exist, add it to the dictionary
  6. Drop the collection?
  7. Perform a bulk write operation to store the new values

This sounds terrible in terms of performance and memory consumption (we would be holding millions of records in memory), but I am not sure whether the alternative (query then update) would do any better, because we would need to perform millions of individual transactions.
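For concreteness, here is a rough pymongo sketch of what I imagine the batched "query then update" variant looking like, using bulk upserts so that each scan record costs a few write operations instead of a full collection reload. The connection string, the database/collection names and the shape of fresh_scan_results are assumptions on my part, and the sketch does not handle an IP that disappears and later comes back as a new interval:

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
col = client["dns"]["records"]                       # assumed database/collection names

# Assumed shape of one scan result; in practice this would stream from the scanner.
fresh_scan_results = [
    {"domain": "www.google.com", "ip": "216.58.198.75", "seen": "2020-02-26 00:00:00"},
]

BATCH = 1000   # operations per bulk_write round trip
ops = []
for rec in fresh_scan_results:
    # 1) Make sure a document for the domain exists (no-op if it already does).
    ops.append(UpdateOne({"domain": rec["domain"]},
                         {"$setOnInsert": {"IP": []}},
                         upsert=True))
    # 2) Append the IP entry only if this value is not stored yet.
    ops.append(UpdateOne({"domain": rec["domain"], "IP.value": {"$ne": rec["ip"]}},
                         {"$push": {"IP": {"value": rec["ip"],
                                           "first_seen": rec["seen"],
                                           "last_seen": rec["seen"]}}}))
    # 3) Refresh last_seen on the matching IP entry.
    ops.append(UpdateOne({"domain": rec["domain"], "IP.value": rec["ip"]},
                         {"$set": {"IP.$.last_seen": rec["seen"]}}))
    if len(ops) >= BATCH:
        col.bulk_write(ops)   # ordered=True (default) keeps the three steps in sequence
        ops = []
if ops:
    col.bulk_write(ops)

With a batch size of around 1000 this keeps memory usage flat and avoids one round trip per record, but I am not sure it is the right approach at this scale.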

I would appreciate it if you could provide some insight into the best way to achieve this, or point me to areas of research that might help.

Thanks

The normal practice is to add a data field (e.g. "NeedUpdate") to the database table.

On creating a new record, "NeedUpdate" will be set to "ON" for that record.

Upon updating an existing record, "NeedUpdate" will be set to "ON" as well.

After that, you can run a cron job (or any periodic scan) to process the records with "NeedUpdate" = "ON" (and after processing, set "NeedUpdate" back to "").

In that case the system only needs to process the records that actually require an update.
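A minimal pymongo sketch of this flag-based flow, using the field name "NeedUpdate" from above (the connection string, the database/collection names and the example domain/IP are placeholders, and the actual comparison logic is omitted):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # assumed connection string
col = client["dns"]["records"]                       # assumed database/collection names

# On creating or updating a record, switch the flag on in the same write.
col.update_one(
    {"domain": "www.example.com"},                    # example domain
    {"$set": {"NeedUpdate": "ON"},
     "$push": {"IP": {"value": "93.184.216.34",
                      "first_seen": "2020-03-01 00:00:00",
                      "last_seen": "2020-03-01 00:00:00"}}},
    upsert=True,
)

# Cron job / periodic scan: handle only the flagged records, then clear the flag.
for doc in col.find({"NeedUpdate": "ON"}):
    # ... compare the fresh DNS data against doc and apply changes here ...
    col.update_one({"_id": doc["_id"]}, {"$set": {"NeedUpdate": ""}})

An index on the "NeedUpdate" field would keep the periodic scan from touching the unflagged documents at all.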
