
What is the faster way to iterate over mongo data? Find in batch vs iterate with cursor?

I have over 10 million records in my mongo collection that I want to move to some other database.

There are two ways I can achieve that:

Batching data with find

const batchSize = 1000;
const collection = mongo.client.collection('test');
// Count once up front so we know how many pages to fetch.
const count = await collection.countDocuments();
let iter = 0;
while (iter * batchSize <= count) {
  const dataArr = await collection.find({})
                  .sort({ _id: -1 })
                  .limit(batchSize)
                  .skip(iter * batchSize)
                  .toArray();
  // ... move dataArr to the other database ...
  iter += 1;
}

Using a mongo cursor

const cursor = collection.find({});
const batchSize = 1000;
let done = 0;
while (await cursor.hasNext()) {
  const ids = [];
  // Pull up to batchSize documents off the cursor, one at a time.
  for (let i = 0; i < batchSize; i += 1) {
    if (await cursor.hasNext()) {
      ids.push((await cursor.next())._id);
    }
  }
  done += ids.length;
}

In the first method, I am making a single request for every 1000 documents, whereas in the second I am making two requests for every single document. Which method is better in terms of speed and computation?

The first method is better because, as you said, you are making just one call per 1000 documents, so you save all the network traffic that would be generated by fetching documents one by one. The second method would spend a lot of time on the network, since it fetches documents one at a time.
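
As a side note, with the official Node.js driver the cursor also fetches documents from the server in batches, and the per-round-trip batch size can be tuned with batchSize(). A minimal sketch, assuming the standard FindCursor API; the processing step is illustrative:

// The driver buffers documents client-side; batchSize() controls how many
// documents each getMore round-trip returns, so next() does not have to
// cost one network request per document.
const cursor = collection.find({}).batchSize(1000);
while (await cursor.hasNext()) {
  const doc = await cursor.next(); // usually served from the driver's buffer
  // ... process doc ...
}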

Some tips:

  1. It is never a good idea to use skip in mongo queries, because according to the mongodb documentation:

    The cursor.skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, cursor.skip() will become slower.

  2. Set the batch size to something just less than 16MB / (the average size of your documents). This is because mongoDB has a 16MB limit on the response size. This way you can minimize the number of calls you make. A sketch of this calculation follows the list.

  3. If you can use multi-threading, divide the data into 7 groups, get the _ids at the interval boundaries, and use those ids to create range conditions. Then you can remove sort, limit and skip. This will have a huge impact on performance. A sketch of this range-based split also follows the list.
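
For tip 2, a minimal sketch of deriving such a batch size, assuming a driver version where collection.stats() is available and reports avgObjSize in bytes (the helper name pickBatchSize is illustrative):

// Illustrative helper: choose a batch size that keeps each server
// response under MongoDB's 16MB reply limit.
async function pickBatchSize(collection) {
  const { avgObjSize } = await collection.stats(); // average document size in bytes
  const maxResponseBytes = 16 * 1024 * 1024;
  // Keep some headroom so slightly oversized documents still fit.
  return Math.max(1, Math.floor((maxResponseBytes * 0.9) / avgObjSize));
}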
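For tip 3, a rough sketch of the range-based split. The group count of 7 comes from the answer above; everything else (names, the boundary-finding approach) is illustrative, not from the original:

// Split the _id space into N ranges and stream each range with a plain
// filter, so the per-batch sort/limit/skip is no longer needed.
const numWorkers = 7;
const total = await collection.countDocuments();
const step = Math.ceil(total / numWorkers);

// Fetch the _id at each interval boundary (skip is used once per
// boundary here, not once per batch).
const boundaries = [];
for (let i = 1; i < numWorkers; i += 1) {
  const [doc] = await collection.find({}, { projection: { _id: 1 } })
    .sort({ _id: 1 })
    .skip(i * step)
    .limit(1)
    .toArray();
  if (doc) boundaries.push(doc._id);
}

// Build range filters: (-inf, b0], (b0, b1], ..., (bLast, +inf).
const ranges = [];
let lower = null;
for (const b of boundaries) {
  ranges.push(lower ? { _id: { $gt: lower, $lte: b } } : { _id: { $lte: b } });
  lower = b;
}
ranges.push(lower ? { _id: { $gt: lower } } : {});

// Each worker iterates its own range independently, with no sort/limit/skip.
await Promise.all(ranges.map(async (filter) => {
  const cursor = collection.find(filter);
  for await (const doc of cursor) {
    // ... move doc to the other database ...
  }
}));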
