
What is the faster way to iterate over Mongo data: `find` in batches, or iterating with a cursor?

I have over 10 million records in my Mongo collection that I want to move to another database.

There are two ways I can achieve that:

Batching data with find

const batchSize = 1000;
const collection = mongo.client.collection('test');
const count = await collection.count();
let iter = 0;
while (iter * batchSize < count) {
  // Page through the collection with skip/limit
  const dataArr = await collection.find({})
                  .sort({ _id: -1 })
                  .limit(batchSize)
                  .skip(iter * batchSize)
                  .toArray();
  iter += 1;
}

Using mongo cursor

const batchSize = 1000;
let done = 0;
// `cursor` is assumed to come from collection.find({}); this runs inside a
// co-style generator, hence the `yield`s.
while (yield cursor.hasNext()) {
  const ids = [];
  for (let i = 0; i < batchSize; i += 1) {
    if (yield cursor.hasNext()) {
      ids.push((yield cursor.next())._id);
    }
  }
  done += ids.length; // count only what was actually fetched
}

In the first method, I am making a single request for every 1000 documents, whereas in the second one I am making two requests (`hasNext` and `next`) for every single document. Which method is better in terms of speed and computation?

The first method is better because, as you said, you are making just one call per 1000 documents, so you save all the network round trips that would be generated by fetching documents one by one. The second method spends far more time on the network since it fetches documents individually.

Some tips:

  1. It is never a good idea to use skip in Mongo queries because, according to the MongoDB documentation:

    The cursor.skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, cursor.skip() will become slower.

  2. Set the batch size to something just under 16MB / (the average size of your documents). MongoDB has a 16MB limit on the response size, so this minimizes the number of calls you make.

  3. If you can use multi-threading, divide the data into groups (say 7), fetch the `_id`s at the interval boundaries, and use those ids to build range conditions. You can then drop `sort`, `limit`, and `skip`. This can make a huge difference in performance.
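Tip 1's skip problem is usually avoided with keyset (range) pagination: instead of skipping, remember the last `_id` of the previous batch and query for documents beyond it. A sketch, assuming the descending `_id` sort from the question (`nextFilter` is an illustrative helper, not a driver API):

```javascript
// Build the filter for the next page given the last _id seen so far.
// With .sort({ _id: -1 }), the next page is everything strictly below it.
function nextFilter(lastId) {
  return lastId === null ? {} : { _id: { $lt: lastId } };
}

// Driver usage (sketch): loop until a batch comes back empty.
// let lastId = null;
// while (true) {
//   const batch = await collection.find(nextFilter(lastId))
//     .sort({ _id: -1 }).limit(1000).toArray();
//   if (batch.length === 0) break;
//   lastId = batch[batch.length - 1]._id;
// }
```

Because every query seeks directly into the `_id` index, each batch costs the same regardless of how deep into the collection you are.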
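The batch-size arithmetic from tip 2 can be sketched as below. The average document size would come from the collection's stats (e.g. `avgObjSize`); the safety factor is my assumption, added to leave headroom for per-response metadata.

```javascript
// Rough batch-size estimate under the 16MB response cap.
function estimateBatchSize(avgDocBytes, safetyFactor = 0.9) {
  const MAX_RESPONSE_BYTES = 16 * 1024 * 1024; // 16MB
  return Math.max(1, Math.floor((MAX_RESPONSE_BYTES * safetyFactor) / avgDocBytes));
}
```

For example, with 1KB documents this suggests batches of roughly 14,700 documents rather than 1000.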
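Tip 3's range conditions can be built with a small pure helper like the following. This is a sketch under my assumptions: `rangeFilters` is an illustrative name, and the boundary `_id`s would be sampled once up front (e.g. one `skip`/`limit` query per split point); each worker thread then runs a plain `find` on its own filter.

```javascript
// Given sorted boundary _ids (one per split point), build N+1 range filters
// that together cover the whole collection without overlap.
function rangeFilters(boundaries) {
  const filters = [];
  let prev = null;
  for (const b of boundaries) {
    filters.push(prev === null ? { _id: { $lt: b } }
                               : { _id: { $gte: prev, $lt: b } });
    prev = b;
  }
  // Final open-ended range (or the whole collection if no boundaries).
  filters.push(prev === null ? {} : { _id: { $gte: prev } });
  return filters;
}
```

Each filter hits the `_id` index directly, so no worker needs `sort`, `limit`, or `skip`.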
