
PyMongo cursor batch_size

With PyMongo 3.7.2 I'm trying to read a collection in chunks by using batch_size on the MongoDB cursor, as described here. The basic idea is to use the find() method on the collection object, with batch_size as a parameter. But whatever I try, the cursor always returns all documents in my collection.

A basic snippet of my code looks like this (the collection has over 10K documents):

import pymongo as pm

client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')

cur = coll.find({}, batch_size=500)

However, the cursor always returns the full collection size immediately. I'm using it as described in the docs.

Does anyone have an idea how I would properly iterate over the collection in batches? There are ways to loop over the output of the find() method, but that would still get the full collection first, and would only loop over the already pulled documents in memory. The batch_size parameter is supposed to fetch one batch at a time, making a round-trip to the server for each batch, to save memory.

PyMongo has some quality-of-life helpers on the Cursor class, so it automatically does the batching for you and returns results to you as documents.

The batch_size setting is respected, but the idea is that you only need to set it in the find() method; you don't have to make manual low-level calls or iterate through the batches yourself.

For example, if I have 100 documents in my collection:

> db.test.count()
100

I then set the profiling level to log all queries:

> db.setProfilingLevel(0,-1)
{
  "was": 0,
  "slowms": 100,
  "sampleRate": 1,
  "ok": 1,
...

I then use pymongo to specify a batch_size of 10:

import pymongo
import bson

conn = pymongo.MongoClient()
cur = conn.test.test.find({}, {'txt':0}, batch_size=10)
print(list(cur))

Running that query, I see in the MongoDB log:

2019-02-22T15:03:54.522+1100 I COMMAND  [conn702] command test.test command: find { find: "test", filter: {} ....
2019-02-22T15:03:54.523+1100 I COMMAND  [conn702] command test.test command: getMore { getMore: 266777378048, collection: "test", batchSize: 10, .... 
(getMore repeated 9 more times)

So the query was fetched from the server in the specified batches. It's just hidden from you by the Cursor class.
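Note that print(list(cur)) in the snippet above still materializes every document in memory at once; the batching only governs how the documents travel over the wire. If the goal is to keep memory usage down, a minimal sketch (using the same test collection as above) is to iterate the cursor directly, so the driver only buffers the current batch:

import pymongo

conn = pymongo.MongoClient()

# Iterating lazily issues one getMore of up to 10 documents at a time;
# only the current batch is buffered by the driver, not the whole collection.
for doc in conn.test.test.find({}, batch_size=10):
    print(doc['_id'])  # replace with real per-document processing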

Edit

If you really need to get the documents in batches, there is a find_raw_batches() method on Collection (doc link). This method works similarly to find() and accepts the same parameters. However, be advised that it returns raw BSON, which needs to be decoded by the application in a separate step. Notably, this method does not support sessions.
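A minimal sketch of using it (assuming the same 'db'/'coll' names from the question, and decoding each batch with bson.decode_all):

import pymongo
import bson

client = pymongo.MongoClient()
coll = client.get_database('db').get_collection('coll')

# Each item yielded by the raw-batch cursor is one batch of concatenated
# BSON documents (as bytes), so a whole batch is decoded in a single step.
for raw_batch in coll.find_raw_batches({}, batch_size=500):
    docs = bson.decode_all(raw_batch)
    # process this batch of up to 500 decoded documents here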

Having said that, if the aim is to lower the application's memory usage, it's worth considering modifying the query so that it uses ranges instead. For example:

find({'field': {'$gte': <some value>, '$lte': <some other value>}})

Range queries are easier to optimize, can use indexes, and are (in my opinion) easier to debug and easier to restart should the query get interrupted. Batching is less flexible in that respect: if the query gets interrupted, you have to restart it from scratch and go over all the batches again.
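As a minimal sketch of that idea (assuming the same client and collection names as the question, and that paginating by _id in ascending order is acceptable), an interrupted run can simply resume from the last _id it processed:

import pymongo

client = pymongo.MongoClient()
coll = client.get_database('db').get_collection('coll')

PAGE_SIZE = 500
last_id = None
while True:
    # Resume after the last _id seen; the default _id index keeps this cheap,
    # and a restarted run can pick up from a persisted last_id.
    query = {} if last_id is None else {'_id': {'$gt': last_id}}
    page = list(coll.find(query).sort('_id', pymongo.ASCENDING).limit(PAGE_SIZE))
    if not page:
        break
    # process this page of documents here
    last_id = page[-1]['_id']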

This is how I do it. It helps get the data chunked up, but I thought there would be a more straightforward way to do this. I created a yield_rows function that builds and yields chunks of documents from the cursor, and it ensures that already-consumed chunks are deleted.

import pymongo as pm

CHUNK_SIZE = 500
client = pm.MongoClient()
coll = client.get_database('db').get_collection('coll')
cursor = coll.find({}, batch_size=CHUNK_SIZE)

def yield_rows(cursor, chunk_size):
    """
    Generator that yields lists of up to chunk_size documents from the cursor.
    :param cursor: a PyMongo cursor to consume
    :param chunk_size: maximum number of documents per yielded chunk
    :return: yields lists of documents
    """
    chunk = []
    for i, row in enumerate(cursor):
        if i % chunk_size == 0 and i > 0:
            yield chunk
            del chunk[:]  # clear the chunk that has already been consumed
        chunk.append(row)
    yield chunk

chunks = yield_rows(cursor, CHUNK_SIZE)
for chunk in chunks:
    # do processing here
    pass

If I find a cleaner, more efficient way to do this, I'll update my answer.
