
Split Mongo Collection (by Index)

I currently have a long-running operation (running in Python + Celery) that goes through an entire Mongo collection of about 43,000,000 elements and performs an analysis on each element without making any changes to it.

As this collection has grown, the operation has (obviously) started to take longer, and it now fails intermittently, usually due to a timeout against a different database.

I would like to split this operation into several smaller operations, perhaps each operating on just a few million elements, and I'm wondering about the best way to produce the queries that will do the splitting. The only index on this collection is _id.

The obvious answer seemed to be something like:

# This is executed in parallel on different servers
def doAnalysis(skipped, limit):
    for doc in db.<collection>.find().skip(skipped).limit(limit):
        ...  # analyze doc

# This is the parent task
elemsToAnalyze = db.<collection>.find().count() // 10
for i in range(0, 10):
    doAnalysis(elemsToAnalyze * i, elemsToAnalyze)

But it turns out that .skip() takes a long time, basically as long as actually performing the analysis! Is there a better way to do this?

skip() can be very slow in a case like this. You could do range queries instead, using the last _id of each batch to query the next batch. Something like this:

db.<collection>.find({ "_id": { $gt: prev_batch_last_id } }).sort({ _id: 1 }).limit(limit);

You'll have to store the last _id of each batch in a variable yourself.
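A minimal sketch of that pattern in Python with pymongo (the database and collection names, the batch size, and the analysis placeholder are illustrative assumptions, not part of the original question or answer):

from pymongo import ASCENDING, MongoClient

client = MongoClient()
coll = client.mydb.my_collection  # hypothetical database/collection names

def analyze_in_batches(collection, batch_size=1_000_000):
    last_id = None
    while True:
        # Range query: everything strictly after the last _id seen so far.
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        cursor = collection.find(query).sort("_id", ASCENDING).limit(batch_size)
        seen = 0
        for doc in cursor:
            ...  # run the analysis on doc here, without modifying it
            last_id = doc["_id"]
            seen += 1
        if seen < batch_size:
            break  # a short batch means the collection is exhausted

analyze_in_batches(coll)

Because each query walks the _id index starting from last_id, beginning a batch is cheap no matter how deep into the collection you are, and the saved last_id values double as restart points if a batch fails.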
