
Split Mongo Collection (by Index)

I currently have a long-running operation (running in Python + Celery) that goes through an entire Mongo collection of about 43,000,000 elements and performs an analysis on each element without making any changes to it.

As this collection has grown, the operation has (obviously) started to take longer, and it now fails intermittently, usually due to a timeout against a different database.

I would like to split this operation into several smaller operations, perhaps each operating on just a few million elements, and I'm wondering about the best way to produce the queries that will do the splitting. The only index on this collection is _id.

The obvious answer seemed to be something like:

# This is executed in parallel on different servers
def doAnalysis(skipped, limit):
    for doc in db.<collection>.find().skip(skipped).limit(limit):
        ...  # analyze doc

# This is the parent task
elemsToAnalyze = db.<collection>.find().count() // 10
for i in range(0, 10):
    doAnalysis(elemsToAnalyze * i, elemsToAnalyze)

But it turns out that .skip() takes a long time, basically as long as actually performing the analysis! Is there a better way to do this?

skip() can be very slow in a case like this. You could do range queries instead, using the last _id of each batch to query the next batch. Something like this:

db.<collection>.find({ "_id": { $gt: prev_batch_last_id } }).sort({ _id: 1 }).limit(limit);

You'll have to store the last _id of each batch in a variable yourself.
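A minimal sketch of that pattern in Python with pymongo (the database and collection names, the batch size, and the analysis placeholder are illustrative assumptions, not part of the original question or answer):

from pymongo import ASCENDING, MongoClient

client = MongoClient()
coll = client.mydb.my_collection  # hypothetical database/collection names

def analyze_in_batches(collection, batch_size=1_000_000):
    last_id = None
    while True:
        # Range query: everything strictly after the last _id seen so far.
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        cursor = collection.find(query).sort("_id", ASCENDING).limit(batch_size)
        seen = 0
        for doc in cursor:
            ...  # run the analysis on doc here, without modifying it
            last_id = doc["_id"]
            seen += 1
        if seen < batch_size:
            break  # a short batch means the collection is exhausted

analyze_in_batches(coll)

Because each query walks the _id index starting from last_id, beginning a batch is cheap no matter how deep into the collection you are, and the saved last_id values double as restart points if a batch fails.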
