Split Mongo Collection (by Index)
I currently have a long-running operation (running in Python + Celery) that goes through an entire Mongo collection of about 43,000,000 elements and does an analysis on the elements without making any changes to them.

As this collection has grown, the operation has started to take longer (obviously) and now periodically fails, usually due to a timeout to a different database.

I would like to split this operation into several smaller operations, perhaps operating on just a few million elements each, and I'm wondering about the best way to produce the queries that will do the splitting. I only have one index on this collection, and it's the _id.
The obvious answer seemed to be something like:
# This is executed in parallel on different servers
def doAnalysis(skipped, limit):
    for doc in db.<collection>.find().skip(skipped).limit(limit):
        ...

# This is the parent task
elemsToAnalyze = db.<collection>.find().count() // 10
for i in range(0, 10):
    doAnalysis(elemsToAnalyze * i, elemsToAnalyze)
But it turns out that .skip() takes a long time, basically just as long as actually performing the analysis! Is there a better way to do this?
skip() can be very slow in this kind of case. You could do range queries instead, using the last _id of each batch to query the next batch. Something like this:
db.<collection>.find({ "_id": { $gt: prev_batch_last_id } }).sort({ _id: 1 }).limit(limit);
You'll have to store the last _id of each batch in a variable yourself.
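A minimal sketch of that batching loop, with the Mongo call factored out so the paging logic is visible. The fetch callable stands in for something like coll.find(query).sort("_id", 1).limit(batch_size) in pymongo; the names iter_batches and fake_fetch are illustrative, not from the original post:

```python
def iter_batches(fetch, batch_size):
    """Yield batches of documents, paging on the last _id seen
    instead of skip()."""
    last_id = None
    while True:
        # First batch has no lower bound; later batches start
        # strictly after the last _id of the previous batch.
        query = {} if last_id is None else {"_id": {"$gt": last_id}}
        batch = fetch(query, batch_size)
        if not batch:
            return
        yield batch
        last_id = batch[-1]["_id"]

# Stand-in for a Mongo collection: an in-memory list sorted by _id.
docs = [{"_id": i} for i in range(10)]

def fake_fetch(query, limit):
    lo = query.get("_id", {}).get("$gt", -1)
    return [d for d in docs if d["_id"] > lo][:limit]

for batch in iter_batches(fake_fetch, 3):
    for doc in batch:
        pass  # run the analysis on doc here
```

With a real pymongo collection you would pass something like lambda q, n: list(coll.find(q).sort("_id", 1).limit(n)) as fetch. Because each query is an indexed range scan on _id, the cost of fetching batch N no longer grows with N the way skip(N * limit) does.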