MongoDB Aggregation with $sample very slow
There are many ways to select a random document from a MongoDB collection (as discussed in this answer). Comments point out that with MongoDB version >= 3.2, using $sample in the aggregation framework is preferred. However, on a collection with many small documents this seems to be extremely slow.
The following code uses mongoengine to simulate the issue and compare it to the "skip random" method:
import timeit
from random import randint

import mongoengine as mdb

mdb.connect("test-agg")

class ACollection(mdb.Document):
    name = mdb.StringField(unique=True)
    meta = {'indexes': ['name']}

ACollection.drop_collection()
ACollection.objects.insert([ACollection(name="Document {}".format(n)) for n in range(50000)])

def agg():
    doc = list(ACollection.objects.aggregate({"$sample": {'size': 1}}))[0]
    print(doc['name'])

def skip_random():
    n = ACollection.objects.count()
    doc = ACollection.objects.skip(randint(1, n)).limit(1)[0]
    print(doc['name'])

if __name__ == '__main__':
    print("agg took {:2.2f}s".format(timeit.timeit(agg, number=1)))
    print("skip_random took {:2.2f}s".format(timeit.timeit(skip_random, number=1)))
The result is:
Document 44551
agg took 21.89s
Document 25800
skip_random took 0.01s
Wherever I've had performance issues with MongoDB in the past, my answer has always been to use the aggregation framework, so I'm surprised that $sample is so slow.
Am I missing something here? What is it about this example that is causing the aggregation to take so long?
I can confirm that nothing has changed in 3.6: the slow $sample problem persists. Shame on you, MongoDB team.
~40M collection of small documents, no indexes, Windows Server 2012 x64.
storage: wiredTiger.engineConfig.journalCompressor: zlib, wiredTiger.collectionConfig.blockCompressor: zlib
2018-04-02T02:27:27.743-0700 I COMMAND [conn4] command maps.places command: aggregate { aggregate: "places", pipeline: [ { $sample: { size: 10 } } ], cursor: {}, lsid: { id: UUID("0e846097-eecd-40bb-b47c-d77f1484dd7e") }, $readPreference: { mode: "secondaryPreferred" }, $db: "maps" } planSummary: MULTI_ITERATOR keysExamined:0 docsExamined:0 cursorExhausted:1 numYields:3967 nreturned:10 reslen:550 locks:{ Global: { acquireCount: { r: 7942 } }, Database: { acquireCount: { r: 3971 } }, Collection: { acquireCount: { r: 3971 } } } protocol:op_query 72609ms
I have installed Mongo to try this "modern and performant DBMS" in a serious project. How deeply frustrated I am.
The explain plan is here:
db.command('aggregate', 'places', pipeline=[{"$sample":{"size":10}}], explain=True)
{'ok': 1.0,
'stages': [{'$cursor': {'query': {},
'queryPlanner': {'indexFilterSet': False,
'namespace': 'maps.places',
'plannerVersion': 1,
'rejectedPlans': [],
'winningPlan': {'stage': 'MULTI_ITERATOR'}}}},
{'$sampleFromRandomCursor': {'size': 10}}]}
This is the result of a known bug in the WiredTiger engine in MongoDB versions < 3.2.3. Upgrading to the latest version should solve this.
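If you are unsure which server version you are running, a quick sanity check is to compare it against the fixed release. This is a sketch: the 3.2.3 threshold comes from the bug described above, the helper name is mine, and the pymongo call in the comment assumes a reachable mongod.

```python
def predates_sample_fix(version_str, fixed=(3, 2, 3)):
    """Return True if a MongoDB version string is older than the
    release said to fix the slow WiredTiger $sample behaviour."""
    parts = tuple(int(p) for p in version_str.split(".")[:3])
    return parts < fixed

# With a live server you could feed this the reported version, e.g.:
#   from pymongo import MongoClient
#   predates_sample_fix(MongoClient().server_info()["version"])
print(predates_sample_fix("3.2.2"))  # True  -> affected by the bug
print(predates_sample_fix("3.6.0"))  # False
```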
For those who are confused by $sample: $sample is efficient under the following conditions:
- $sample is the first stage of the pipeline
- N is less than 5% of the total documents in the collection
If any of the above conditions are NOT met, $sample performs a collection scan followed by a random sort to select N documents.
More on: https://docs.mongodb.com/manual/reference/operator/aggregation/sample/
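The conditions above can be sketched as a tiny predicate (a hypothetical helper, not a MongoDB API; the 5% threshold mirrors the documented rule):

```python
def sample_uses_random_cursor(sample_size, collection_size, first_stage=True):
    """Predict whether $sample can take the fast pseudo-random cursor
    path, per the documented conditions (hypothetical helper)."""
    return first_stage and sample_size < 0.05 * collection_size

# The question's case: size=1 out of 50,000 documents with $sample first
# in the pipeline -- the fast path should apply, so the observed slowness
# points at the WiredTiger bug rather than the documented fallback.
print(sample_uses_random_cursor(1, 50000))         # True
print(sample_uses_random_cursor(5000, 50000))      # False (10% > 5%)
print(sample_uses_random_cursor(1, 50000, False))  # False (not first stage)
```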
Mongo states that:
If all of the following conditions are met, $sample uses a pseudo-random cursor to select documents:
- $sample is the first stage of the pipeline
- N is less than 5% of the total documents in the collection
- the collection contains more than 100 documents
If any of the above conditions are NOT met, $sample performs a collection scan followed by a random sort to select N documents. In this case, the $sample stage is subject to the sort memory restrictions.
I believe that in your case Mongo performs a full collection scan.
Reference: https://docs.mongodb.com/manual/reference/operator/aggregation/sample/