MongoDB: Aggregation ($sort) on a union of collections very slow

I have a few collections that I need to union and then query. However, this turns out to be very slow. The explain output is not much help, as it only tells me whether the first $match stage uses an index. I am using a pipeline like:

[
    {
        "$match": {
            "$and": [
                { ... }
            ]
        }
    },

    // repeat this chunk for each collection
    {
        "$unionWith": {
            "coll": "anotherCollection",
            "pipeline": [
                {
                    "$match": {
                        "$and": [
                            { ... }
                        ]
                    }
                },
            ]
        }
    },

    // Then an overall limit/handle pagination for all the unioned results
    // UPDATE: Realised the sort is the culprit
    { "$sort": { "createdAt": -1 } },
    { "$skip": 0},
    { "$limit": 50 }
]

Is there a better way to do such a query? Does MongoDB run the unions in parallel? Is there a "DB view" I can use to obtain a union of all the collections?

UPDATE: I just realised the runtime increases once I add the sort. I suspect it cannot use indexes because it is sorting a union?

Yes, there is a way. But it is not trivial: you need to change how you do pagination. It requires more engineering, as you have to keep track of the page not only by number but also by the last elements found.

If you paginate by filtering on a unique identifier (usually _id) with a cursor, you can filter early.

Important: you will need to keep track of the last item found in each collection instead of skipping a number of elements. If you don't, you will lose track of the pagination and may never return some data, or return some of it twice, which is far worse than being slow.
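The bookkeeping described above might look like the following minimal sketch in plain JavaScript. It assumes every document in a returned page carries a `source` field naming the collection it came from; that field is hypothetical (it could, for instance, be added with an `$addFields` stage in each branch of the union). The helper name `nextCursors` and the collection names are illustrative and not part of any MongoDB API.

```javascript
// Hypothetical helper: advance the per-collection cursors after a page was shown.
// Assumes `page` is in display order and every document has `_id` plus a
// `source` field naming its collection (an assumption, not a MongoDB feature).
function nextCursors(previous, page) {
  const cursors = { ...previous };
  for (const doc of page) {
    // Later documents overwrite earlier ones, so each collection ends up
    // with the _id of its last document that was actually displayed.
    cursors[doc.source] = doc._id;
  }
  return cursors;
}

// Illustrative page: plain numbers stand in for ObjectIds.
const page = [
  { _id: 9, source: "collectionA" },
  { _id: 7, source: "collectionB" },
  { _id: 5, source: "collectionA" },
];
console.log(nextCursors({ collectionA: null, collectionB: null }, page));
```

A collection that contributed nothing to the page keeps its previous cursor, so its next query starts from the same position as before.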

[
    {
        "$match": {
            "$and": [
                { ... }
            ],
            "_id": { "$lt": lastKnownIdOfCollectionA } // filters out everything you already saw, so no skip needed; $lt because the sort is descending (newest first)
        }
    },
    { "$sort": { "createdAt": -1 } }, // this sort can use an index on createdAt
    { "$limit": 50 }, // maybe you will take 0 but max 50, you don't care about the rest
    // repeat this chunk for each collection
    {
        "$unionWith": {
            "coll": "anotherCollection",
            "pipeline": [
                {
                    "$match": {
                        "$and": [
                            { ... }
                        ],
                        "_id": { "$lt": lastKnownIdOfCollectionB } // filters out everything you already saw, so no skip needed; $lt because the sort is descending (newest first)
                    }
                },
                { "$sort": { "createdAt": -1 } }, // this sort can use an index on createdAt
                { "$limit": 50 } // maybe you will take 0 but max 50, you don't care about the rest
            ]
        }
    },

    // At this point you have MAX 100 elements, an index is not needed for sorting :)
    { "$sort": { "createdAt": -1 } },
    { "$skip": 0},
    { "$limit": 50 }
]

In this example, I do the early filter on _id, which (for a default ObjectId) also embeds the creation timestamp. If the filtering is not about the creation date, you will have to decide which identifier suits best. Remember that the identifier must be unique, but you can combine more than one value (e.g. createdAt + randomizedId).
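A combined identifier could be turned into a filter as in the sketch below. It assumes the results are sorted by `{ createdAt: -1, _id: -1 }` and uses `_id` as the tiebreaker in place of the randomized id mentioned above; `keysetFilter` is a hypothetical helper name, and the returned object is meant to be merged into the `$match` stage of each branch.

```javascript
// Hypothetical helper: build a $match filter selecting everything that comes
// strictly after `last` when sorting by { createdAt: -1, _id: -1 }.
function keysetFilter(last) {
  if (!last) return {}; // first page: there is no cursor yet
  return {
    $or: [
      // strictly older documents
      { createdAt: { $lt: last.createdAt } },
      // same timestamp: fall back to the unique tiebreaker
      { createdAt: last.createdAt, _id: { $lt: last._id } },
    ],
  };
}

// Illustrative values: plain numbers stand in for a Date and an ObjectId.
console.log(JSON.stringify(keysetFilter({ createdAt: 1700000000, _id: 42 })));
```

The tiebreaker matters because several documents can share the same `createdAt`; without it, documents with the boundary timestamp could be skipped or repeated.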

No, there is no way.不,没有办法。

There are views, but they won't help with performance. They are merely syntactic sugar that saves the aggregation query on the database side. They run exactly the same query under the hood.
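For reference, a view over the union could be defined roughly as sketched here. The helper `unionViewPipeline` and the names `firstCollection`, `yetAnotherCollection` and `allItems` are hypothetical; only `$unionWith` and `db.createView` are MongoDB features. As noted, the view merely stores this pipeline and re-runs it on every query.

```javascript
// Hypothetical helper: build the pipeline for a view that unions several
// collections onto the view's source collection.
function unionViewPipeline(otherCollections) {
  return otherCollections.map((coll) => ({ $unionWith: { coll } }));
}

const viewPipeline = unionViewPipeline(["anotherCollection", "yetAnotherCollection"]);
console.log(JSON.stringify(viewPipeline));

// In mongosh the view could then be created with (collection names assumed):
//   db.createView("allItems", "firstCollection", viewPipeline)
```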

There are materialised views, which are essentially collections with saved results. They are very fast, with the caveat of very eventual consistency, and only if you don't forget to refresh them. I can imagine only a few very niche cases where materialised views would be helpful.
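One way such a refresh might be written is with a `$merge` stage at the end of the union, as in this sketch. It assumes `_id` values are unique across the source collections (true for default ObjectIds); `refreshPipeline` and the target name `allItemsMaterialised` are hypothetical.

```javascript
// Hypothetical helper: union the given collections and write the result into
// `target`, replacing documents that already exist there (matched on _id).
function refreshPipeline(otherCollections, target) {
  return [
    ...otherCollections.map((coll) => ({ $unionWith: { coll } })),
    { $merge: { into: target, whenMatched: "replace", whenNotMatched: "insert" } },
  ];
}

console.log(JSON.stringify(refreshPipeline(["anotherCollection"], "allItemsMaterialised")));

// In mongosh the refresh could be run with (collection names assumed):
//   db.firstCollection.aggregate(refreshPipeline(["anotherCollection"], "allItemsMaterialised"))
//   db.allItemsMaterialised.createIndex({ createdAt: -1 }) // lets the sort use an index
```

Note that this kind of refresh only inserts and replaces; documents deleted from the source collections would stay in the target until they are removed some other way.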

You will be far better off storing all documents in a single collection in the first place if you aim to optimise such queries.
