How can I optimize this query in MongoDB?

Here is the query:

    const tags = await mongo
      .collection("positive")
      .aggregate<{ word: string; count: number }>([
        {
          $lookup: {
            from: "search_history",
            localField: "search_id",
            foreignField: "search_id",
            as: "history",
            pipeline: [
              {
                $match: {
                  created_at: { $gt: prevSunday.toISOString() },
                },
              },
              {
                $group: {
                  _id: "$url",
                },
              },
            ],
          },
        },
        {
          $match: {
            history: { $ne: [] },
          },
        },
        {
          $group: {
            _id: "$word",
            url: {
              $addToSet: "$history._id",
            },
          },
        },
        {
          $project: {
            _id: 0,
            word: "$_id",
            count: {
              $size: {
                $reduce: {
                  input: "$url",
                  initialValue: [],
                  in: {
                    $concatArrays: ["$$value", "$$this"],
                  },
                },
              },
            },
          },
        },
        {
          $sort: {
            count: -1,
          },
        },
        {
          $limit: 50,
        },
      ])
      .toArray();

I think I need an index but I'm not sure how or where to add it.

Perhaps the performance of this operation should be revisited after we confirm that it satisfies the desired application logic, i.e. that the approach itself is reasonable.

When it comes to performance, there is nothing that can be done to improve efficiency on the positive collection if the intention is to process every document. By definition, processing all documents requires a full collection scan.

To efficiently support the $lookup on the search_history collection, you may wish to confirm that an index on { search_id: 1, created_at: 1, url: 1 } exists. Providing the .explain("allPlansExecution") output would allow us to better understand the current performance characteristics.
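
For reference, a minimal sketch of what that could look like with the Node.js driver, using the mongo handle from the question; pipeline here is just a placeholder for the stages shown above, and the index itself is a suggestion to verify rather than a confirmed fix:

    // Compound index that can support the $lookup sub-pipeline:
    // equality match on search_id, range filter on created_at, url used by the $group.
    await mongo
      .collection("search_history")
      .createIndex({ search_id: 1, created_at: 1, url: 1 });

    // Ask the server how the aggregation is actually executed.
    const plan = await mongo
      .collection("positive")
      .aggregate(pipeline)
      .explain("allPlansExecution");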

Desired Logic

Updating the question to include details about the schemas and the purpose of the aggregation would be very helpful with respect to understanding the overall situation. Just looking at the aggregation, it appears to be doing the following:

  • For every single document in the positive collection, add a new field called history.
  • This new field is a list of url values from the search_history collection where the corresponding document has a matching search_id value and a created_at after last Sunday.
  • The aggregation then filters to only keep documents where the new history field has at least one entry.
  • The next stage then groups the results together by word. The $addToSet operator is used here, but it may be generating an array of arrays rather than de-duplicated urls.
  • The final 3 stages of the aggregation seem to be focused on calculating the number of urls and returning the top 50 results by word, sorted on that size in descending order.

Is this what you want? In particular, the following aspects may be worth confirming:

  • Is it your intention to process every document in the positive collection? This may be the case, but it's impossible to tell without any schema/use-case context.
  • Is the size calculation of the urls correct? It seems like you may need to use a $map when doing the $addToSet for the $group instead of using $reduce for the subsequent $project (see the sketch below).
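
On that second point, here is one possible adjustment, assuming the intent is to count distinct url values per word. It swaps the $concatArrays in the $reduce for $setUnion (a slightly different route than the $map mentioned above) so duplicates are dropped before $size; treat it as a sketch rather than a drop-in fix:

    {
      $group: {
        _id: "$word",
        // "$history._id" already resolves to an array of distinct urls per source document
        urls: { $addToSet: "$history._id" },
      },
    },
    {
      $project: {
        _id: 0,
        word: "$_id",
        count: {
          $size: {
            $reduce: {
              input: "$urls",
              initialValue: [],
              // $setUnion merges and de-duplicates, unlike $concatArrays
              in: { $setUnion: ["$$value", "$$this"] },
            },
          },
        },
      },
    },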

The best thing to do is to limit the number of documents passed to each stage. Indexes are used by mongo in aggregations only in the first stage, and only if it is a $match, using at most 1 index.

So the best thing to do is to have a $match on a very restrictive indexed field.

Moreover, please note that $limit, $skip and $sample are not panaceas, because they can still end up scanning a large part of the collection.

A way to efficiently limit the number of documents selected in the first stage is to use "pagination". You can make it work like this (rough sketches follow each list below):

Once every X requests

  1. Count the number of docs in the collection
  2. Divide this into chunks of Yk max
  3. Find the _ids of the docs at positions Y, 2Y, 3Y, etc. with skip and limit
  4. Cache the results in redis/memcache (or as a global variable if you really cannot do otherwise)
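
A rough sketch of that periodic setup with the Node.js driver, assuming a node-redis style client named redis, an illustrative chunk size Y, and the positive collection from the question:

    const total = await mongo.collection("positive").countDocuments();
    const nbChunks = Math.ceil(total / Y);

    for (let i = 0; i < nbChunks; i++) {
      // _id of the document that opens chunk i, found by skipping i * Y docs in _id order
      const [boundary] = await mongo
        .collection("positive")
        .find({}, { projection: { _id: 1 }, sort: { _id: 1 }, skip: i * Y, limit: 1 })
        .toArray();
      await redis.set(`id:${i}`, boundary._id.toHexString());
    }
    await redis.set("nbChunks", String(nbChunks));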

Every request

  1. Get the current chunk to scan by reading the redis keys used and nbChunks
  2. Get the _ids cached in redis that delimit the next aggregation, at keys id:${used%nbChunks} and id:${(used%nbChunks)+1} respectively
  3. Aggregate using a first $match with _id: { $gte: ObjectId(id0), $lt: ObjectId(id1) }
  4. Increment used; if used > X then refresh the chunks
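
A sketch of the per-request side, again with illustrative names (redis, Y and the key layout are assumptions carried over from the previous sketch; the rest of the pipeline is the one from the question):

    import { ObjectId } from "mongodb";

    const used = Number(await redis.incr("used"));
    const nbChunks = Number(await redis.get("nbChunks"));
    const chunk = used % nbChunks;

    const id0 = await redis.get(`id:${chunk}`);
    const id1 = await redis.get(`id:${chunk + 1}`); // null when this is the last chunk

    // Indexed _id range that limits the scan to at most ~Y documents
    const idRange: { $gte: ObjectId; $lt?: ObjectId } = { $gte: new ObjectId(id0!) };
    if (id1) idRange.$lt = new ObjectId(id1); // no $lt when scanning the last chunk

    const tags = await mongo
      .collection("positive")
      .aggregate([{ $match: { _id: idRange } } /* ...followed by the original stages */])
      .toArray();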

Further optimisation

If using redis, prefix every key with ${cluster.worker.id}: to avoid hot keys.
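
For example (a hypothetical key built with Node's cluster module; chunk comes from the sketch above):

    import cluster from "node:cluster";

    // One key namespace per worker so a single redis key never becomes a hot spot.
    const key = `${cluster.worker?.id}:id:${chunk}`;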

Notes

  1. Step 3) of the chunk setup can be a really long and intensive process, so do it only when necessary, say every X ~ 1k requests.
  2. If you are scanning the last chunk, do not put the $lt bound.
  3. Once this process is implemented, your job is to find the sweet spot of X and Y that suits your needs, constrained by a Y large enough to retrieve as many documents as possible without the query taking too long, and an X that keeps the chunks roughly equal as the collection grows.
  4. This process is a bit long to implement, but once it is, the time complexity is ~O(Y) rather than ~O(N). Indeed, since the $match is the first stage and _id is an indexed field, this first stage is really fast and limits the scan to at most Y documents.

Hope it helps =) Make sure to ask more if needed =)
