How can I optimize this query in MongoDB?
Here is the query:
const tags = await mongo
  .collection("positive")
  .aggregate<{ word: string; count: number }>([
    {
      $lookup: {
        from: "search_history",
        localField: "search_id",
        foreignField: "search_id",
        as: "history",
        pipeline: [
          {
            $match: {
              created_at: { $gt: prevSunday.toISOString() },
            },
          },
          {
            $group: {
              _id: "$url",
            },
          },
        ],
      },
    },
    {
      $match: {
        history: { $ne: [] },
      },
    },
    {
      $group: {
        _id: "$word",
        url: {
          $addToSet: "$history._id",
        },
      },
    },
    {
      $project: {
        _id: 0,
        word: "$_id",
        count: {
          $size: {
            $reduce: {
              input: "$url",
              initialValue: [],
              in: {
                $concatArrays: ["$$value", "$$this"],
              },
            },
          },
        },
      },
    },
    {
      $sort: {
        count: -1,
      },
    },
    {
      $limit: 50,
    },
  ])
  .toArray();
I think I need an index but I'm not sure how or where to add one.
Perhaps performance of this operation should be revisited after we confirm that it satisfies the desired application logic and that the approach itself is reasonable.
When it comes to performance, there is nothing that can be done to improve efficiency on the positive collection if the intention is to process every document. By definition, processing all documents requires a full collection scan.
To efficiently support the $lookup on the search_history collection, you may wish to confirm that an index on { search_id: 1, created_at: 1, url: 1 } exists. Providing the .explain("allPlansExecution") output would allow us to better understand the current performance characteristics.
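For reference, creating that index and collecting the explain output could look like this in mongosh (a sketch only; `pipeline` stands for the aggregation array from the question):

```javascript
// Compound index supporting the $lookup: equality on search_id first,
// then the created_at range filter, with url included so the inner
// pipeline's $group on "$url" can be served from the index.
db.search_history.createIndex({ search_id: 1, created_at: 1, url: 1 })

// Per-plan execution stats for the aggregation on the outer collection.
db.positive.explain("allPlansExecution").aggregate(pipeline)
```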
Updating the question to include details about the schemas and the purpose of the aggregation would be very helpful for understanding the overall situation. Just looking at the aggregation, it appears to be doing the following:
1. For every document in the positive collection, add a new field called history.
2. This new field is the list of unique url values from the search_history collection where the corresponding document has a matching search_id value and was created_at after last Sunday.
3. Keep only the documents whose history field has at least one entry.
4. Group the results by word. The $addToSet operator is used here, but it may be generating an array of arrays rather than de-duplicated urls.
5. Count the number of urls and return the top 50 results by word, sorted on that size in descending order.

Is this what you want? In particular, the following aspects may be worth confirming:
- Do you really want to process every document in the positive collection? This may be the case, but it's impossible to tell without any schema/use-case context.
- Is the size calculation of the urls correct? It seems like you may need to use a $map when doing the $addToSet for the $group instead of using $reduce for the subsequent $project.
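To see why the count may be off, here is a plain-JavaScript simulation (with made-up documents) of what $addToSet: "$history._id" followed by the $reduce/$concatArrays flattening produces:

```javascript
// Hypothetical documents as they look after the $lookup + history $match:
// each history entry is { _id: url } because of the inner $group on "$url".
const docs = [
  { word: "cheap", history: [{ _id: "a.com" }, { _id: "b.com" }] },
  { word: "cheap", history: [{ _id: "a.com" }] },
];

// $group { _id: "$word", url: { $addToSet: "$history._id" } }
// "$history._id" resolves to the ARRAY of _id values for each document,
// so $addToSet collects whole arrays, not individual urls.
const groups = new Map();
for (const doc of docs) {
  const ids = doc.history.map((h) => h._id); // "$history._id"
  const set = groups.get(doc.word) ?? [];
  if (!set.some((arr) => JSON.stringify(arr) === JSON.stringify(ids))) {
    set.push(ids); // de-duplicates arrays, not their elements
  }
  groups.set(doc.word, set);
}

// $project's $reduce with $concatArrays just flattens, so duplicates
// across the inner arrays survive into the $size count.
const url = groups.get("cheap"); // [["a.com", "b.com"], ["a.com"]]
const count = url.reduce((acc, arr) => acc.concat(arr), []).length;
console.log(count); // 3, even though only 2 distinct urls exist
```

If the intent is to count distinct urls per word, the flattened array needs a set-style de-duplication before $size.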
。 The best thing to do is to limit the number of documents passed to each stage.最好的办法是限制传递到每个阶段的文档数量。 Indexes are used by mongo in aggregations only in the first stage only if it's a match, using 1 index max.
So the best thing to do is to have a $match on an indexed field that is very restrictive.
Moreover, please note that $limit, $skip and $sample are not panaceas, because they still scan the entire collection.
A way to efficiently limit the number of documents selected in the first stage is to use "pagination". You can make it work like this:
Once every X requests
- Split the collection into nbChunks chunks and store the boundary _id of chunk n under the key id:${n}.

Every request
- Get the keys used and nbChunks.
- Get the boundaries of the chunk to scan: id:${used%nbChunks} and id:${(used%nbChunks)+1} respectively.
- Run the aggregation with a first-stage $match on _id: { $gte: ObjectId(id0), $lt: ObjectId(id1) }.
- Increment used; if used > X, update the chunks.

Further optimisation
If using redis, prefix every key with ${cluster.worker.id}: to avoid hot keys.
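A minimal in-memory sketch of this scheme (an illustrative Map stands in for redis; key names follow the steps above, and the boundary values are made-up strings rather than real ObjectIds):

```javascript
// Stand-in for redis: holds "used", "nbChunks" and the "id:${n}" boundaries.
const store = new Map();

const X = 100; // refresh the chunks once every X requests

// Once every X requests: store the chunk boundaries and reset the counter.
function updateChunks(boundaryIds) {
  store.set("nbChunks", boundaryIds.length - 1);
  boundaryIds.forEach((id, n) => store.set(`id:${n}`, id));
  store.set("used", 0);
}

// Every request: pick the current chunk and build the first $match stage.
function nextMatchStage() {
  const used = store.get("used");
  const nbChunks = store.get("nbChunks");
  const id0 = store.get(`id:${used % nbChunks}`);
  const id1 = store.get(`id:${(used % nbChunks) + 1}`);
  store.set("used", used + 1); // if (used > X) recompute the chunks here
  return { $match: { _id: { $gte: id0, $lt: id1 } } };
}

updateChunks(["000", "555", "aaa", "fff"]); // 3 chunks
const stage = nextMatchStage();
// first chunk: _id from "000" (inclusive) to "555" (exclusive)
console.log(stage.$match._id);
```

This stage would be prepended to the aggregation so that each request scans only one chunk instead of the whole collection.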
Notes
- With $match being the first stage and _id being an indexed field, this first stage is really fast and limits the scan to at most Y documents (the upper chunk boundary is matched with $lt, so chunks don't overlap).

Hope it helps =) Make sure to ask more if needed =)