Here is the query:
```ts
const tags = await mongo
  .collection("positive")
  .aggregate<{ word: string; count: number }>([
    {
      $lookup: {
        from: "search_history",
        localField: "search_id",
        foreignField: "search_id",
        as: "history",
        pipeline: [
          {
            $match: {
              created_at: { $gt: prevSunday.toISOString() },
            },
          },
          {
            $group: {
              _id: "$url",
            },
          },
        ],
      },
    },
    {
      $match: {
        history: { $ne: [] },
      },
    },
    {
      $group: {
        _id: "$word",
        url: {
          $addToSet: "$history._id",
        },
      },
    },
    {
      $project: {
        _id: 0,
        word: "$_id",
        count: {
          $size: {
            $reduce: {
              input: "$url",
              initialValue: [],
              in: {
                $concatArrays: ["$$value", "$$this"],
              },
            },
          },
        },
      },
    },
    {
      $sort: {
        count: -1,
      },
    },
    {
      $limit: 50,
    },
  ])
  .toArray();
```
I think I need an index, but I'm not sure how or where to add one.
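`prevSunday` isn't defined in the snippet; for context, a plausible sketch of how it could be computed (an assumption, not the actual application code):

```typescript
// Hypothetical helper: midnight UTC of the most recent Sunday strictly
// before `now` (the real definition of prevSunday isn't shown above).
function prevSundayOf(now: Date): Date {
  const d = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate()),
  );
  // getUTCDay(): 0 = Sunday; if today is Sunday, go back a full week
  const back = d.getUTCDay() === 0 ? 7 : d.getUTCDay();
  d.setUTCDate(d.getUTCDate() - back);
  return d;
}

const prevSunday = prevSundayOf(new Date());
```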
Performance of this operation should probably be revisited only after confirming that it satisfies the desired application logic and that the approach itself is reasonable.
When it comes to performance, there is nothing that can be done to improve efficiency on the `positive` collection if the intention is to process every document: by definition, processing all documents requires a full collection scan.
To efficiently support the `$lookup` on the `search_history` collection, you may wish to confirm that an index on `{ search_id: 1, created_at: 1, url: 1 }` exists. Providing the `.explain("allPlansExecution")` output would allow us to better understand the current performance characteristics.
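That index can be created in mongosh like so (run against your own database; the collection name comes from the pipeline):

```js
// Compound index supporting the $lookup: equality match on search_id,
// range filter on created_at, with url available to the inner $group
db.search_history.createIndex({ search_id: 1, created_at: 1, url: 1 })
```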
Updating the question to include details about the schemas and the purpose of the aggregation would be very helpful with respect to understanding the overall situation. Just looking at the aggregation, it appears to be doing the following:

1. For every document in the `positive` collection, adding a new field called `history` containing the distinct `url` values from the `search_history` collection where the corresponding document has a matching `search_id` value and was `created_at` after last Sunday.
2. Removing documents unless the `history` field has at least one entry.
3. Grouping by `word`. The `$addToSet` operator is used here, but it may be generating an array of arrays rather than de-duplicated `url`s.
4. Counting the number of `url`s and returning the top 50 `word` results sorted on that size in descending order.

Is this what you want? In particular, the following aspects may be worth confirming:

- Do you really want to process every document in the `positive` collection? This may be the case, but it's impossible to tell without any schema/use-case context.
- Is the count of `url`s correct? It seems like you may need to use a `$map` when doing the `$addToSet` in the `$group` instead of using `$reduce` in the subsequent `$project`.
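To make the distinct-count concern concrete, here is the difference in plain TypeScript (illustrative data, not the pipeline itself):

```typescript
// Simulating the $group/$project stages. Each element of historyIds is one
// document's array of distinct urls (the history._id values) for one word.
const historyIds: string[][] = [
  ["a.com", "b.com"],
  ["b.com", "c.com"], // b.com also appears in another document's array
];

// What the current pipeline computes: $addToSet collects the arrays,
// $reduce/$concatArrays flattens them, $size counts — duplicates survive.
const flattened = historyIds.flat();
const currentCount = flattened.length; // 4: b.com is counted twice

// What is probably intended: the number of *distinct* urls.
const distinctCount = new Set(flattened).size; // 3
```

In the pipeline itself, one possible fix is to de-duplicate after flattening (e.g. `$setUnion` over the concatenated arrays) before taking `$size`.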
The best thing to do is to limit the number of documents passed to each stage. In an aggregation, MongoDB uses an index only in the first stage, and only if that stage is a `$match`, with at most one index. So the best thing to do is to have a `$match` on an indexed field that is very restrictive.
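As a sketch of that advice applied here (the filter field and values are hypothetical; use whatever selective, indexed predicate your application allows):

```typescript
// Hypothetical: a selective first $match (assumes an index on { word: 1 })
const firstStage = { $match: { word: { $in: ["foo", "bar"] } } };

// It must be the first stage for the index to be usable:
const pipeline: object[] = [
  firstStage,
  // ...the $lookup / $group / $project / $sort stages from the question...
];
```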
Moreover, please note that `$limit`, `$skip` and `$sample` are not panaceas, because they still scan the entire collection.
A way to efficiently limit the number of documents selected in the first stage is to use "pagination". You can make it work like this:

**Once every X requests**

- Split the collection into `nbChunks` chunks of roughly Y documents each and store the boundary `_id`s under keys `id:0`, `id:1`, …, `id:nbChunks`

**Every request**

- Get `used` and `nbChunks`
- Compute `id0` and `id1` from the keys `id:${used%nbChunks}` and `id:${(used%nbChunks)+1}` respectively
- Add a first `$match` stage with `_id: { $gte: ObjectId(id0), $lt: ObjectId(id1) }`
- Increment `used`; if `used > X`, then update the chunks

**Further optimisation**
If using redis, prefix every key with `${cluster.worker.id}:` to avoid hot keys.
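A minimal sketch of that key scheme (the worker id would come from Node's `cluster.worker.id`; hardcoded here for illustration):

```typescript
// Prefix every redis key with the cluster worker id so each worker
// reads/writes its own keys, avoiding a single hot key.
const makeKey = (workerId: number, name: string): string => `${workerId}:${name}`;

makeKey(3, "used"); // "3:used"
makeKey(3, "id:2"); // "3:id:2"
```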
**Notes**

Since the `$match` is the first stage, uses a half-open `$gte`/`$lt` range, and `_id` is an indexed field, this first stage is really fast and limits the scan to at most Y documents.

Hope it helps =) Make sure to ask more if needed =)
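The per-request chunk selection described above can be sketched like this (the stored boundary ids and counter values are made up; a real implementation would read them from redis and wrap the bounds in `ObjectId(...)`):

```typescript
// In-memory stand-in for redis: counter plus chunk boundary _ids,
// refreshed once every X requests.
const store = new Map<string, string>([
  ["used", "7"],
  ["nbChunks", "4"],
  ["id:0", "64a000000000000000000000"],
  ["id:1", "64a100000000000000000000"],
  ["id:2", "64a200000000000000000000"],
  ["id:3", "64a300000000000000000000"],
  ["id:4", "64a400000000000000000000"],
]);

const used = Number(store.get("used"));
const nbChunks = Number(store.get("nbChunks"));
const id0 = store.get(`id:${used % nbChunks}`)!; // 7 % 4 = 3 → "id:3"
const id1 = store.get(`id:${(used % nbChunks) + 1}`)!; // "id:4"

// First pipeline stage: an indexed range scan on _id.
const firstStage = {
  $match: { _id: { $gte: id0, $lt: id1 } },
};

store.set("used", String(used + 1));
```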