We have an index with a large number of user reports (millions to tens of millions). Assuming most users have reports, we need to calculate some statistics per user. For example: how many users average 10-15, 15-20, or 20-30 reports per week within a given time interval? Note that we don't need the per-user buckets returned in the response, but they do need to be evaluated by the sub-aggregations that calculate the averages and ranges. To my understanding, Elasticsearch limits the number of buckets, and raising that limit into the millions is not recommended. I've read about using the composite aggregation for pagination, but I don't think it suits this scenario, since we need aggregate numbers rather than the buckets themselves.
Below is a simplified version of our current query. We want to calculate the number of users that have between X1 and X2 monthly reports between two dates.
The problem is that the original bucketing (the terms aggregation) only returns a relatively small number of buckets (not millions), so only a small fraction of users is evaluated at all. What would be the best way to achieve this?
POST /reports/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "2020-01-01T00:00:00.000Z",
        "lte": "2020-12-24T23:59:59.999Z",
        "format": "strict_date_optional_time"
      }
    }
  },
  "aggs": {
    "distinctIds_less_than_monthly": {
      "terms": {
        "field": "userId" // this will only return a small amount of buckets
      },
      "aggs": {
        "less_than_monthly": {
          "bucket_selector": {
            "buckets_path": {
              "distinctUsers": "distinctUsers_less_than_monthly.value"
            },
            "script": "params.distinctUsers > 1000 && params.distinctUsers < 1500"
          }
        },
        "distinctUsers_less_than_monthly": {
          "value_count": {
            "field": "userId"
          }
        }
      }
    },
    "userCount_less_than_monthly": {
      "stats_bucket": {
        "buckets_path": "distinctIds_less_than_monthly._count"
      }
    }
  }
}
I see essentially 3 optimizations, all of which share a map → combine approach:
terms
aggs within those groups. Then combine.
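One possible reading of the map → combine idea, offered only as a minimal sketch and not necessarily what this answer had in mind: page through the per-user buckets with a composite aggregation (the "map" step) and combine the per-user counts on the client. The field names `userId` and `timestamp`, the date range, and the 1000/1500 thresholds come from the question; the function names `build_page_query`, `iter_user_counts`, and `count_users_in_range`, and the `fetch_page` callable, are hypothetical placeholders for whatever client call sends a body to `POST /reports/_search`.

```python
from typing import Callable, Iterator, Optional


def build_page_query(after_key: Optional[dict], page_size: int = 1000) -> dict:
    """Build one page of a composite aggregation keyed by userId."""
    composite: dict = {
        "size": page_size,
        "sources": [{"userId": {"terms": {"field": "userId"}}}],
    }
    if after_key is not None:
        # Resume after the last bucket of the previous page.
        composite["after"] = after_key
    return {
        "size": 0,
        "query": {
            "range": {
                "timestamp": {
                    "gte": "2020-01-01T00:00:00.000Z",
                    "lte": "2020-12-24T23:59:59.999Z",
                }
            }
        },
        "aggs": {"users": {"composite": composite}},
    }


def iter_user_counts(fetch_page: Callable[[dict], dict]) -> Iterator[int]:
    """Map step: yield each user's report count, one page at a time."""
    after_key = None
    while True:
        response = fetch_page(build_page_query(after_key))
        agg = response["aggregations"]["users"]
        for bucket in agg["buckets"]:
            yield bucket["doc_count"]
        after_key = agg.get("after_key")
        # Stop when the page is empty or no continuation key is returned.
        if after_key is None or not agg["buckets"]:
            break


def count_users_in_range(
    fetch_page: Callable[[dict], dict], low: int, high: int
) -> int:
    """Combine step: count users whose report total lies strictly between low and high."""
    return sum(1 for n in iter_user_counts(fetch_page) if low < n < high)
```

Because only one page of buckets is held in memory at a time, the bucket limit applies per request rather than to the full user population; the trade-off is one round trip per page. Calling `count_users_in_range(fetch_page, 1000, 1500)` would mirror the thresholds in the question's `bucket_selector` script.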