Elasticsearch - sub aggregation with a large amount of buckets
We have an index with a large number of user reports (millions / tens of millions). Assuming most users have reports, we need to calculate some statistics per user. For example, how many users average 10-15, 15-20, or 20-30 reports per week within a specific time interval. Please note that we don't need to return the buckets themselves in the response, but they do need to be evaluated by the sub aggregations that calculate the averages and ranges. To my understanding, Elasticsearch limits the number of buckets, and raising that limit into the millions is not recommended. I've read about the composite aggregation for pagination, but I don't think it is suitable for this scenario, since we need to calculate aggregate numbers rather than return the buckets.
Below is a simplified version of our current query. We want to calculate the number of users that have between X1 and X2 monthly reports between two dates.
The problem is that the initial bucketing (the terms aggregation) will only return a relatively small number of buckets (not millions), so only a small number of users will be evaluated at all. What would be the best way to achieve this?
POST /reports/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "2020-01-01T00:00:00.000Z",
        "lte": "2020-12-24T23:59:59.999Z",
        "format": "strict_date_optional_time"
      }
    }
  },
  "aggs": {
    "distinctIds_less_than_monthly": {
      "terms": {
        "field": "userId" // this will only return a small amount of buckets
      },
      "aggs": {
        "less_than_monthly": {
          "bucket_selector": {
            "buckets_path": {
              "distinctUsers": "distinctUsers_less_than_monthly.value"
            },
            "script": "params.distinctUsers > 1000 && params.distinctUsers < 1500"
          }
        },
        "distinctUsers_less_than_monthly": {
          "value_count": {
            "field": "userId"
          }
        }
      }
    },
    "userCount_less_than_monthly": {
      "stats_bucket": {
        "buckets_path": "distinctIds_less_than_monthly._count"
      }
    }
  }
}
I see essentially 3 optimizations, all of which share a map → combine approach. For example, pre-group the users by the first letter of their name and run the terms aggs within those groups. Then combine.
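As a rough illustration of the map → combine idea, here is a minimal Python sketch. It swaps the "first letter of the name" grouping for the terms aggregation's built-in include partitioning (partition / num_partitions), which splits the userId values into groups in a similar way. The helper names, the aggregation name users, the size value and the search callable (any function that sends a request body to the reports index and returns the parsed JSON response) are assumptions made for the example, not part of the original query. It also assumes every partition holds fewer distinct users than size, so you would have to pick num_partitions accordingly.

```python
from collections import Counter


def build_partition_query(partition, num_partitions, gte, lte, size=10000):
    """Map step: request body covering one partition of the userId values."""
    return {
        "size": 0,
        "query": {"range": {"timestamp": {"gte": gte, "lte": lte}}},
        "aggs": {
            "users": {  # hypothetical aggregation name
                "terms": {
                    "field": "userId",
                    # Only userIds that hash into this partition are bucketed.
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                    "size": size,  # assumed to exceed the users per partition
                }
            }
        },
    }


def tally(response, weeks, ranges, counts):
    """Combine step: add one partition's users to the running range counts."""
    for bucket in response["aggregations"]["users"]["buckets"]:
        weekly_average = bucket["doc_count"] / weeks
        for low, high in ranges:
            if low <= weekly_average < high:  # half-open range [low, high)
                counts[(low, high)] += 1
                break
    return counts


def count_users_by_weekly_average(search, num_partitions, gte, lte, weeks, ranges):
    """Run one request per partition and merge the per-range user counts.

    `search` is a caller-supplied function (hypothetical): body -> response dict.
    """
    counts = Counter()
    for partition in range(num_partitions):
        body = build_partition_query(partition, num_partitions, gte, lte)
        tally(search(body), weeks, ranges, counts)
    return dict(counts)
```

Each request only ever materializes the buckets of one partition, and only the small per-range counters are kept on the client, so the individual user buckets never have to be returned all at once.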