Elasticsearch - sub-aggregations with a large number of buckets

Question

We have an index with a large number of user reports (millions to tens of millions). Assuming most users have reports, we need to calculate some statistics per user. For example, how many users average 10-15, 15-20, or 20-30 reports per week in a specific time interval.

Please note that we don't need to return the buckets themselves in the response, but they need to be evaluated by the sub-aggregations that calculate the average and the ranges.

To my understanding, Elasticsearch has a limit on the number of buckets, and it's not recommended to increase it to millions. I've read about the composite aggregation for pagination, but I don't think it is suitable for this scenario, since we need to calculate aggregate numbers and not return the buckets.

Below is a simplified version of our current query. We want to calculate the number of users that have between X1 and X2 monthly reports between two dates.

Bucket the reports by user id.
Use a bucket selector to select only the users which have between Y1 and Y2 reports (Y1 and Y2 are pre-calculated by the client; these are the totals that resolve to an average of between X1 and X2 monthly reports).
Count the number of buckets left.
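
The post does not say how the client derives Y1 and Y2. A minimal sketch of one way to do it, assuming a "month" is a fixed 30 days (the function name `report_count_bounds` and the `days_per_month` parameter are hypothetical, not from the original):

```python
from datetime import datetime
from typing import Tuple


def report_count_bounds(start: datetime, end: datetime,
                        x1: float, x2: float,
                        days_per_month: float = 30.0) -> Tuple[float, float]:
    """Convert a monthly-average range [x1, x2] into total-report bounds (y1, y2).

    Assumption: a month is `days_per_month` days long; the original post
    does not specify the convention used.
    """
    # Length of the interval expressed in (assumed) months.
    months = (end - start).total_seconds() / 86400 / days_per_month
    # A user averaging x reports per month has x * months reports in total.
    return x1 * months, x2 * months


# Usage: a 60-day interval counts as 2 months under the 30-day assumption.
y1, y2 = report_count_bounds(datetime(2020, 1, 1), datetime(2020, 3, 1), 10, 15)
```

The resulting bounds would then be substituted into the bucket selector script in place of the hard-coded numbers.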

The problem is that the original bucketing (terms aggregation) only returns a relatively small number of buckets (not millions), so only a small number of users will be evaluated at all. What would be the best way to achieve this?

POST /reports/_search
{
     "size": 0,
     "query": {
         "range": {
             "timestamp": {
                 "gte": "2020-01-01T00:00:00.000Z",
                 "lte": "2020-12-24T23:59:59.999Z",
                 "format": "strict_date_optional_time"
             }
         }
     },
     "aggs": {
         "distinctIds_less_than_monthly": {
             "terms": {
                 "field": "userId" // this will only return a small amount of buckets
             },
             "aggs": {
                 "less_than_monthly": {
                     "bucket_selector": {
                         "buckets_path": {
                             "distinctUsers": "distinctUsers_less_than_monthly.value"
                         },
                         "script": "params.distinctUsers > 1000 && params.distinctUsers < 1500"
                     }
                 },
                 "distinctUsers_less_than_monthly": {
                     "value_count": {
                         "field": "userId"
                     }
                 }                 
             }
         },
         "userCount_less_than_monthly": {
             "stats_bucket": {
                 "buckets_path": "distinctIds_less_than_monthly._count"
             }
         }
     }
}

Answer 1

I see essentially 3 optimizations, all of which share a map → combine approach:

Write a script in the language of your choice to split the 1Y range into months/weeks, run the queries, and combine the results.
Apply some sort of filter before you run the terms aggregations: calculate the user stats for your most valuable users first (pick them by revenue, daily active usage, etc.), and then for the rest. Then combine.
Pre-group the users by, say, their forename initials, and run the terms aggs within those groups. Then combine.
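
The third option can be sketched as follows. This is only an illustration under stated assumptions, not the answerer's exact method: instead of forename initials it groups users with the terms aggregation's built-in `include.partition` / `num_partitions` filtering, and it takes a hypothetical `search` callable (any function that sends a request body to the `reports` index and returns the parsed JSON response). The aggregation names `users` and `in_range` are made up for the example.

```python
from typing import Any, Callable, Dict


def partition_query(start: str, end: str, y1: float, y2: float,
                    partition: int, num_partitions: int,
                    size: int = 10000) -> Dict[str, Any]:
    """Build the request body for one partition of user ids (the "map" step)."""
    return {
        "size": 0,
        "query": {"range": {"timestamp": {"gte": start, "lte": end}}},
        "aggs": {
            "users": {
                "terms": {
                    "field": "userId",
                    # Only user ids hashing into this partition are bucketed.
                    "include": {"partition": partition,
                                "num_partitions": num_partitions},
                    # Assumption: size is at least the number of distinct
                    # users per partition, otherwise users are silently dropped.
                    "size": size,
                },
                "aggs": {
                    "in_range": {
                        "bucket_selector": {
                            # _count is the number of reports in the user bucket.
                            "buckets_path": {"n": "_count"},
                            "script": f"params.n >= {y1} && params.n < {y2}",
                        }
                    }
                },
            }
        },
    }


def count_users_in_range(search: Callable[[Dict[str, Any]], Dict[str, Any]],
                         start: str, end: str, y1: float, y2: float,
                         num_partitions: int) -> int:
    """Run one query per partition and sum the surviving buckets (the "combine" step)."""
    total = 0
    for p in range(num_partitions):
        resp = search(partition_query(start, end, y1, y2, p, num_partitions))
        # Buckets rejected by the bucket_selector are absent from the response.
        total += len(resp["aggregations"]["users"]["buckets"])
    return total
```

Choosing `num_partitions` so that each partition stays well below the cluster's bucket limit is left to the caller; a cardinality aggregation on `userId` could give a rough estimate of the total number of users to divide up.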

Solution 1
1 2020-12-26 23:33:33
