Elasticsearch - sub aggregation with a large number of buckets

We have an index with a large number of user reports (millions to tens of millions). Assuming most users have reports, we need to calculate some statistics per user. For example: how many users have an average of 10-15, 15-20, or 20-30 reports per week within a specific time interval. Note that we don't need to return the buckets themselves in the response, but they do need to be evaluated by the sub-aggregations that calculate the averages and ranges. To my understanding, Elasticsearch has a limit on the number of buckets per response (the search.max_buckets setting), and it's not recommended to raise it into the millions. I've read about the composite aggregation for pagination, but I don't think it is suitable for this scenario, since we need to calculate aggregate numbers rather than return the buckets.

Below is a simplified version of our current query. We want to calculate the number of users that have between X1 and X2 monthly reports between two dates.

  1. Bucket the reports by user ID.
  2. Use a bucket selector to keep only the users that have between Y1 and Y2 reports (Y1 and Y2 are pre-calculated by the client; they are the report counts that correspond to an average of X1-X2 monthly reports).
  3. Count the number of buckets left.

The problem is that the original bucketing (terms aggregation) only returns a relatively small number of buckets (10 by default, nowhere near millions), so only a small fraction of the users are evaluated at all. What would be the best way to achieve this?

POST /reports/_search
{
  "size": 0,
  "query": {
    "range": {
      "timestamp": {
        "gte": "2020-01-01T00:00:00.000Z",
        "lte": "2020-12-24T23:59:59.999Z",
        "format": "strict_date_optional_time"
      }
    }
  },
  "aggs": {
    "distinctIds_less_than_monthly": {
      "terms": {
        "field": "userId" // this will only return a small number of buckets
      },
      "aggs": {
        "less_than_monthly": {
          // keep only the user buckets whose report count is between Y1 and Y2
          "bucket_selector": {
            "buckets_path": {
              "distinctUsers": "distinctUsers_less_than_monthly.value"
            },
            "script": "params.distinctUsers > 1000 && params.distinctUsers < 1500"
          }
        },
        "distinctUsers_less_than_monthly": {
          // counts the reports inside each user bucket
          "value_count": {
            "field": "userId"
          }
        }
      }
    },
    "userCount_less_than_monthly": {
      // the "count" of this stats_bucket is the number of users that survived the selector
      "stats_bucket": {
        "buckets_path": "distinctIds_less_than_monthly._count"
      }
    }
  }
}

I see essentially 3 optimizations, all of which share a map → combine approach:

  1. Write a script in the language of your choice to split the one-year range into months/weeks, run the query per interval, and combine the results.
  2. Apply some sort of filter before you run the terms aggregation: calculate the user stats for your most valuable users first (picked by revenue, daily active usage, etc.), then for the rest, and combine.
  3. Pre-group the users into disjoint subsets (say, by forename initials), run the terms aggregations within those groups, and combine the results; a sketch of this approach follows below.
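
Below is a minimal sketch of the third idea, swapping forename initials for the terms aggregation's built-in partitioning ("include": {"partition": ..., "num_partitions": ...}), so that every user ID lands in exactly one partition. Each request only materializes the buckets of a single partition, keeping the per-request bucket count small, and the surviving-user counts are summed client-side. It reuses the bucket_selector + stats_bucket pattern from the question; the URL, NUM_PARTITIONS, size, and the Y1/Y2 thresholds are illustrative placeholders you would need to adapt, and you should confirm that your Elasticsearch version supports terms partitioning.

import requests

ES_URL = "http://localhost:9200/reports/_search"  # assumption: local cluster, index named "reports"
NUM_PARTITIONS = 50   # tune so each partition stays well below the bucket limit
Y1, Y2 = 1000, 1500   # pre-calculated report-count thresholds for the target monthly average

def count_users_in_partition(partition: int) -> int:
    """Count the users in one partition whose report count falls between Y1 and Y2."""
    query = {
        "size": 0,
        "query": {
            "range": {
                "timestamp": {
                    "gte": "2020-01-01T00:00:00.000Z",
                    "lte": "2020-12-24T23:59:59.999Z",
                    "format": "strict_date_optional_time"
                }
            }
        },
        "aggs": {
            "users": {
                "terms": {
                    "field": "userId",
                    # only the user IDs hashed into this partition are bucketed
                    "include": {"partition": partition, "num_partitions": NUM_PARTITIONS},
                    "size": 100000  # must cover the largest partition
                },
                "aggs": {
                    "report_count": {"value_count": {"field": "userId"}},
                    "in_range": {
                        "bucket_selector": {
                            "buckets_path": {"reports": "report_count.value"},
                            "script": f"params.reports > {Y1} && params.reports < {Y2}"
                        }
                    }
                }
            },
            "surviving_users": {
                # "count" of this stats_bucket = number of user buckets left after the selector
                "stats_bucket": {"buckets_path": "users._count"}
            }
        }
    }
    response = requests.post(ES_URL, json=query)
    response.raise_for_status()
    return response.json()["aggregations"]["surviving_users"]["count"]

total = sum(count_users_in_partition(p) for p in range(NUM_PARTITIONS))
print(f"users with between {Y1} and {Y2} reports in the range: {total}")

The same loop structure can serve option 1 as well: instead of iterating over partitions, iterate over monthly sub-ranges of the timestamp filter, but in that case the per-user counts have to be merged client-side before applying the Y1/Y2 thresholds.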
