What is the best way to index aggregate data on ElasticSearch?

Question

I have users, and my users have events. Each event has a type and the date on which it happened.

For example:

{
  "id": 1,
  "name": "john",
  "events": [{
    "type": "logged_in",
    "date": "01/01/2016"
  }, {
    "type": "logged_in",
    "date": "02/01/2016"
  }, {
    "type": "added_email",
    "date": "02/05/2016"
  }]
}

Now the issue is that I would like to be able to query users that have done a specific event N times within a specific time frame.

For example: which users logged in more than twice between Jan 1, 2016 and Jan 20, 2017?

I know I can use aggregations, but the query gets complex and performance drops on millions of events.

I was wondering if there is a better way to index/query this data?

Answer 1

The obvious way of representing this data is with a nested mapping:

"id": {"type": "integer"},
"name": {"type": "keyword"},
"events": {
  "type": "nested",
  "properties": {
    "type": {"type": "keyword"},
    "date": {"type": "date"}
  }
}

I think this is what you mean when you mention performance issues (nested queries and aggregations are slow). For the kind of analysis you're describing, I don't think you can avoid using an aggregation, but I would "flatten" the data to avoid nested fields[1], with one document per event instead, like this:

"id": {"type": "integer"},
"name": {"type": "keyword"},
"event_type": {"type": "keyword"},
"date": {"type": "date"}

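As a minimal sketch of that flattening step, assuming the nested user document from the question: the helper name flatten_user is hypothetical, and the date strings are passed through untouched because the question does not say which format they use.

```python
from typing import Any, Dict, Iterator


def flatten_user(user: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one flat document per event of a nested user document."""
    for event in user.get("events", []):
        yield {
            "id": user["id"],
            "name": user["name"],
            "event_type": event["type"],
            # Copied as-is; convert to a format the date mapping accepts
            # before indexing.
            "date": event["date"],
        }


user = {
    "id": 1,
    "name": "john",
    "events": [
        {"type": "logged_in", "date": "01/01/2016"},
        {"type": "logged_in", "date": "02/01/2016"},
        {"type": "added_email", "date": "02/05/2016"},
    ],
}

docs = list(flatten_user(user))
for doc in docs:
    print(doc)
```

Note that "id" here is still the user's id, so it repeats across documents and is not suitable as the document's own identifier in the index.
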
And then do an aggregation like:

{
  "query": {"bool": {
    "filter": [
      {"match": {"event_type": "logged_in"}},
      {"range": {"date": {"gte": "2016-01-01", "lt": "2017-01-20"}}}
    ]
  }},
  "aggs": {
    "users": {
      "terms": {
        "field": "name",
        "size": 50
      }
    }
  }
}
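
To answer the "more than twice" part of the question, one option is the terms aggregation's min_doc_count parameter, which drops buckets with fewer matching documents. The sketch below only builds the request body as a Python dict; the function name build_event_count_query, the aggregation name "users", and the use of a term filter and "size": 0 (to skip returning hits) are my assumptions, not part of the original answer.

```python
import json
from typing import Any, Dict


def build_event_count_query(event_type: str, start: str, end: str,
                            min_count: int, size: int = 50) -> Dict[str, Any]:
    """Build a search body listing users with at least min_count events."""
    return {
        "size": 0,  # only the aggregation result is needed, not the hits
        "query": {"bool": {"filter": [
            {"term": {"event_type": event_type}},
            {"range": {"date": {"gte": start, "lt": end}}},
        ]}},
        "aggs": {
            "users": {
                "terms": {
                    "field": "name",
                    "size": size,
                    # "more than twice" means at least 3 matching documents
                    "min_doc_count": min_count,
                }
            }
        },
    }


body = build_event_count_query("logged_in", "2016-01-01", "2017-01-20", min_count=3)
print(json.dumps(body, indent=2))
```

The terms aggregation still returns at most "size" buckets, so with many qualifying users the limit would need to be raised or the results paged some other way.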

You can also pre-aggregate some of your data at index time, if you know you'll never need more fine-grained analysis. For example:

"name": {"type": "keyword"},
"event_type": {"type": "keyword"},
"event_count": {"type": "integer"},
"date_bucket": {"type": "date"}

where date_bucket represents the beginning of the date bucket (for example, if you only care about full months, then every event for January 2017 goes into the record for "2017-01-01"). You can use a scripted update with upsert to increment event_count if the document already exists, or create a new document if it doesn't. Then you can use a sum aggregation over event_count inside a terms aggregation instead. This really only makes sense if you know in advance which granularity you care about.
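
A rough sketch of both halves of this idea, under a few assumptions of mine: monthly buckets, a document id built from name, event type and bucket, and a bucket_selector pipeline aggregation to keep only users above a threshold. The function names and the aggregation names are hypothetical; the code only builds request bodies and does not talk to a cluster.

```python
from datetime import date
from typing import Any, Dict, Tuple


def month_bucket(day: date) -> str:
    """Return the first day of the month as an ISO date string."""
    return day.replace(day=1).isoformat()


def build_upsert(name: str, event_type: str, day: date,
                 count: int = 1) -> Tuple[str, Dict[str, Any]]:
    """Build (doc id, update body) that increments or creates a counter doc."""
    bucket = month_bucket(day)
    doc_id = f"{name}|{event_type}|{bucket}"  # assumed id scheme
    body = {
        # Runs when the document already exists.
        "script": {
            "lang": "painless",
            "source": "ctx._source.event_count += params.count",
            "params": {"count": count},
        },
        # Indexed as-is when the document does not exist yet.
        "upsert": {
            "name": name,
            "event_type": event_type,
            "event_count": count,
            "date_bucket": bucket,
        },
    }
    return doc_id, body


def build_sum_query(event_type: str, start: str, end: str,
                    more_than: int) -> Dict[str, Any]:
    """Sum event_count per user and keep users whose total exceeds more_than."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"event_type": event_type}},
            {"range": {"date_bucket": {"gte": start, "lt": end}}},
        ]}},
        "aggs": {"users": {
            "terms": {"field": "name", "size": 50},
            "aggs": {
                "total_events": {"sum": {"field": "event_count"}},
                "enough_events": {"bucket_selector": {
                    "buckets_path": {"total": "total_events"},
                    "script": f"params.total > {int(more_than)}",
                }},
            },
        }},
    }


doc_id, update_body = build_upsert("john", "logged_in", date(2017, 1, 15))
query = build_sum_query("logged_in", "2016-01-01", "2017-02-01", more_than=2)
```

With monthly buckets the range filter can only cut on month boundaries, which is the granularity trade-off described above. The selector also only filters the buckets the terms aggregation returned, so the "size" limit still applies.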

[1] If you also need to access the data in a different way, you might consider indexing into two indices, like two "views" on the data. Basically, unless you have infinite resources, a small dataset, or you don't care much about performance, you should work really hard to avoid nested fields.

Answer 1 (1, accepted, 2017-08-26 12:24:04)