
What is the best way to index aggregate data on ElasticSearch

I have users, and my users have events. Each event has a type and the date on which the event happened.

For example:

{
  "id": 1,
  "name": "john",
  "events": [{
    "type": "logged_in",
    "date": "01/01/2016"
  }, {
    "type": "logged_in",
    "date": "02/01/2016"
  }, {
    "type": "added_email",
    "date": "02/05/2016"
  }]
}

Now the issue is that I would like to be able to query users who have done a specific event N times within a specific time frame.

For example: which users logged in more than twice between Jan 1, 2016 and Jan 20, 2017?

I know I can use aggregations, but the query gets complex and performance drops with millions of events.

I was wondering if there is a better way to index/query this data?

The obvious way of representing this data is with a nested mapping:

"id": {"type": "integer"},
"name": {"type": "keyword"},
"events": {
  "type": "nested",
  "properties": {
    "type": {"type": "keyword"},
    "date": {"type": "date"}
  }
}

I think this is what you are talking about when you mention performance issues (nested queries and aggregations are slow). For the kind of analysis you're describing, I don't think you can avoid using an aggregation, but I would "flatten" the data to avoid using nested fields[1], with one document per event instead, like this:

"id": {"type": "integer"},
"name": {"type": "keyword"},
"event_type": {"type": "keyword"},
"date": {"type": "date"}

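As a rough illustration of the flattening step, the sketch below turns one nested user document into one flat document per event, matching the flat mapping above. The function name `flatten_user` is hypothetical, and the date parsing assumes the sample dates are MM/DD/YYYY, which the question does not actually state.

```python
from datetime import datetime


def flatten_user(user):
    """Turn one nested user document into one flat document per event."""
    docs = []
    for event in user.get("events", []):
        # Assumption: source dates such as "02/05/2016" are MM/DD/YYYY.
        # They are converted to ISO 8601, which a "date" field accepts by default.
        iso_date = datetime.strptime(event["date"], "%m/%d/%Y").date().isoformat()
        docs.append({
            "id": user["id"],
            "name": user["name"],
            "event_type": event["type"],
            "date": iso_date,
        })
    return docs


user = {
    "id": 1,
    "name": "john",
    "events": [
        {"type": "logged_in", "date": "01/01/2016"},
        {"type": "logged_in", "date": "02/01/2016"},
        {"type": "added_email", "date": "02/05/2016"},
    ],
}

flat_docs = flatten_user(user)
for doc in flat_docs:
    print(doc)
```

Each resulting dictionary would then be indexed as its own document, so the user fields are repeated on every event.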
And then do an aggregation like:

{
  "query": {
    "bool": {
      "filter": [
        {"match": {"event_type": "logged_in"}},
        {"range": {"date": {"gte": "2016-01-01", "lt": "2017-01-20"}}}
      ]
    }
  },
  "aggs": {
    "users": {
      "terms": {
        "field": "name",
        "size": 50
      }
    }
  }
}
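To get from per-user counts to "more than twice", one option is the `min_doc_count` parameter of the terms aggregation, which drops buckets with fewer matching documents. The sketch below only builds the search request body as a Python dictionary. The helper name `build_event_count_query` and the aggregation name `users` are made up for illustration, and it assumes the flat one-document-per-event mapping described above.

```python
import json


def build_event_count_query(event_type, start, end, min_count, max_users=50):
    """Build a search body that counts matching events per user."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "bool": {
                "filter": [
                    {"term": {"event_type": event_type}},
                    {"range": {"date": {"gte": start, "lt": end}}},
                ]
            }
        },
        "aggs": {
            "users": {  # hypothetical aggregation name
                "terms": {
                    "field": "name",
                    "size": max_users,
                    # keep only users with at least this many matching events
                    "min_doc_count": min_count,
                }
            }
        },
    }


# "More than twice" means at least 3 matching events.
body = build_event_count_query("logged_in", "2016-01-01", "2017-01-20", min_count=3)
print(json.dumps(body, indent=2))
```

If names are not unique across users, aggregating on `id` instead of `name` would probably be the safer choice.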

You can also pre-aggregate your data in the index, if you know you'll never need more fine-grained analysis. For example:

"name": {"type": "keyword"},
"event_type": {"type": "keyword"},
"event_count": {"type": "integer"},
"date_bucket": {"type": "date"}

where date_bucket represents the beginning of the date bucket (for example, if you only care about full months, then every event in January 2017 goes into the record for "2017-01-01"). You can use a scripted update with upsert to increment event_count if the document already exists, or create a new document if it doesn't. Then you can use a sum aggregation over event_count inside a terms aggregation instead. This really only makes sense if you know in advance which granularity you care about.
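A minimal sketch of both halves of this idea, assuming monthly buckets: an update body that uses a Painless script with `upsert`, and a query that sums event_count per user and filters on the total with a `bucket_selector` pipeline aggregation. The helper names, the aggregation names, and the document ID scheme (name, event type and bucket joined with `|`) are all hypothetical. The code only builds request bodies and does not talk to a cluster.

```python
from datetime import date


def month_bucket(day):
    """Return the first day of the month containing `day`, as an ISO string."""
    return day.replace(day=1).isoformat()


def build_upsert(name, event_type, day):
    """Build (doc_id, body) for an update request that counts one event."""
    bucket = month_bucket(day)
    doc_id = f"{name}|{event_type}|{bucket}"  # hypothetical ID scheme
    body = {
        # Runs when the document already exists: increment the counter.
        "script": {
            "lang": "painless",
            "source": "ctx._source.event_count += params.count",
            "params": {"count": 1},
        },
        # Indexed as-is when the document does not exist yet.
        "upsert": {
            "name": name,
            "event_type": event_type,
            "event_count": 1,
            "date_bucket": bucket,
        },
    }
    return doc_id, body


def build_sum_query(event_type, start, end, more_than, max_users=50):
    """Sum event_count per user and keep users whose total exceeds `more_than`."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"event_type": event_type}},
            {"range": {"date_bucket": {"gte": start, "lt": end}}},
        ]}},
        "aggs": {"users": {
            "terms": {"field": "name", "size": max_users},
            "aggs": {
                "total_events": {"sum": {"field": "event_count"}},
                "enough_events": {"bucket_selector": {
                    "buckets_path": {"total": "total_events"},
                    "script": f"params.total > {int(more_than)}",
                }},
            },
        }},
    }


doc_id, update_body = build_upsert("john", "logged_in", date(2016, 1, 17))
query = build_sum_query("logged_in", "2016-01-01", "2017-02-01", more_than=2)
print(doc_id)
```

Because the data is stored per month, the range boundaries only make sense at month granularity, which is why the example ends at "2017-02-01" rather than "2017-01-20".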

[1] If you also need to access the data in a different way, you might consider indexing into two indices, like two "views" on the data. Basically, unless you have infinite resources, a small dataset, or you don't care much about performance, you should work really hard to avoid nested fields.

