What is the best way to index aggregate data on ElasticSearch
I have users, and my users have events. Each event has a type and the date on which it happened.
For example:
{
  "id": 1,
  "name": "john",
  "events": [{
    "type": "logged_in",
    "date": "01/01/2016"
  }, {
    "type": "logged_in",
    "date": "02/01/2016"
  }, {
    "type": "added_email",
    "date": "02/05/2016"
  }]
}
Now the issue is that I would like to be able to query users who have done a specific event N times within a specific time frame.
For example: which users logged in more than twice between Jan 1, 2016 and Jan 20, 2017?
I know I can use aggregations, but the query gets complex and performance drops with millions of events.
I was wondering if there is a better way to index/query this data?
The obvious way of representing this data is with a nested mapping:
"id": {"type": "integer"},
"name": {"type": "keyword"},
"events": {
  "type": "nested",
  "properties": {
    "type": {"type": "keyword"},
    "date": {"type": "date"}
  }
}
I think this is what you are talking about when you mention performance issues (nested queries and aggregations are slow). For the kind of analysis you're describing, I don't think you can avoid using an aggregation, but I would "flatten" the data to avoid nested fields [1], with one document per event record instead, like this:
"id": {"type": "integer"},
"name": {"type": "keyword"},
"event_type": {"type": "keyword"},
"date": {"type": "date"}
And then do an aggregation like:
{
  "query": {"bool": {
    "filter": [
      {"match": {"event_type": "logged_in"}},
      {"range": {"date": {"gte": "2016-01-01", "lt": "2017-01-20"}}}
    ]
  }},
  "aggs": {
    "users": {
      "terms": {
        "field": "name",
        "size": 50
      }
    }
  }
}
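To actually keep only the users above a threshold ("more than twice"), one option is the `min_doc_count` parameter of the terms aggregation, which drops buckets with fewer matching documents. The sketch below only builds the request body as a Python dict; the function name `build_event_count_query`, the aggregation name `users`, and the top-level `"size": 0` (to skip returning hits) are my additions, and it uses a `term` filter rather than `match` since `event_type` is a keyword field.

```python
import json
from typing import Any, Dict


def build_event_count_query(event_type: str, start: str, end: str,
                            min_count: int, size: int = 50) -> Dict[str, Any]:
    """Build a search body that counts events per user in [start, end)."""
    return {
        "size": 0,  # only the aggregation is needed, not the hits
        "query": {"bool": {"filter": [
            {"term": {"event_type": event_type}},
            {"range": {"date": {"gte": start, "lt": end}}},
        ]}},
        "aggs": {
            "users": {  # hypothetical aggregation name
                "terms": {
                    "field": "name",
                    "size": size,
                    # keep only users with at least min_count matching events
                    "min_doc_count": min_count,
                }
            }
        },
    }


# "More than twice" means at least 3 matching events.
query = build_event_count_query("logged_in", "2016-01-01", "2017-01-20", min_count=3)
print(json.dumps(query, indent=2))
```

Each returned bucket's `doc_count` is then the number of matching events for that user. If user names are not unique, aggregating on `id` instead would be safer.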
You can also pre-aggregate some of your data in the index, if you know you'll never need more fine-grained analysis. For example:
"name": {"type": "keyword"},
"event_type": {"type": "keyword"},
"event_count": {"type": "integer"},
"date_bucket": {"type": "date"}
where date_bucket represents the beginning of the date bucket (for example, if you only care about full months, every event in January 2017 goes into the record for "2017-01-01"). You can use a scripted update with upsert to increment event_count if the document already exists, or create a new document if it doesn't. Then you can use a sum aggregation over event_count inside a terms aggregation instead. This really only makes sense if you know in advance which granularity you care about.
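A minimal sketch of how the pre-aggregated approach could look, assuming monthly buckets. It only builds request bodies as Python dicts: one for the Update API (Painless script plus `upsert`) and one for the terms + sum search. All function names, the document id scheme, and the aggregation names are hypothetical; the `bucket_selector` pipeline aggregation used to filter on the summed count is my addition, not part of the answer above.

```python
from datetime import date
from typing import Any, Dict


def month_bucket(day: date) -> str:
    """Return the first day of the month as an ISO date string."""
    return day.replace(day=1).isoformat()


def bucket_doc_id(name: str, event_type: str, bucket: str) -> str:
    """Deterministic id so repeated events hit the same document (assumed scheme)."""
    return f"{name}|{event_type}|{bucket}"


def build_upsert_body(name: str, event_type: str, day: date,
                      count: int = 1) -> Dict[str, Any]:
    """Update API body: increment event_count, or create the doc if missing."""
    return {
        "script": {
            "lang": "painless",
            "source": "ctx._source.event_count += params.count",
            "params": {"count": count},
        },
        "upsert": {  # used as the new document when none exists yet
            "name": name,
            "event_type": event_type,
            "event_count": count,
            "date_bucket": month_bucket(day),
        },
    }


def build_sum_query(event_type: str, start: str, end: str,
                    more_than: int, size: int = 50) -> Dict[str, Any]:
    """Search body: sum event_count per user, keep users with total > more_than."""
    return {
        "size": 0,
        "query": {"bool": {"filter": [
            {"term": {"event_type": event_type}},
            {"range": {"date_bucket": {"gte": start, "lt": end}}},
        ]}},
        "aggs": {"users": {
            "terms": {"field": "name", "size": size},
            "aggs": {
                "total_events": {"sum": {"field": "event_count"}},
                "enough_events": {"bucket_selector": {
                    "buckets_path": {"total": "total_events"},
                    "script": f"params.total > {more_than}",
                }},
            },
        }},
    }


body = build_upsert_body("john", "logged_in", date(2017, 1, 15))
doc_id = bucket_doc_id("john", "logged_in", body["upsert"]["date_bucket"])
sum_query = build_sum_query("logged_in", "2016-01-01", "2017-02-01", more_than=2)
```

With monthly buckets the time range can only be expressed in whole months, which is why the example range ends at "2017-02-01" rather than January 20. Also keep in mind that the terms `size` is applied before the `bucket_selector` filters buckets, so with many users a larger `size` may be needed.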
[1] If you also need to access the data in a different way, you might consider indexing into two indices, like two "views" on the data. Basically, unless you have infinite resources, a small dataset, or you don't care much about performance, you should work really hard to avoid nested fields.