Elasticsearch paginating a sorted, aggregated result
As far as I'm aware, there isn't a way to do something like the following in Elasticsearch:
SELECT * FROM myindex
GROUP BY agg_field1, agg_field2, agg_field3  -- aggregation
ORDER BY order_field1, order_field2, order_field3  -- sort
LIMIT 1000 OFFSET 5000  -- paginate: get page 6 of 1000 records
Here are some related documents regarding this:
Is there a way to do the above in Elasticsearch? The one limitation we have is that we will never have more than 10M records, so we (hopefully) shouldn't run into memory errors.
What would be the best way to accomplish this? In your answer/suggestion, could you please post some sample code relating to how the above SQL query could be done in ES?
As an update to this question, here is a public index to test with:
from elasticsearch import Elasticsearch

# 5.6
e = Elasticsearch('https://search-testinges-fekocjpedql2f3rneuagyukvy4.us-west-1.es.amazonaws.com')
e.search(index='testindex')

# 6.4 (same data as above)
e = Elasticsearch('https://search-testinges6-fycj5kjd7l5uyo6npycuashch4.us-west-1.es.amazonaws.com')
e.search(index='testindex6')
It has 10,000 records. Feel free to test with it.
The query that I'm looking to do is as follows (in SQL):
SELECT * FROM testindex
GROUP BY store_url, status, title
ORDER BY title ASC, status DESC
LIMIT 100 OFFSET 6000
In other words, I'm looking to sort an aggregated result (with multiple aggregations) and get an offset.
The `composite` aggregation might help here, as it allows you to group by multiple fields and then paginate over the results. The only thing it doesn't let you do is jump to a given offset, but you can do that by iterating from your client code if necessary.
So here is a sample query to do that:
POST testindex6/_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}
In the response you'll see an `after_key` structure:
"after_key": {
  "store": "http://google.com1087",
  "status": "OK1087",
  "title": "Titanic1087"
},
It's a kind of cursor that you need to use in your subsequent queries, like this:
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ],
        "after": {
          "store": "http://google.com1087",
          "status": "OK1087",
          "title": "Titanic1087"
        }
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}
And it will give you the next 100 buckets. Hopefully this helps.
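Since the `composite` aggregation has no offset support, landing on something like `LIMIT 100 OFFSET 6000` means walking the `after_key` cursor page by page from the client. Below is a minimal Python sketch of that loop. Note that `fetch_page` is a hypothetical callable, not part of the Elasticsearch API: you would implement it to run the query above (passing `after_key` as the `after` parameter when set) and return the `my_buckets` aggregation from the response.

```python
def paginate_to_offset(fetch_page, offset):
    """Skip `offset` composite buckets, then return the remainder of the
    page containing that offset.

    `fetch_page(after_key)` must run the composite query (with `after`
    set to `after_key` when it is not None) and return the `my_buckets`
    aggregation: a dict with "buckets" and, while more pages remain,
    "after_key".
    """
    skipped = 0
    after_key = None
    while True:
        agg = fetch_page(after_key)
        buckets = agg.get("buckets", [])
        if not buckets:
            return []  # ran out of buckets before reaching the offset
        if skipped + len(buckets) > offset:
            # this page contains the requested offset
            return buckets[offset - skipped:]
        skipped += len(buckets)
        after_key = agg.get("after_key")
        if after_key is None:
            return []  # last page reached, offset is out of range
```

With the official Python client, `fetch_page` would call `es.search(...)` with the body shown above and read `response['aggregations']['my_buckets']`.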
UPDATE:
If you want to know how many buckets there are going to be in total, the `composite` aggregation won't give you that number. However, since the `composite` aggregation is nothing else than a cartesian product of all the fields in its sources, you can get a good approximation of that total by also returning the [cardinality](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) of each field used in the `composite` aggregation and multiplying them together:
"aggs": {
  "my_buckets": {
    "composite": {
      ...
    }
  },
  "store_cardinality": {
    "cardinality": {
      "field": "store_url"
    }
  },
  "status_cardinality": {
    "cardinality": {
      "field": "status"
    }
  },
  "title_cardinality": {
    "cardinality": {
      "field": "title"
    }
  }
}
We can then get the total number of buckets, or at least a good approximation thereof, by multiplying the figures we get in `store_cardinality`, `status_cardinality` and `title_cardinality` together (it won't work well on high-cardinality fields, but pretty well on low-cardinality ones).
Field collapsing is the answer.
The field collapsing feature is used when we want to group the hits on a specific field (as in GROUP BY agg_field).
Before Elasticsearch 6, the way to group fields was to use aggregations, and that approach lacked the ability to do efficient paging.
But now, with field collapsing provided out of the box by Elasticsearch, it is pretty easy.
Below is a sample query using field collapsing, taken from the documentation.
GET /twitter/_search
{
  "query": {
    "match": {
      "message": "elasticsearch"
    }
  },
  "collapse": {
    "field": "user",
    "inner_hits": {
      "name": "last_tweets",
      "size": 5,
      "sort": [{ "date": "asc" }]
    },
    "max_concurrent_group_searches": 4
  },
  "sort": ["likes"]
}
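With the Python client, a collapsed query can be paginated with plain `from`/`size`, which is the efficiency win over aggregations. Here is a sketch against the `testindex6` index from the question; the `collapse_query` helper is made up for illustration. Two caveats: `collapse` groups on a single field only (unlike the three-field GROUP BY), and `from + size` is capped by `index.max_result_window` (10,000 by default).

```python
def collapse_query(field, sort, offset, limit):
    """Build a search body that returns one hit per distinct `field`
    value, sorted and paginated with plain from/size."""
    return {
        "from": offset,                   # OFFSET
        "size": limit,                    # LIMIT
        "collapse": {"field": field},     # one hit per distinct value
        "sort": sort,
    }

body = collapse_query("store_url", [{"title": "asc"}, {"status": "desc"}], 6000, 100)
# with a live cluster:
# resp = es.search(index="testindex6", body=body)
```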