
Elasticsearch paginating a sorted, aggregated result

As far as I'm aware, there isn't a way to do something like the following in Elasticsearch:

SELECT * FROM myindex
GROUP BY agg_field1, agg_field2, agg_field3 // aggregation
ORDER BY order_field1, order_field2, order_field3 // sort
LIMIT 5000, 1000 // paginate -- get page 6 of size 1000 records

Here are some related documents regarding this:

Is there a way to do the above in Elasticsearch? The one limitation we have is that we will never have more than 10M records, so we (hopefully) shouldn't run into memory errors. My thinking was to do it as follows:

  • Do an aggregation query
  • Get the number of results from it
  • Split it into N segments based on the results and the page size we want
  • Rerun the query with the above segments

What would be the best way to accomplish this? In your answer/suggestion, could you please post some sample code showing how the above SQL query could be done in ES?


As an update to this question, here is a public index to test with:

# 5.6
e = Elasticsearch('https://search-testinges-fekocjpedql2f3rneuagyukvy4.us-west-1.es.amazonaws.com')
e.search(index='testindex')

# 6.4 (same data as above)
e = Elasticsearch('https://search-testinges6-fycj5kjd7l5uyo6npycuashch4.us-west-1.es.amazonaws.com')
e.search(index='testindex6')

It has 10,000 records. Feel free to test with it:


The query that I'm looking to run is as follows (in SQL):

SELECT * FROM testindex
GROUP BY store_url, status, title
ORDER BY title ASC, status DESC
LIMIT 100 OFFSET 6000

In other words, I'm looking to sort an aggregated result (with multiple aggregations) and get an offset.

The composite aggregation might help here, as it allows you to group by multiple fields and then paginate over the results. The only thing it doesn't let you do is jump to a given offset, but you can do that by iterating from your client code if necessary.

So here is a sample query to do that:

POST testindex6/_search
{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

In the response you'll see an after_key structure:

  "after_key": {
    "store": "http://google.com1087",
    "status": "OK1087",
    "title": "Titanic1087"
  },

It acts as a cursor that you need to pass in your subsequent queries, like this:

{
  "size": 0,
  "aggs": {
    "my_buckets": {
      "composite": {
        "size": 100,
        "sources": [
          {
            "store": {
              "terms": {
                "field": "store_url"
              }
            }
          },
          {
            "status": {
              "terms": {
                "field": "status",
                "order": "desc"
              }
            }
          },
          {
            "title": {
              "terms": {
                "field": "title",
                "order": "asc"
              }
            }
          }
        ],
        "after": {
          "store": "http://google.com1087",
          "status": "OK1087",
          "title": "Titanic1087"
        }
      },
      "aggs": {
        "hits": {
          "top_hits": {
            "size": 100
          }
        }
      }
    }
  }
}

And it will give you the next 100 buckets. Hopefully this helps.
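Since the composite aggregation only pages forward with this cursor, reaching the equivalent of `LIMIT 100 OFFSET 6000` means walking pages from the client until enough buckets have been skipped. A minimal sketch of that loop (the `composite_page` helper name is mine; `es` is any client exposing `search(index=..., body=...)`, such as elasticsearch-py's `Elasticsearch`):

```python
def composite_page(es, index, offset, page_size=100):
    """Walk composite-aggregation pages, skipping `offset` buckets,
    then return the buckets for the requested page."""
    sources = [
        {"store": {"terms": {"field": "store_url"}}},
        {"status": {"terms": {"field": "status", "order": "desc"}}},
        {"title": {"terms": {"field": "title", "order": "asc"}}},
    ]
    composite = {"size": page_size, "sources": sources}
    skipped = 0
    while True:
        resp = es.search(index=index, body={
            "size": 0,
            "aggs": {"my_buckets": {"composite": composite}},
        })
        agg = resp["aggregations"]["my_buckets"]
        buckets = agg["buckets"]
        if skipped + len(buckets) > offset:
            # the requested page starts inside this batch
            return buckets[offset - skipped:]
        skipped += len(buckets)
        after = agg.get("after_key")
        if after is None:  # ran out of buckets before reaching the offset
            return []
        composite = dict(composite, after=after)

# with a live client: page = composite_page(es, "testindex6", offset=6000)
```

Note that this still makes the cluster compute every page up to the offset, so deep offsets stay expensive; that cost is inherent to cursor-style paging.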

UPDATE:

If you want to know how many buckets there will be in total, the composite aggregation won't give you that number. However, since the composite aggregation is nothing else than a cartesian product of all the fields in its sources, you can get a good approximation of that total by also returning the [cardinality](https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-cardinality-aggregation.html) of each field used in the composite aggregation and multiplying them together.

  "aggs": {
    "my_buckets": {
      "composite": {
        ...
      }
    },
    "store_cardinality": {
      "cardinality": {
        "field": "store_url"
      }
    },
    "status_cardinality": {
      "cardinality": {
        "field": "status"
      }
    },
    "title_cardinality": {
      "cardinality": {
        "field": "title"
      }
    }
  }

We can then get the total number of buckets, or at least a good approximation thereof, by multiplying the figures we get in store_cardinality, status_cardinality and title_cardinality together (this won't work well on high-cardinality fields, but works pretty well on low-cardinality ones).
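Reading those three values out of the response and multiplying them is straightforward client-side. A sketch (the agg names match the request above; treat the product as a rough upper bound, both because cardinality is itself approximate and because not every combination of values necessarily occurs in the data):

```python
def approx_total_buckets(resp, agg_names=("store_cardinality",
                                          "status_cardinality",
                                          "title_cardinality")):
    """Multiply the cardinality sub-aggregation values together to
    estimate the total number of composite buckets."""
    total = 1
    for name in agg_names:
        total *= resp["aggregations"][name]["value"]
    return total
```

For example, with 5 distinct stores, 4 statuses and 300 titles the estimate is 6000 buckets, i.e. 60 pages of size 100.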

Field collapsing is the answer.

The field collapsing feature is used when we want to group the hits on a specific field (as in GROUP BY agg_field).

Before Elastic 6, the way to group fields was to use an aggregation. That approach lacked the ability to do efficient paging.

But now, with field collapsing provided out of the box by Elasticsearch, it is pretty easy.

Below is a sample query using field collapsing, taken from the Elasticsearch documentation.

GET /twitter/_search
{
  "query": {
      "match": {
          "message": "elasticsearch"
      }
  },
  "collapse" : {
      "field" : "user", 
      "inner_hits": {
          "name": "last_tweets", 
          "size": 5, 
          "sort": [{ "date": "asc" }] 
      },
      "max_concurrent_group_searches": 4 
  },
  "sort": ["likes"]

}
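Unlike the composite aggregation, collapse works with the normal from/size pagination, so the original LIMIT 100 OFFSET 6000 maps onto the request directly. A sketch against the test index (one caveat, and why this is only a partial answer: collapse groups on a single field, so this collapses on store_url alone rather than replicating the full three-field GROUP BY):

```
POST testindex6/_search
{
  "query": { "match_all": {} },
  "collapse": { "field": "store_url" },
  "sort": [
    { "title": "asc" },
    { "status": "desc" }
  ],
  "from": 6000,
  "size": 100
}
```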


 