How to paginate terms aggregation results in Elasticsearch

I've been trying to find a way to paginate the results of a terms aggregation in Elasticsearch, and so far I have not been able to achieve the desired result.

Here's the problem I am trying to solve. In my index, I have a set of documents, each with a score (separate from the ES _score) that is calculated from the values of the document's other fields. Each document "belongs" to a customer, referenced by the customer_id field. Each document also has an id, stored in the doc_id field, which is the same as the ES meta-field _id. Here is an example.

{
 '_id': '1',
 'doc_id': '1',
 'doc_score': '85',
 'customer_id': '123'
}

For each customer_id there are multiple documents, all with different document ids and different scores. Given a list of customer ids, I want to return the top document for each customer_id (only 1 per customer) and paginate those results, similar to the size and from parameters in the regular ES search API. The field I want to use for the document score is the doc_score field.

So far, in my current Python script, I've tried a nested aggs with a "top hits" aggregation to get only the top document for each customer.

{
 "size": 0,
 "query": {
  "bool": {
   "must": [
    {
     "match_all": {}
    },
    {
     "terms": {
      "customer_id": customer_ids # a list of the customer ids I want documents for
     }
    },
    {
     "exists": {
      "field": "doc_score" # sometimes it's possible a document does not have a score
     }
    }
   ]
  }
 },
 "aggs": {
  "customers": {
   "terms": {"field": "customer_id", "min_doc_count": 1},
   "aggs": {
    "top_documents": {
     "top_hits": {
      "sort": [
       {"doc_score": {"order": "desc"}}
      ],
      "size": 1
     }
    }
   }
  }
 }
}

I then "paginate" by going through each customer bucket, appending the top document blob to a list, sorting the list by score, and finally taking a slice documents_list[from:from+size].
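A minimal sketch of that client-side step, assuming the response has the usual shape for a terms aggregation named customers with a top_hits sub-aggregation named top_documents. The function name, its parameters, and the sample response are illustrative and are not the original script.

```python
def paginate_top_documents(response, offset, size, score_field="doc_score"):
    """Collect each customer's top document, sort by score, and slice one page."""
    buckets = response["aggregations"]["customers"]["buckets"]
    documents = []
    for bucket in buckets:
        hits = bucket["top_documents"]["hits"]["hits"]
        if hits:  # a bucket could in principle come back without hits
            documents.append(hits[0]["_source"])
    # Scores may be stored as strings (as in the example document), so cast them.
    documents.sort(key=lambda doc: float(doc[score_field]), reverse=True)
    return documents[offset:offset + size]


def _bucket(customer_id, doc_id, score):
    """Build one hand-written bucket in the shape Elasticsearch returns."""
    source = {"doc_id": doc_id, "doc_score": score, "customer_id": customer_id}
    return {
        "key": customer_id,
        "doc_count": 1,
        "top_documents": {"hits": {"hits": [{"_id": doc_id, "_source": source}]}},
    }


# Hypothetical response for three customers, written by hand for illustration.
sample_response = {
    "aggregations": {
        "customers": {
            "buckets": [
                _bucket("123", "1", "85"),
                _bucket("456", "7", "90"),
                _bucket("789", "9", "99"),
            ]
        }
    }
}

page = paginate_top_documents(sample_response, offset=1, size=2)
print([doc["doc_id"] for doc in page])  # ['7', '1']
```

Every bucket has to be fetched and sorted before the slice can be taken, regardless of which page is requested.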

The issue with this is that, say I have 500 customers in the list but I only want the 2nd 20 documents, i.e. size=20, from=20. Each time I call the function, I first have to get the top document for each of the 500 customers and then slice. This is inefficient and slow, and I need the function to be as fast as I can possibly make it.

Ideally, I could just get the 2nd 20 directly from ES without having to do any slicing in my function.

I have looked into the composite aggregation that ES offers, but it looks to me like I would not be able to use it in my case, since I need to get the entire doc, i.e. everything in the _source field in the regular search API response.

I would greatly appreciate any suggestions.

The best way to do this would be to use partitions.

According to the documentation:

GET /_search
{
   "size": 0,
   "aggs": {
      "expired_sessions": {
         "terms": {
            "field": "account_id",
            "include": {
               "partition": 1,
               "num_partitions": 25
            },
            "size": 20,
            "order": {
               "last_access": "asc"
            }
         },
         "aggs": {
            "last_access": {
               "max": {
                  "field": "access_date"
               }
            }
         }
      }
   }
}

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-aggregations-bucket-terms-aggregation.html#_filtering_values_with_partitions
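Applied to the question, the same idea could be sketched as a request body built in Python. This is only an illustration under a few assumptions: build_partition_query and the best_score aggregation name are made up here, doc_score is assumed to be mapped as a numeric field so that max and sort work on it, and the body is only constructed, not sent to a cluster. Partitions are assigned by hashing the term, so each one holds roughly, not exactly, the same number of customers, and buckets are ordered within a partition rather than across all of them.

```python
import math


def build_partition_query(customer_ids, partition, num_partitions, size,
                          score_field="doc_score"):
    """Build a search body returning the top document per customer for one partition."""
    if not 0 <= partition < num_partitions:
        raise ValueError("partition must be in the range [0, num_partitions)")
    return {
        "size": 0,
        "query": {
            "bool": {
                # Filter context: restricts the documents without computing _score.
                "filter": [
                    {"terms": {"customer_id": customer_ids}},
                    {"exists": {"field": score_field}},
                ]
            }
        },
        "aggs": {
            "customers": {
                "terms": {
                    "field": "customer_id",
                    "include": {
                        "partition": partition,
                        "num_partitions": num_partitions,
                    },
                    "size": size,
                    # Order the buckets of this partition by their best score.
                    "order": {"best_score": "desc"},
                },
                "aggs": {
                    "best_score": {"max": {"field": score_field}},
                    "top_documents": {
                        "top_hits": {
                            "sort": [{score_field: {"order": "desc"}}],
                            "size": 1,
                        }
                    },
                },
            }
        },
    }


# 500 customers at about 20 per request gives 25 partitions.
customer_ids = [str(i) for i in range(500)]
num_partitions = math.ceil(len(customer_ids) / 20)
body = build_partition_query(customer_ids, partition=1,
                             num_partitions=num_partitions, size=20)
```

Partition numbers start at 0, so partition=1 is the second chunk. Because the hashing is uneven, a partition can contain more than size customers; if buckets appear to be missing, raise size or num_partitions.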
