
What is the best approach for Elasticsearch pagination?

What is the best way to do pagination using Elasticsearch? Currently, I am working on an API that uses Elasticsearch in the backend with Python, and my index does not have much data, so by default we are doing the pagination in the frontend using JavaScript (and so far, we have not had any problems).

I want to know, for bigger indexes, what is the best way to handle pagination:

The default way of paginating over search results in Elasticsearch is using the from / size parameters. This will, however, work only for the top 10k search results.
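For illustration, here is a minimal from/size request with the official Python client; the index name my-index and the match_all query are placeholders, not anything from your setup:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    page, per_page = 3, 20
    resp = es.search(
        index="my-index",
        body={
            "from": (page - 1) * per_page,  # skip the first two pages (40 hits)
            "size": per_page,
            "query": {"match_all": {}},
        },
    )
    for hit in resp["hits"]["hits"]:
        print(hit["_id"])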

In case you need to go above that, the way to go is search_after.

In case you need to dump the entire index, and it contains more than 10k documents, use the scroll API.

What's the difference?

All of these approaches let you retrieve portions of the search results, but they have major differences.

from/size is the cheapest and fastest; it is what Google would use to go to the second, third, etc. search results pages if it used Elasticsearch.

Scroll API is expensive, because it creates a kind of snapshot of the index the moment you create the first query, to make sure that by the end of the scroll you will have exactly the data that was present in the index at the start. Doing a scroll request costs resources, and running many of them in parallel can kill your performance, so proceed with caution.
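A minimal scroll sketch with the Python client, using the same placeholder index, might look like this (the 2m keep-alive is an arbitrary choice):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Open a scroll context that stays alive for 2 minutes between round trips
    resp = es.search(index="my-index", scroll="2m", size=1000,
                     body={"query": {"match_all": {}}})
    hits = []
    while resp["hits"]["hits"]:
        hits.extend(resp["hits"]["hits"])
        resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="2m")

    # Release the server-side scroll context as soon as you are done
    es.clear_scroll(scroll_id=resp["_scroll_id"])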

search_after, instead, is halfway between the two:

search_after is not a solution for jumping freely to a random page, but rather for scrolling many queries in parallel. It is very similar to the scroll API, but unlike it, the search_after parameter is stateless: it is always resolved against the latest version of the searcher. For this reason the sort order may change during a walk, depending on the updates and deletes of your index.

So it will allow you to paginate above 10k, at the cost of some possible inconsistency.
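Concretely, the flow is two sorted requests, where the second passes the sort values of the last hit of the first page as a cursor. A sketch with the Python client; the price and id fields and the index name are made up:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # First page: a sorted query, no cursor yet; the unique "id" field
    # acts as a tiebreaker so the order stays stable
    body = {
        "size": 100,
        "query": {"match_all": {}},
        "sort": [{"price": "asc"}, {"id": "asc"}],
    }
    page1 = es.search(index="my-index", body=body)

    # Next page: pass the sort values of the last hit as the cursor
    body["search_after"] = page1["hits"]["hits"][-1]["sort"]
    page2 = es.search(index="my-index", body=body)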

Why the 10k limit?

index.max_result_window is set to 10k as a hard limit to avoid out-of-memory situations:

index.max_result_window

The maximum value of from + size for searches to this index. Defaults to 10000. Search requests take heap memory and time proportional to from + size, and this limits that memory.
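The limit is a dynamic index setting, so it can be raised per index if you accept the extra heap usage. A sketch with the Python client (index name and the 20000 value are just examples):

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Dynamic index setting; raising it trades heap memory for deeper pages
    es.indices.put_settings(
        index="my-index",
        body={"index": {"max_result_window": 20000}},
    )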

What about sliced scroll?

Sliced scroll is just a faster way of doing a normal scroll: it allows downloading the collection of documents in parallel. A slice is just a subset of the documents in the scroll query output.
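A sketch of a sliced scroll with the Python client; two slices here, and in practice each slice would run in its own worker (the index name is a placeholder):

    from elasticsearch import Elasticsearch

    def scroll_slice(es, index, slice_id, max_slices):
        # Fetch one slice of a sliced scroll; run one call per slice in parallel
        hits = []
        resp = es.search(
            index=index, scroll="2m", size=1000,
            body={
                "slice": {"id": slice_id, "max": max_slices},
                "query": {"match_all": {}},
            },
        )
        while resp["hits"]["hits"]:
            hits.extend(resp["hits"]["hits"])
            resp = es.scroll(scroll_id=resp["_scroll_id"], scroll="2m")
        es.clear_scroll(scroll_id=resp["_scroll_id"])
        return hits

    es = Elasticsearch("http://localhost:9200")
    # Two slices downloaded independently; a thread pool could run them in parallel
    all_hits = scroll_slice(es, "my-index", 0, 2) + scroll_slice(es, "my-index", 1, 2)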

Search after is the recommended method for large result sets.

The general idea is to sort the results by a specific column and fetch up to 10K records. Then take that column's value from the last record, and fetch the next batch with column > "last value".

You can wrap this in your own Python function and make the entire pagination transparent to the application.
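A minimal sketch of such a wrapper as a Python generator, assuming each document carries a unique, sortable id field (hypothetical, as is the index name):

    from elasticsearch import Elasticsearch

    def paginate_all(es, index, query, page_size=1000):
        # Yields every matching document using search_after; sorting on a
        # unique field means pages never overlap or skip documents
        search_after = None
        while True:
            body = {
                "size": page_size,
                "query": query,
                "sort": [{"id": "asc"}],
            }
            if search_after is not None:
                body["search_after"] = search_after
            hits = es.search(index=index, body=body)["hits"]["hits"]
            if not hits:
                return
            yield from hits
            search_after = hits[-1]["sort"]  # cursor for the next page

    es = Elasticsearch("http://localhost:9200")
    for hit in paginate_all(es, "my-index", {"match_all": {}}):
        print(hit["_id"])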

See the example in this post: https://runkiss.blogspot.com/2021/11/create-graph-in-pdf-from-elasticsearch.html

    response_array = []

    # Initial search request; opens a scroll context kept alive for 1 hour
    response = ElkConfigClient.search(
      index: "index_name",
      body: {
        query: {
          bool: {
            must: [
              "search_query"
            ]
          }
        }
      },
      scroll: '1h',
      size: 1000
    )

    scroll_id = response["_scroll_id"]

    # Collect the first page of hits
    response["hits"]["hits"].each do |hit|
      response_array.push(hit)
    end

    # Keep fetching pages until an empty page marks the end of the scroll
    loop do
      next_response = ElkConfigClient.scroll(scroll_id: scroll_id, scroll: '1h')
      scroll_id = next_response["_scroll_id"]

      break if next_response["hits"]["hits"].empty?

      next_response["hits"]["hits"].each do |hit|
        response_array.push(hit)
      end
    end
