Elasticsearch Scroll

I am a little confused by Elasticsearch's scroll functionality. In Elasticsearch, is it possible to call the search API every time the user scrolls through the result set? From the documentation:

"search_type" => "scan",    // use search_type=scan
"scroll" => "30s",          // how long between scroll requests. should be small!
"size" => 50,               // how many results *per shard* you want back

Does that mean it will perform a search every 30 seconds and return all the sets of results until there are no more records?

For example, my ES query returns 500 records in total. I am getting the data from ES as two sets of 250 records each. Is there any way I can display the first set of 250 records first and then, when the user scrolls, the second set of 250 records? Please suggest.

What you are looking for is pagination.

You can achieve your objective by querying for a fixed size and setting the from parameter. Since you want to display results in batches of 250, you can set size = 250 and, with each consecutive query, increment the value of from by 250.

GET /_search?size=250                     ---- return first 250 results
GET /_search?size=250&from=250            ---- next 250 results 
GET /_search?size=250&from=500            ---- next 250 results
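
The same from/size arithmetic can be sketched in Python. This is only an illustration, not part of the original answer: the helper name page_params is hypothetical, and it merely builds the query-string parameters; sending them to /_search is left to whatever client is in use.

```python
def page_params(page_number, page_size=250):
    """Return the from/size parameters for a zero-based page number."""
    if page_number < 0 or page_size <= 0:
        raise ValueError("page_number must be >= 0 and page_size > 0")
    return {"from": page_number * page_size, "size": page_size}


# The three requests shown above correspond to pages 0, 1 and 2.
for page in range(3):
    print(page_params(page))
```

When the user scrolls to the end of the displayed batch, the front end would request the next page number and append the returned hits.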

In contrast, Scan & scroll lets you retrieve a large set of results with a single search and is ideally meant for operations like re-indexing data into a new index. Using it for displaying search results in real time is not recommended.

To explain Scan & scroll briefly: it scans the index for the query provided with the scan request and returns a scroll_id. This scroll_id can be passed to the next scroll request to return the next batch of results.

Consider the following example:

from elasticsearch import Elasticsearch

# Assumes an old client and cluster: search_type='scan' was removed in Elasticsearch 5.0
es = Elasticsearch()

# Initialize the scroll
page = es.search(
    index='yourIndex',
    doc_type='yourType',
    scroll='2m',
    search_type='scan',
    size=1000,
    body={
        # Your query's body
    }
)
sid = page['_scroll_id']
scroll_size = page['hits']['total']

# Start scrolling
while scroll_size > 0:
    print("Scrolling...")
    page = es.scroll(scroll_id=sid, scroll='2m')
    # Update the scroll ID
    sid = page['_scroll_id']
    # Get the number of results returned by the last scroll
    scroll_size = len(page['hits']['hits'])
    print("scroll size: " + str(scroll_size))
    # Do something with the obtained page
In the above example, the following events happen:

  • The scroll is initialized. This returns the scroll_id; with search_type = 'scan' the initial response carries only the total hit count, and the first batch of results arrives with the first scroll request.
  • For each subsequent scroll request, the updated scroll_id (received in the previous scroll response) is sent and the next batch of results is returned.
  • The scroll time is the time for which the search context is kept alive. If the next scroll request is not sent within that timeframe, the search context is lost and no further results are returned. This is why it should not be used for real-time results display on indexes with a huge number of docs.
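
The same loop can be wrapped in a generator so that callers simply iterate over hits. The following is a sketch under stated assumptions, not the answer's own code: it targets a plain scroll request (no search_type = 'scan'), where the first response already carries hits; it assumes a client exposing search, scroll and clear_scroll with the keyword arguments shown, as the Python client does in the versions I am aware of; and FakeClient is a hypothetical in-memory stand-in so that the sketch runs without a cluster.

```python
def iter_scroll_hits(client, index, query, keep_alive="2m", page_size=1000):
    """Yield every hit of a scrolled search, one page at a time."""
    page = client.search(index=index, body=query, scroll=keep_alive, size=page_size)
    scroll_id = page["_scroll_id"]
    try:
        while page["hits"]["hits"]:
            for hit in page["hits"]["hits"]:
                yield hit
            # Each scroll call also renews the keep-alive of the search context.
            page = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = page["_scroll_id"]
    finally:
        # Free the search context early instead of waiting for it to expire.
        client.clear_scroll(scroll_id=scroll_id)


class FakeClient:
    """Hypothetical in-memory stand-in; not part of any Elasticsearch library."""

    def __init__(self, docs):
        self.docs = docs
        self.offset = 0
        self.size = 0
        self.cleared = False

    def _page(self):
        hits = self.docs[self.offset:self.offset + self.size]
        self.offset += self.size
        return {"_scroll_id": "fake-scroll-id", "hits": {"hits": hits}}

    def search(self, index, body, scroll, size):
        self.offset, self.size = 0, size
        return self._page()

    def scroll(self, scroll_id, scroll):
        return self._page()

    def clear_scroll(self, scroll_id):
        self.cleared = True


fake = FakeClient([{"_id": str(i)} for i in range(5)])
query = {"query": {"match_all": {}}}
ids = [hit["_id"] for hit in iter_scroll_hits(fake, "yourIndex", query, page_size=2)]
print(ids)
```

The loop stops at the first empty page, and the finally clause releases the search context even if the caller abandons the iteration early.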

You have misunderstood the purpose of the scroll property. It does not mean that Elasticsearch will fetch the next page of data after 30 seconds. When you make the first scroll request, you specify how long the scroll context should stay open; the scroll parameter tells Elasticsearch to close the scroll context after 30 seconds.

After the first scroll request you get back a scroll_id in the response. For the following pages, you pass that value to get the next page of the scroll response. If you do not make the next scroll request within 30 seconds, the scroll context is closed and you will not be able to get further pages for that scroll.

What you described as an example use case is actually search results pagination, which is available for any search query and is limited to 10k results. scroll requests are needed when you have to go beyond that 10k limit; with a scroll query you can fetch even the entire collection of documents.
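
To make the 10k limit concrete: Elasticsearch rejects a paginated request when from + size exceeds the index's result window (the index.max_result_window setting, 10,000 by default). The check below is a small sketch of that rule; the function name fits_result_window is hypothetical.

```python
DEFAULT_MAX_RESULT_WINDOW = 10_000  # default value of index.max_result_window


def fits_result_window(from_, size, max_result_window=DEFAULT_MAX_RESULT_WINDOW):
    """Return True if a from/size page stays inside the result window."""
    return from_ + size <= max_result_window


print(fits_result_window(9_750, 250))   # last page of 250 that still fits
print(fits_result_window(10_000, 250))  # past the window: pagination is rejected
```

A front end could use such a check to decide when plain pagination is no longer an option.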

Probably the source of confusion here is that the term scroll is ambiguous: it is the type of a query, and it is also the name of a parameter of such a query (as mentioned in other comments, it is the time ES will keep waiting for you to fetch the next chunk of the scroll).

scroll queries are heavy and should be avoided unless absolutely necessary. In fact, the docs say:

Scrolling is not intended for real time user requests, but rather for processing large amounts of data, ...

Now regarding your other question:

In elasticsearch is it possible to call search API everytime whenever the user scrolls on the result set?

Yes, even several parallel scroll requests are possible:

Each scroll is independent and can be processed in parallel like any scroll request.

The Scroll API documentation at elastic also explains this behaviour.

The result size of 10k is a default value and can be overridden at runtime, if necessary:

PUT /<your-index>/_settings
{ "index" : { "max_result_window" : 500000 } }

The lifetime of the scroll ID is defined in each scroll request with the parameter "scroll", e.g.

..
  "scroll" : "5m"
  ..

Using the scroll API is sensible, because Elasticsearch cannot return more than 10K results at a time.

In recent versions of Elasticsearch, you'll use search_after. The keep_alive you set there (on the point in time that search_after is paired with), much like the timeout in the scroll, only needs to cover the time you need to process one page.

That's because Elasticsearch keeps your "search context" alive for that amount of time and then removes it. Also, Elasticsearch won't fetch the next page for you automatically; you have to do that by sending a request with the ID from the last response.
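
As a rough sketch of how search_after pages are chained: each request repeats the same query and sort, and passes the sort values of the last hit from the previous page. The helper name search_after_body and the field names timestamp and id are hypothetical; only the request bodies are built here, and sending them is left to the client in use.

```python
def search_after_body(query, sort, page_size, last_hit=None):
    """Build a search body; pass the previous page's last hit to continue."""
    body = {"query": query, "sort": sort, "size": page_size}
    if last_hit is not None:
        # search_after takes the sort values of the last hit already seen.
        body["search_after"] = last_hit["sort"]
    return body


query = {"match_all": {}}
# Hypothetical fields; the second one acts as a tiebreaker for equal timestamps.
sort = [{"timestamp": "asc"}, {"id": "asc"}]

first = search_after_body(query, sort, 250)
# Pretend this was the last hit of the first response.
last_hit = {"_id": "42", "sort": [1700000000000, "42"]}
second = search_after_body(query, sort, 250, last_hit)
print(second["search_after"])
```

Unlike from/size, nothing here depends on how deep the page is, which is why this approach is not bound by the result window.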
