
How the Elasticsearch Scroll API works internally

Original 2020-07-15 04:06:38 7 1 elasticsearch

I am using the Elasticsearch scroll API as documented here. As I understand it, each scroll request takes as input the scroll ID returned by the previous scroll response. Once all the chunks have been scrolled through, the last scroll ID needs to be cleared.
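
The request flow described above can be sketched as a small generator. This is a minimal sketch, not a definitive implementation: it assumes a client object exposing `search`, `scroll`, and `clear_scroll` with the keyword arguments used by the official Python client (elasticsearch-py 8.x style, where `search` accepts `query=`). The function name `scroll_hits`, the default page size, and the `1m` keep-alive are illustrative choices.

```python
from typing import Any, Dict, Iterator, List


def scroll_hits(client: Any, index: str, query: Dict[str, Any],
                page_size: int = 5000, keep_alive: str = "1m") -> Iterator[List[dict]]:
    """Yield pages of hits until the scroll is exhausted, then clear it."""
    # The first request opens the scroll and returns the first page.
    response = client.search(index=index, scroll=keep_alive,
                             size=page_size, query=query)
    scroll_id = response.get("_scroll_id")
    try:
        while True:
            hits = response["hits"]["hits"]
            if not hits:
                break  # an empty page means there is nothing left to read
            yield hits
            # Each follow-up request passes the most recently returned scroll ID.
            response = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Clear the last scroll ID even if the caller stops iterating early.
        if scroll_id is not None:
            client.clear_scroll(scroll_id=scroll_id)
```

Putting the cleanup in a `finally` block means the scroll is released even when the consumer raises or abandons the loop, instead of lingering until the keep-alive expires.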

Use Case

Consume a big data set (on the order of 0.1 to 2 million documents) matching a given query in chunks of 5000. Individual chunk query performance is good.
Data is most likely to be queried from a single index and shard.
The data being queried never gets updated in real time.

Questions / Concerns

How does Elasticsearch maintain the scroll session or state internally? Are all the matching documents (or their IDs) stored or parked aside in memory and returned in subsequent scroll requests? Should I be concerned about the RAM/CPU currently allocated to the cluster?
Is there any performance penalty when using the scroll API? I understand that the default maximum number of scroll sessions allowed at a time is 500. This default is acceptable in my case, as the number of requests per second is quite low.
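
One way to keep an eye on the open-scroll limit mentioned above is to read the per-node count of open search contexts from the node stats API. The helper below is a hedged sketch: it assumes the response shape of `GET _nodes/stats/indices/search`, where each node reports `indices.search.open_contexts`. The function name `open_search_contexts` is hypothetical.

```python
from typing import Any, Dict


def open_search_contexts(nodes_stats: Dict[str, Any]) -> Dict[str, int]:
    """Map node name -> number of open search contexts.

    Expects a dict shaped like the node stats response (assumption):
    {"nodes": {"<node_id>": {"name": ..., "indices": {"search": {...}}}}}
    """
    result: Dict[str, int] = {}
    for node_id, node in nodes_stats.get("nodes", {}).items():
        search_stats = node.get("indices", {}).get("search", {})
        # Fall back to the node ID when the node name is missing.
        result[node.get("name", node_id)] = int(search_stats.get("open_contexts", 0))
    return result
```

With the Python client, the input would typically come from something like `client.nodes.stats(metric="indices", index_metric="search")`. A count that keeps growing between runs suggests scrolls are not being cleared.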

1 Answer

During performance testing in my environment, with the scroll API and the scroll size set to 7,000, I observed GC pauses of up to 1.5 minutes and high CPU usage. (Obviously, this is also affected by the cluster configuration and the type of query that was run.)

From the documentation and an informative blog:

The results that are returned from a scroll request reflect the state of the data stream or index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.

The data matching the search request passed in the first scroll API call is kept aside in memory. Quoting from the mentioned blog:

As I mentioned above, scrolling works by taking a "snapshot" of your data and then serving it to you in pieces. This means that Elasticsearch must "hold" all of that in memory. Having to hold the scroll "snapshot" in memory while doing a lot of data updates can cause your memory to bloat. Memory bloat can lead to issues if you don't have a large surplus of memory to work with.

Short Answer: Yes, do consider heap and CPU usage while using the scroll API. Factors like requests per second and the optimal scroll size should be considered for the given cluster configuration.
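
When weighing scroll sizes, it can help to see how the page size translates into the number of requests for one full pass. The sketch below is plain arithmetic, assuming a client that keeps scrolling until it receives an empty page (so one extra request after the last page with data). The function name is illustrative.

```python
import math


def scroll_round_trips(total_hits: int, page_size: int) -> int:
    """Requests needed to read total_hits documents in pages of page_size."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    # Pages that carry data, plus the final request that returns no hits.
    return math.ceil(total_hits / page_size) + 1
```

For the data sizes in the question, a page size of 5000 gives 21 requests for 100,000 documents and 401 for 2 million. A page size of 7,000 brings the 2 million case down to 287 requests, but each response is larger.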