How the Elasticsearch Scroll API works internally
I am using the Elasticsearch scroll API as documented. It is well understood that each scroll request takes as input the scroll ID returned by the previous scroll response. Once all the chunks have been scrolled through, the last scroll ID needs to be cleared.
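The request flow described above can be sketched roughly as follows. This is a minimal illustration rather than a reference client: it uses the REST endpoints of the scroll API (an initial `_search?scroll=...` request, follow-up `_search/scroll` requests, and a final `DELETE _search/scroll`), while the helper names `http_transport` and `scroll_all`, the `send` callable, and the default page size and keep-alive values are assumptions made for this example.

```python
import json
import urllib.request


def http_transport(base_url):
    """Return a callable that sends JSON requests to an Elasticsearch node.

    Hypothetical helper: `base_url` is something like "http://localhost:9200".
    """
    def send(method, path, body):
        request = urllib.request.Request(
            base_url + path,
            data=json.dumps(body).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method=method,
        )
        with urllib.request.urlopen(request) as response:
            return json.loads(response.read().decode("utf-8"))
    return send


def scroll_all(send, index, query, size=1000, keep_alive="1m"):
    """Yield every hit for `query`, fetching one page of `size` hits at a time."""
    # The initial search opens the scroll context and returns the first page.
    page = send("POST", f"/{index}/_search?scroll={keep_alive}",
                {"size": size, "query": query})
    scroll_id = page.get("_scroll_id")
    try:
        while page["hits"]["hits"]:
            yield from page["hits"]["hits"]
            # Each follow-up request passes the scroll ID from the previous response.
            page = send("POST", "/_search/scroll",
                        {"scroll": keep_alive, "scroll_id": scroll_id})
            scroll_id = page.get("_scroll_id", scroll_id)
    finally:
        # Clear the last scroll ID so the cluster can release the search context.
        if scroll_id:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})
```

A caller would do something like `for hit in scroll_all(http_transport("http://localhost:9200"), "my-index", {"match_all": {}}, size=1000): ...`, where the URL and index name are placeholders. Passing the transport in as a callable keeps the paging logic separate from the HTTP details, and the `finally` block clears the scroll ID even if the caller stops iterating early.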
During performance testing in my environment, with the scroll size set to 7,000, I observed GC pause times of up to 1.5 minutes and high CPU usage. (Obviously, this is also affected by the cluster configuration and the type of query that was run.)
From the documentation and an informative blog:
The results that are returned from a scroll request reflect the state of the data stream or index at the time that the initial search request was made, like a snapshot in time. Subsequent changes to documents (index, update or delete) will only affect later search requests.
The data matching the search request passed in the first scroll API call is held in memory. Quoting from the blog mentioned above:
As I mentioned above, scrolling works by taking a "snapshot" of your data and then serving it to you in pieces. This means that Elasticsearch must "hold" all of that in memory. Having to hold the scroll "snapshot" in memory while doing a lot of data updates can cause your memory to bloat. Memory bloat can lead to issues if you don't have a large surplus of memory to work with.
Short answer: yes, do consider heap and CPU usage when using the scroll API. Factors such as requests per second and the scroll size should be tuned for the given cluster configuration.