
What considerations should I take into account when increasing the size in the Scroll API in Elasticsearch?

I am currently toying around with the Scroll API of Elasticsearch, and want to use it to obtain a large set of data and do some manual processing on it. The processing is performed by an external library and is not of the type that can easily be included as a script.
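For concreteness, here is a minimal sketch of the loop I have in mind, written against the plain REST API with Python's requests library. The cluster address (a local node reachable without auth), the my-index name, the size of 1000 and the process_batch() function standing in for the external library are all assumptions for illustration.

```python
import requests

ES = "http://localhost:9200"   # assumed: local cluster, no security
INDEX = "my-index"             # hypothetical index name
SIZE = 1000                    # the scroll size being tuned
KEEP_ALIVE = "2m"              # how long ES keeps the scroll context alive

def process_batch(hits):
    """Placeholder for the external processing library."""
    pass

# Open the scroll: the first request returns the first batch plus a scroll_id.
resp = requests.post(
    f"{ES}/{INDEX}/_search",
    params={"scroll": KEEP_ALIVE},
    json={"size": SIZE, "query": {"match_all": {}}},
).json()

scroll_id = resp["_scroll_id"]
hits = resp["hits"]["hits"]

while hits:
    process_batch(hits)  # external processing happens between scroll requests

    # Fetch the next batch; each scroll request also renews the keep-alive window.
    resp = requests.post(
        f"{ES}/_search/scroll",
        json={"scroll": KEEP_ALIVE, "scroll_id": scroll_id},
    ).json()
    scroll_id = resp["_scroll_id"]
    hits = resp["hits"]["hits"]

# Free the scroll context explicitly once done.
requests.delete(f"{ES}/_search/scroll", json={"scroll_id": scroll_id})
```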

While this seems to work nicely at the moment, I was wondering what considerations I should take into account when fine-tuning the scroll size for performing this form of processing. A quick observation seems to indicate that increasing the scroll size will reduce the latency of the operation. While I suspect that larger scroll sizes will generally reduce throughput, I have no idea whether this hypothesis is correct. Also, I have no idea if there are any other consequences that I do not envision right now.

So to summarize, my question is: what impact does changing Elasticsearch's scroll size have, especially on performance, in a scenario where the results are processed for each batch that is obtained?

Thanks in advance!

The one consideration (and the only one I know of) is to process each batch quickly enough that the scroll context is not released; the context lifetime is controlled by the ?scroll=X parameter.

Assuming that you will consume all the data from the query, the scroll size should be tuned based on network and third-party app performance, i.e.:

  • if your app can process data in a stream-like manner, bigger chunks are better
  • if your app processes data in batches (waiting for the full ES response first), the upper limit for the batch size should guarantee that processing time < scroll keep-alive time (see the sketch after this list)
  • if you work in a poor network environment, a smaller batch size is better for handling the overhead of dropped connections/retries
  • generally, a bigger batch is better, as it eliminates some per-request network/ES CPU overhead
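As a rough way to check the second point, you could time each batch against the scroll keep-alive window. The helper below is only a sketch: the 0.8 headroom factor, KEEP_ALIVE_SECONDS and the process_batch() placeholder are assumptions, not part of any Elasticsearch client.

```python
import time

KEEP_ALIVE_SECONDS = 120   # must match the keep-alive used on the requests (e.g. scroll=2m)

def process_batch(hits):
    """Placeholder for the external processing library (assumption)."""
    pass

def timed_process(hits):
    """Process one batch and flag when it gets close to the scroll window."""
    start = time.monotonic()
    process_batch(hits)
    elapsed = time.monotonic() - start
    if elapsed > 0.8 * KEEP_ALIVE_SECONDS:
        # Little headroom left: either lower the scroll size
        # or raise the ?scroll= keep-alive for the next run.
        print(f"batch of {len(hits)} docs took {elapsed:.1f}s, "
              f"close to the {KEEP_ALIVE_SECONDS}s scroll window")
    return elapsed
```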
