How can I optimize the time parameter in place when using the ElasticSearch Scroll API?

Question

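I am using the Elasticsearch scroll API to return a large number of documents. According to the documentation,

"The scroll expiry time is refreshed every time we run a scroll request, so it only needs to be long enough to process the current batch of results, not all of the documents that match the query. The timeout is important because keeping the scroll window open consumes resources and we want to free them as soon as they are no longer needed. Setting the timeout enables Elasticsearch to automatically free the resources after a small period of inactivity."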

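My question is how to optimize the time parameter. I have some instances where I need to request and process about 600 pages, and it fails at page 300 (a long way to go!). I suspect that if I could optimize the time parameter that is passed, it would use ES resources more efficiently and be less prone to failure. This code is being tested on a cluster here, but will probably be ported to many other clusters, so I would like the optimization of the time parameter to adapt to the cluster. Also, I don't want to use more resources on the ES cluster than I need, because other users will presumably be using it too.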

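Here is my idea. On the initial scroll request, pass a generous time parameter, say 5m, and then time how long it takes to return the first page of results. On the second scroll request, pass a time parameter that is only slightly larger than the observed time of the first request. Inductively, each page request gets a time slightly larger than the previously observed page completion time. This assumes that, since each page returns the same number of docs (of very nearly the same size in my case), the time needed to return that page is roughly the same as those previously observed. Does this assumption hold?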
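As a rough sketch of the idea above (not tested against a real cluster), the next scroll timeout could be derived from the last observed page time like this. The helper name `next_scroll_timeout`, the 1.5x safety factor, and the 5-second floor are illustrative assumptions, not values taken from the Elasticsearch documentation.

```python
import math

def next_scroll_timeout(observed_seconds, factor=1.5, floor_seconds=5):
    """Return a scroll timeout string slightly longer than the last page time.

    factor and floor_seconds are illustrative assumptions, not tuned values.
    """
    seconds = max(floor_seconds, math.ceil(observed_seconds * factor))
    return "%ds" % seconds  # Elasticsearch time value such as "34s"

print(next_scroll_timeout(22.2))  # timeout to send with the next scroll request
print(next_scroll_timeout(0.4))   # very fast pages fall back to the floor
```

The returned string would be passed as the `scroll` argument of the next scroll request, with the page time measured around the previous one.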

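Is there a more intelligent way to adapt the time parameter, and for that matter the size parameter (in the idea above, the size parameter stays fixed)?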

Answer 1

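OK, I did some data analysis and found a few things empirically. For many different sizes, I ran 10-20 pages of a scroll API query. For a fixed size, the time it took to return a page was roughly Gaussian distributed, with the means given below.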

means =  {1000: 6.0284869194030763,
 1500: 7.9487858772277828,
 2000: 12.139444923400879,
 2500: 18.494202852249146,
 3000: 22.169868159294129,
 3500: 28.091009926795959,
 4000: 36.068559408187866,
 5000: 53.229292035102844}
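One way to read the means above is to divide each by its page size. The sketch below does that with rounded copies of the values; it assumes the timings are in seconds, which the measurements themselves do not state, and the function name `time_per_thousand` is made up for this example.

```python
# Rounded copies of the mean page times listed above (unit assumed to be seconds).
means = {1000: 6.03, 1500: 7.95, 2000: 12.14, 2500: 18.49,
         3000: 22.17, 3500: 28.09, 4000: 36.07, 5000: 53.23}

def time_per_thousand(means_by_size):
    """Mean page time divided by page size, expressed per 1000 documents."""
    return {size: t / (size / 1000.0) for size, t in means_by_size.items()}

per_k = time_per_thousand(means)
for size in sorted(per_k):
    print("size=%d: %.2f per 1000 docs" % (size, per_k[size]))
```

On these particular measurements the cost per 1000 documents is lowest at size 1500 and roughly doubles by size 5000, so a larger size does not by itself shorten the total scroll.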

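My next thought was that this might depend on whether other queries were running on the machine, so I ran an experiment in which half of the pages were the only request to ES and the other half were fetched while a second scroll query was running. The timing didn't seem to change.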

Finally, since the time will depend on the given ES configuration, bandwidth, and so on, I propose this solution.

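Set a generous scroll timeout for the initial page.
Time each page.

Use a weighted running average between the observed time plus a little and the initial time (so the time parameter is always a bit larger than needed, but decreases toward the mean plus a little). Here is an example: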

import time

## client, index, doc_type and q (the query body) are assumed to be defined elsewhere.
tries = 0
size = 3000
wait_time = 120  ## generous starting time, in seconds (2 minutes)
returned_hits = {}  ## page -> list of hits

while tries < 3:
    try:
        print("\n\tRunning the alert scroll query with size = %s... " % size)
        ## search_type='scan' and doc_type are only accepted by older Elasticsearch versions
        page = client.search(index=index, doc_type=doc_type, body=q,
                             scroll='1m', search_type='scan', size=size)

        sid = page['_scroll_id']  ## scroll id
        total_hits = page['hits']['total']  ## how many results there are
        print("\t\t There are %s hits total." % total_hits)
        p = 0  ## page count
        doc_count = 0  ## document count
        scroll_size = total_hits  ## a scan request returns no hits itself, so start from the total

        # Start scrolling
        while scroll_size > 0:
            p += 1
            print("\t\t Scrolling to page %s ..." % p)
            start = time.time()
            ## round up to whole seconds for the scroll timeout
            page = client.scroll(scroll_id=sid, scroll='%ds' % (int(wait_time) + 1))
            end = time.time()
            ## update wait_time using a weighted running average
            wait_time = ((end - start + 10) + float(wait_time * p)) / (p + 1)
            print("\t\t Page %s took %s seconds. We change the time to %s" % (p, end - start, wait_time))
            sid = page['_scroll_id']  # Update the scroll ID
            scroll_size = len(page['hits']['hits'])  ## no. of hits returned on this page
            print("\t\t Page %s has returned %s hits. Storing .." % (p, scroll_size))
            returned_hits[p] = page['hits']['hits']
            doc_count += scroll_size  ## update the total count of docs processed
            print("\t\t Returned and stored %s docs of %s \n" % (doc_count, total_hits))
        tries = 3  ## set tries to three so we exit the while loop!
    except Exception as e:
        print("\t\t ---- Error on try %s\n\t\t size was %s, wait_time was %s s, \n\t\terror message = %s" % (tries, size, wait_time, e))
        tries += 1  ## increment tries, and do it again until 3 tries.
        # wait_time *= 2  ## double the time interval for the next go round
        size = int(.8 * size)  ## lower size of docs per shard returned.
        if tries == 3:
            print("\t\t three strikes and you're out! (failed three times in a row to execute the alert query). Exiting. ")
        else:
            print('\t\t ---- trying again for the %s-th time ...' % (tries + 1))
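To see how the update rule above behaves, here is a small standalone simulation with no Elasticsearch calls. It assumes wait_time is tracked in seconds, that every page takes a constant 22 seconds, and that the margin is 10 seconds; the function name `update_wait_time` is made up for this sketch.

```python
def update_wait_time(wait_time, elapsed, page, margin=10.0):
    """Weighted running average of (elapsed + margin) and the previous wait_time."""
    return ((elapsed + margin) + wait_time * page) / (page + 1)

wait_time = 120.0  # generous initial value, in seconds (assumption)
history = []
for page in range(1, 51):
    # Pretend every page takes exactly 22 seconds (assumption for the demo).
    wait_time = update_wait_time(wait_time, elapsed=22.0, page=page)
    history.append(wait_time)

print("after page 1:  %.1f s" % history[0])
print("after page 10: %.1f s" % history[9])
print("after page 50: %.1f s" % history[49])
```

With a constant page time the value decays from the initial 120 s toward 32 s (page time plus margin) without dropping below it. Because each new observation is weighted by 1/(p+1), a sudden slowdown late in a long scroll is absorbed slowly; that is a property of this rule rather than a recommendation.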

Solution 1
1 Accepted 2016-09-28 18:40:59
