When using ElasticSearch Scroll API, how to optimize the time parameter in situ?

I'm using the elasticsearch scroll api to return a large number of documents. According to the documentation,

"The scroll expiry time is refreshed every time we run a scroll request, so it only needs to be long enough to process the current batch of results, not all of the documents that match the query.The timeout is important because keeping the scroll window open consumes resources and we want to free them as soon as they are no longer needed. Setting the timeout enables Elasticsearch to automatically free the resources after a small period of inactivity." "scroll到期时间在我们每次运行滚动请求时都会刷新,所以它只需要足够长的时间来处理当前批次的结果,而不是所有匹配查询的文档。超时很重要,因为保持滚动窗口“open 消耗资源,我们希望在不再需要它们时立即释放它们。设置超时使 Elasticsearch 能够在一小段时间不活动后自动释放资源。”

My question is how to optimize the time parameter? I've had some instances where I need to request and process ~600 pages, and it will fail on page 300 (a long way in!). I suspect that if I could optimize the time parameter that is passed, it would use the ES resources much more efficiently and not be prone to failing. This code is being tested on one cluster here, but may be ported to many other clusters, so I'd like the time parameter to adapt itself to whichever cluster it runs on. Also, I don't want to use more resources on the ES cluster than I need, because other users will presumably be using it too.

Here's my idea. On the initial scroll request, pass a generous time parameter, say 5m, and then time how long it takes to return the first page of results. Then on the second scroll request, pass a time parameter that's just a little bit bigger than the observed time the first request took. Inductively, each page request will come with a time slightly larger than the previously observed page's completion time. This assumes that since each page returns the same number of docs (very nearly the same size in my case), the time it needs to return that page is roughly identical to the previously observed ones. Does this assumption hold?
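
For concreteness, here is a minimal sketch of that inductive scheme, assuming an elasticsearch-py client where `client` and `sid` have already been obtained from an initial search request; the 10-second safety margin is an arbitrary choice, not a tuned value:

    import time

    wait_time = 300.0  # generous initial timeout in seconds, i.e. '5m'
    margin = 10.0      # safety margin in seconds (arbitrary choice)

    while True:
        start = time.time()
        page = client.scroll(scroll_id=sid, scroll='%ds' % int(wait_time))
        elapsed = time.time() - start
        if not page['hits']['hits']:
            break                     # no hits left: scrolling is done
        sid = page['_scroll_id']      # the scroll id can change between pages
        wait_time = elapsed + margin  # next timeout: last observed time + a bit

With this scheme, a slow cluster naturally pushes the timeout up and a fast one pulls it down, which is the self-adapting behavior described above.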

Are there more intelligent ways to adapt the time parameter? And, for that matter, to adapt the size parameter (in the idea above, the size parameter remains fixed)?

OK, I did some data analysis and found a few things empirically. For many different sizes, I ran 10-20 pages of a scroll API query. For a fixed size, the time it took to return a page was roughly Gaussian, with the means given below.

means =  {1000: 6.0284869194030763,
 1500: 7.9487858772277828,
 2000: 12.139444923400879,
 2500: 18.494202852249146,
 3000: 22.169868159294129,
 3500: 28.091009926795959,
 4000: 36.068559408187866,
 5000: 53.229292035102844}
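
As a quick sanity check on those numbers (assuming they are in seconds, which is what the time.time() deltas in the solution code further down measure), the cost per document is not constant: it grows with page size, which is one more argument for keeping size moderate.

    # size -> mean page time (seconds), rounded from the means above
    means = {1000: 6.03, 1500: 7.95, 2000: 12.14, 2500: 18.49,
             3000: 22.17, 3500: 28.09, 4000: 36.07, 5000: 53.23}
    for size in sorted(means):
        # per-document cost climbs from ~6.0 ms to ~10.6 ms as size grows
        print('size %4d: %5.2f ms per doc' % (size, means[size] / size * 1000.0))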

The next thought I had was that this may depend on whether other queries are being run on the machine, so I ran the experiment with half of the pages being the only request hitting ES, and the other half while a second scroll query was running. The timing didn't seem to change.

Finally, since the times will depend on the given ES configuration, bandwidth, etc., I propose this solution.

  1. Set a generous page time for the initial page.
  2. Time each page.
  3. Use a weighted running average between the observed time plus a little bit and the initial time (so your time parameter is always a bit bigger than needed, but decreases toward the mean plus a little bit). Here's an example:

    ## Assumes `client` (elasticsearch-py), `index`, `doc_type` and the
    ## query body `q` are already defined.
    import sys
    import time

    tries = 0
    size = 3000
    wait_time = 2        ## generous start time, in minutes
    returned_hits = {}   ## page number -> list of hits

    while tries < 3:
        try:
            print("\n\tRunning the alert scroll query with size = %s... " % size)
            page = client.search(index=index, doc_type=doc_type, body=q,
                                 scroll='1m', search_type='scan', size=size)

            sid = page['_scroll_id']            ## scroll id
            total_hits = page['hits']['total']  ## how many results there are
            print("\t\t There are %s hits total." % total_hits)
            p = 0            ## page count
            doc_count = 0    ## document count
            scroll_size = 1  ## non-zero so we enter the loop

            # Start scrolling
            while scroll_size > 0:
                p += 1
                print("\t\t Scrolling to page %s ..." % p)
                start = time.time()
                page = client.scroll(scroll_id=sid, scroll=str(wait_time) + 'm')
                end = time.time()

                ## Update wait_time with a weighted running average: the
                ## observed seconds plus a 10 s margin, converted to minutes.
                wait_time = ((end - start + 10) / 60.0 + wait_time * p) / (p + 1)
                print("\t\t Page %s took %s seconds. We change the time to %s"
                      % (p, end - start, wait_time))

                sid = page['_scroll_id']                 # Update the scroll ID
                scroll_size = len(page['hits']['hits'])  ## hits returned on this page
                print("\t\t Page %s has returned %s hits. Storing .." % (p, scroll_size))
                returned_hits[p] = page['hits']['hits']
                doc_count += scroll_size  ## running total of docs processed
                print("\t\t Returned and stored %s docs of %s \n" % (doc_count, total_hits))

            tries = 3  ## set tries to three so we exit the outer while loop!
        except Exception:
            e = sys.exc_info()[0]
            print("\t\t ---- Error on try %s\n\t\t size was %s, wait_time was %s min,"
                  "\n\t\terror message = %s" % (tries, size, wait_time, e))
            tries += 1             ## increment tries, and try again until 3 tries
            # wait_time *= 2       ## double the time interval for the next go round
            size = int(.8 * size)  ## lower the number of docs returned per page
            if tries == 3:
                print("\t\t three strikes and you're out! (failed three times "
                      "in a row to execute the alert query). Exiting.")
            else:
                print("\t\t ---- trying again for the %s-th time ..." % (tries + 1))
