How to optimize the time parameter in situ when using the ElasticSearch Scroll API?

Question

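I am using the elasticsearch scroll api to return a large number of documents. According to the documentation,

"The scroll expiry time is refreshed every time we run a scroll request, so it only needs to be long enough to process the current batch of results, not all of the documents that match the query. The timeout is important because keeping the scroll window open consumes resources and we want to free them as soon as they are no longer needed. Setting the timeout enables Elasticsearch to automatically free the resources after a small period of inactivity."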

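My question is how to optimize the time parameter. I have some instances where I need to request and process about 600 pages, and it fails on page 300 (a long way through!). I suspect that if I could optimize the time parameter that is passed, it would use ES resources more efficiently and be less prone to failing. This code is being tested on the cluster here, but will probably be ported to many other clusters, so I would like the optimization of the time parameter to adapt to the cluster. Also, I don't want to use more resources on the ES cluster than I need, since other users will presumably be using it too.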

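Here is my idea. On the initial scroll request, pass a generous time parameter, say 5m, and then time how long it takes to return the first page of results. On the second scroll request, pass a time parameter that is just a little larger than the observed time the first request took. Inductively, each page request gets a time slightly longer than the previously observed page completion time. This assumes that, since each page returns the same number of documents (of very nearly the same size in my case), the time needed to return a page is roughly the same as those previously observed. Does this assumption hold?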

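Are there more intelligent ways to adapt the time parameter? And, for that matter, the size parameter (in the idea above, the size parameter stays fixed)?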

Answer 1

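OK, I did some data analysis and found a few things empirically. For many different sizes, I ran 10-20 pages of a scroll api query. For a fixed size, the time it took to return a page was roughly Gaussian distributed, with the means given below.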

means =  {1000: 6.0284869194030763,
 1500: 7.9487858772277828,
 2000: 12.139444923400879,
 2500: 18.494202852249146,
 3000: 22.169868159294129,
 3500: 28.091009926795959,
 4000: 36.068559408187866,
 5000: 53.229292035102844}
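
Since the question also asks about the size parameter, one way to read these means is as a cost per unit of requested size. The sketch below only divides each mean by its size; it assumes the values are in seconds (the answer does not state the unit), and the helper name `ms_per_doc` is made up for illustration. With only 10-20 pages measured per size, small differences may well be noise.

```python
# Mean page times copied from the measurements above (unit assumed to be seconds).
means = {1000: 6.0284869194030763,
         1500: 7.9487858772277828,
         2000: 12.139444923400879,
         2500: 18.494202852249146,
         3000: 22.169868159294129,
         3500: 28.091009926795959,
         4000: 36.068559408187866,
         5000: 53.229292035102844}


def ms_per_doc(means_by_size):
    """Return {size: mean page time divided by size, in milliseconds}."""
    return {size: 1000.0 * t / size for size, t in means_by_size.items()}


per_doc = ms_per_doc(means)
for size in sorted(per_doc):
    print("size=%d  %.2f ms per unit of size" % (size, per_doc[size]))

# Size with the lowest cost per unit in this particular data set.
best_size = min(per_doc, key=per_doc.get)
print("lowest cost per unit of size:", best_size)
```

In these numbers the cost per unit of size tends to rise as the page size grows, which suggests that larger pages were not cheaper per document on this cluster. Other clusters may behave differently.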

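My next thought was that this might depend on whether other queries were running on the machine, so I ran an experiment in which half of the pages were fetched while mine was the only request to ES and the other half while a second scroll query was running. The timing did not seem to change.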

Finally, since the times will depend on the given ES configuration, bandwidth, and so on, I propose this solution.

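1. Set a generous page time for the initial page.
2. Time each page.
3. Use a weighted running average between the observed time plus a little bit and the initial time (so your time parameter is always a bit bigger than needed, but decreases toward the mean plus a little bit).

Here is an example: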

import time

# Assumes `client` (an elasticsearch.Elasticsearch instance), `index`, `doc_type`
# and the query body `q` are defined elsewhere.
tries = 0
size = 3000
wait_time = 120.0  ## generous starting scroll timeout, in seconds (2 minutes)
returned_hits = {}  ## page number -> list of hits

while tries < 3:
    try:
        print("\n\tRunning the alert scroll query with size = %s... " % size)
        ## With search_type='scan' (older Elasticsearch versions) this first call
        ## returns only a scroll id; the hits arrive with the scroll requests below.
        page = client.search(index=index, doc_type=doc_type, body=q,
                             scroll='1m', search_type='scan', size=size)
        sid = page['_scroll_id']  ## scroll id
        total_hits = page['hits']['total']  ## how many results there are
        print("\t\t There are %s hits total." % total_hits)

        p = 0  ## page count
        doc_count = 0  ## document count
        scroll_size = 1  ## placeholder so the scroll loop runs at least once
        returned_hits.clear()  ## drop pages left over from a failed attempt

        # Start scrolling
        while scroll_size > 0:
            p += 1
            print("\t\t Scrolling to page %s ..." % p)
            start = time.time()
            ## pass the timeout as whole seconds, rounded up
            page = client.scroll(scroll_id=sid, scroll='%ds' % (int(wait_time) + 1))
            elapsed = time.time() - start

            ## update wait_time using a weighted running average (all in seconds)
            wait_time = ((elapsed + 10) + wait_time * p) / (p + 1)
            print("\t\t Page %s took %s seconds. We change the time to %s" % (p, elapsed, wait_time))

            sid = page['_scroll_id']  ## update the scroll id
            scroll_size = len(page['hits']['hits'])  ## no. of hits returned on this page
            print("\t\t Page %s has returned %s hits. Storing .." % (p, scroll_size))
            returned_hits[p] = page['hits']['hits']
            doc_count += scroll_size  ## update the total count of docs processed
            print("\t\t Returned and stored %s docs of %s \n" % (doc_count, total_hits))

        tries = 3  ## set tries to three so we exit the while loop!

    except Exception as e:
        print("\t\t ---- Error on try %s\n\t\t size was %s, wait_time was %s s, \n\t\terror message = %s"
              % (tries, size, wait_time, e))
        tries += 1  ## increment tries, and do it again until 3 tries
        # wait_time *= 2  ## double the time interval for the next go round
        size = int(.8 * size)  ## lower size of docs per shard returned
        if tries == 3:
            print("\t\t three strikes and you're out! (failed three times in a row to execute the alert query). Exiting. ")
        else:
            print('\t\t ---- trying again for the %s-th time ...' % (tries + 1))
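
To see how the weighted running average in the loop above behaves without a cluster, the update rule can be pulled out into a small function and fed simulated page times. This is only a sketch: the names `next_wait_time` and `scroll_param`, the constant 22 second page time, and the choice to keep every value in seconds are assumptions made for illustration.

```python
def next_wait_time(wait_time, elapsed, page, margin=10.0):
    """Weighted running average of the previous timeout and (elapsed + margin).

    All times are in seconds; `page` is the 1-based number of the page just fetched.
    """
    return ((elapsed + margin) + wait_time * page) / (page + 1)


def scroll_param(wait_time):
    """Format a timeout as a whole number of seconds above wait_time, e.g. '33s'."""
    return "%ds" % (int(wait_time) + 1)


# Simulate 600 pages that each take a constant 22 seconds (made-up value).
wait_time = 120.0  # generous starting timeout
history = []
for page in range(1, 601):
    wait_time = next_wait_time(wait_time, elapsed=22.0, page=page)
    history.append(wait_time)

print("after page 1:   %.2f s" % history[0])
print("after page 600: %.2f s -> scroll=%s" % (history[-1], scroll_param(history[-1])))
```

With a constant page time the timeout starts high and decays toward the observed time plus the margin, staying above it throughout. Because the average reacts slowly once many pages have been seen, a sudden slowdown would raise the timeout only gradually.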
