
When using ElasticSearch Scroll API, how to optimize the time parameter in situ?

I'm using the Elasticsearch scroll API to return a large number of documents. According to the documentation,

"The scroll expiry time is refreshed every time we run a scroll request, so it only needs to be long enough to process the current batch of results, not all of the documents that match the query.The timeout is important because keeping the scroll window open consumes resources and we want to free them as soon as they are no longer needed. Setting the timeout enables Elasticsearch to automatically free the resources after a small period of inactivity."

My question is: how do I optimize the time parameter? I've had instances where I need to request and process ~600 pages, and it fails on page 300 (a long way in!). I suspect that if I could tune the time parameter, the query would use the ES cluster's resources more efficiently and be less prone to failing. This code is being tested on one cluster here, but it will be ported to possibly many other clusters, so I'd like the optimization of the time parameter to adapt itself to whatever cluster it runs on. Also, I don't want to use more resources on the ES cluster than I need, because other users will presumably be using it too.

Here's my idea. On the initial scroll request, pass a generous time parameter, say 5m, and then time how long it takes to return the first page of results. On the second scroll request, pass a time parameter that's just a little bigger than the observed time of the first request. Inductively, each page request gets a time parameter slightly larger than the previous page's observed completion time. This assumes that, since each page returns the same number of docs (and very nearly the same amount of data in my case), the time it needs is roughly the same as for the previously observed pages. Does this assumption hold?
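
For illustration, a minimal sketch of that naive scheme might look like this (the `client`, `index`, `doc_type` and `q` variables are assumed to be set up as in the full code further down, and the 1.2 safety factor is just a guess; result handling is omitted, only the timeout adaptation is shown):

    import time

    PAD = 1.2   # hypothetical safety margin on top of the observed page time

    page = client.search(index=index, doc_type=doc_type, body=q,
                         scroll='5m', size=3000)    # generous first timeout
    sid = page['_scroll_id']
    scroll = '5m'

    while True:
        start = time.time()
        page = client.scroll(scroll_id=sid, scroll=scroll)
        elapsed = time.time() - start
        if not page['hits']['hits']:
            break
        sid = page['_scroll_id']
        # next timeout: just a little bigger than the time this page actually took
        scroll = '%ds' % max(1, int(elapsed * PAD))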

Are there more intelligent ways to adapt the time parameter? And, for that matter, the size parameter? (In the idea above, the size parameter stays fixed.)

OK, I did some data analysis and found a few things empirically. For many different sizes I ran 10-20 pages of a scroll API query. For a fixed size, the time it took to return a page was roughly Gaussian, with means (in seconds) given below.

    means = {1000: 6.0284869194030763,
             1500: 7.9487858772277828,
             2000: 12.139444923400879,
             2500: 18.494202852249146,
             3000: 22.169868159294129,
             3500: 28.091009926795959,
             4000: 36.068559408187866,
             5000: 53.229292035102844}
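
(For anyone who wants to repeat this measurement on their own cluster, it can be collected with something along these lines; `client`, `index`, `doc_type` and `q` are the same assumed variables as in the code at the end of this post.)

    import time

    def time_scroll_pages(size, n_pages=15, scroll='5m'):
        """Time the first n_pages scroll requests for a given page size."""
        page = client.search(index=index, doc_type=doc_type, body=q,
                             scroll=scroll, search_type='scan', size=size)
        sid = page['_scroll_id']
        times = []
        for _ in range(n_pages):
            start = time.time()
            page = client.scroll(scroll_id=sid, scroll=scroll)
            times.append(time.time() - start)
            sid = page['_scroll_id']
            if not page['hits']['hits']:
                break
        return times

    means = {}
    for s in (1000, 1500, 2000, 2500, 3000, 3500, 4000, 5000):
        t = time_scroll_pages(s)
        means[s] = sum(t) / len(t)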

The next thought I had was that this might depend on whether other queries are being run against the cluster, so I repeated the experiment with half of the pages as the only request hitting ES and half while a second scroll query was running concurrently. The timing didn't seem to change.

Finally, since the times will depend on the given ES configuration, bandwidth, etc., I propose this solution:

  1. set a generous page time for the initial page.
  2. time each page
  3. update the time parameter with a weighted running average of (observed time + a small buffer) and the previous estimate, seeded with the generous initial time (so your time parameter is always a bit bigger than needed, but decreases toward the mean plus the buffer). Here's an example:

    import sys
    import time

    ## assumes client (an Elasticsearch instance), index, doc_type and q (the query body) are already defined

    tries = 0
    size = 3000
    wait_time = 120          ## generous start time, in seconds (2 minutes)
    returned_hits = {}       ## page number -> list of hits

    while tries < 3:
        try:
            print "\n\tRunning the alert scroll query with size = %s... " % (size)
            page = client.search(index=index, doc_type=doc_type, body=q,
                                 scroll='1m', search_type='scan', size=size)

            sid = page['_scroll_id']            ## scroll id
            total_hits = page['hits']['total']  ## how many results there are
            print "\t\t There are %s hits total." % (total_hits)

            p = 0                       ## page count
            doc_count = 0               ## document count
            scroll_size = total_hits    ## make sure we enter the loop at least once

            # Start scrolling
            while scroll_size > 0:
                p += 1
                print "\t\t Scrolling to page %s ..." % p
                start = time.time()
                page = client.scroll(scroll_id=sid, scroll='%ds' % int(wait_time))
                end = time.time()

                ## update wait_time with a weighted running average: the observed seconds
                ## plus a 10 s buffer, averaged against the previous estimate
                wait_time = ((end - start + 10) + float(wait_time * p)) / (p + 1)
                print "\t\t Page %s took %s seconds. We change the time to %s" % (p, end - start, wait_time)

                sid = page['_scroll_id']                  # update the scroll ID
                scroll_size = len(page["hits"]["hits"])   ## no. of hits returned on this page
                print "\t\t Page %s has returned %s hits. Storing .." % (p, scroll_size)

                returned_hits[p] = page['hits']['hits']
                doc_count += scroll_size                  ## update the total count of docs processed
                print "\t\t Returned and stored %s docs of %s \n" % (doc_count, total_hits)

            tries = 3   ## set tries to three so we exit the outer while loop!

        except:
            e = sys.exc_info()[0]
            print "\t\t ---- Error on try %s\n\t\t size was %s, wait_time was %s sec, \n\t\terror message = %s" % (tries, size, wait_time, e)
            tries += 1                  ## increment tries, and do it again until 3 tries
            # wait_time *= 2            ## double the time interval for the next go round
            size = int(.8 * size)       ## lower the number of docs per shard returned
            if tries == 3:
                print "\t\t three strikes and you're out! (failed three times in a row to execute the alert query). Exiting."
            else:
                print '\t\t ---- trying again for the %s-th time ...' % (tries + 1)
