I am trying to scroll through all documents from Python when querying Elasticsearch, so that I can retrieve more than 10,000 results:
from elasticsearch import Elasticsearch
es = Elasticsearch(ADDRESS, port=PORT)
result = es.search(
    index="INDEX",
    body=es_query,
    size=10000,
    scroll="3m")
scroll_id = result['_scroll_id']
scroll_size = result["hits"]["total"]
counter = 0
print('total items= ' + scroll_size)
while(scroll_size > 0):
    counter +=len(result['hits']['hits'])
    result = es.scroll(scroll_id=scroll_id, scroll="1s")
    scroll_id = result['_scroll_id']
print('found = ' +counter)
The problem is that sometimes the counter (the sum of the results at the end of the program) is smaller than result["hits"]["total"]. Why is that? Why does scroll not iterate over all the results?
Elasticsearch version: 5.6
Lucene version: 6.6
If I'm not mistaken, you're adding the initial result["hits"]["total"] to your counter in the first iteration of the while loop -- but you should be adding just the length of the retrieved hits:
scroll_id = result['_scroll_id']
total = result["hits"]["total"]  # total number of matching documents
print('total = %d' % total)
scroll_size = len(result["hits"]["hits"])  # this is the current 'page' size
counter = 0
while scroll_size > 0:
    counter += scroll_size  # count only the hits in the current page
    result = es.scroll(scroll_id=scroll_id, scroll="1s")
    scroll_id = result['_scroll_id']
    scroll_size = len(result['hits']['hits'])
print('counter = %d' % counter)
assert counter == total
As a matter of fact, you don't need to store the scroll size separately -- a more concise while loop would be:
while len(result['hits']['hits']):
    counter += len(result['hits']['hits'])
    result = es.scroll(scroll_id=scroll_id, scroll="1s")
    scroll_id = result['_scroll_id']
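The same loop can also be wrapped in a generator, so the counting (or any other per-document work) lives outside the paging logic. The following is only a sketch: scroll_hits, page_size and keep_alive are hypothetical names that do not come from the question, and the keep-alive of "2m" is an assumed value chosen so the scroll context is less likely to expire between requests than with "1s".

```python
def scroll_hits(es, index, body, page_size=1000, keep_alive="2m"):
    """Yield every hit of a query by following the scroll cursor page by page."""
    # First page: opens the scroll context on the server.
    result = es.search(index=index, body=body, size=page_size, scroll=keep_alive)
    scroll_id = result["_scroll_id"]
    try:
        # An empty page means the scroll is exhausted.
        while result["hits"]["hits"]:
            for hit in result["hits"]["hits"]:
                yield hit
            # Each call renews the keep-alive and may return a new scroll id.
            result = es.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = result["_scroll_id"]
    finally:
        # Release the server-side scroll context once iteration ends.
        es.clear_scroll(scroll_id=scroll_id)
```

With the es client and es_query from the question, the count would then be counter = sum(1 for _ in scroll_hits(es, "INDEX", es_query)). The elasticsearch client package also provides elasticsearch.helpers.scan, which wraps a similar scroll loop, so a hand-written version is mainly useful when you need control over each page.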