
Elasticsearch returning same results at different FROM values

I'm currently looping through 29,000 documents; for each document I add a sub-doc to a nested field and update it. To manage the amount of data I'm dealing with, I'm processing the documents in groups of 10,000 and using the ES size and from options to control where each iteration starts. So, once the first 10,000 are updated, I run another query to fetch the next 10,000, and so on. The problem is that every time I get to the second group, a handful of docs in the batch were already processed in the first 10,000, and when I get to the third batch, it consists entirely of documents that have already been processed, when it should be fetching docs in the 20,000 to 29,000 range.

It seems like I'm hitting some sort of race condition, since sorting or querying by version number achieves nothing. I've also tried flushing and refreshing between queries, still with no luck.

Has anyone had a similar issue?

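In Elasticsearch there is up to a 1-second lag (the default refresh interval) between when something is written and when it becomes visible to search. You can easily create a test to verify this: index a record with id 1, immediately search for it, and it usually won't be in the results yet. (A direct GET by id is real-time by default, so the lag only shows up in searches.)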

What you want to do is use a scroll with the scan search type ("scroll scan") in ES. A scroll keeps track of which records it has already given you, so when you request the next 10,000 you're guaranteed not to get any duplicates.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan
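For illustration, here is a minimal sketch of a scroll loop in Python. It assumes a client object shaped like the official elasticsearch-py client, with search, scroll and clear_scroll methods. The function name scroll_hits, the keep_alive value and the page size are hypothetical choices, not anything from the question. It uses a plain scroll rather than the scan search type, which newer Elasticsearch versions no longer offer.

```python
def scroll_hits(client, index, query, page_size=1000, keep_alive="2m"):
    """Yield every hit matching `query` once, paging with the scroll API.

    `client` is assumed to behave like an elasticsearch-py client.
    """
    # The initial search opens the scroll context and returns the first page.
    resp = client.search(
        index=index,
        body={"query": query},
        scroll=keep_alive,
        size=page_size,
    )
    scroll_id = resp["_scroll_id"]
    try:
        while resp["hits"]["hits"]:
            for hit in resp["hits"]["hits"]:
                yield hit
            # Each follow-up call returns the next page of the same scroll.
            resp = client.scroll(scroll_id=scroll_id, scroll=keep_alive)
            scroll_id = resp["_scroll_id"]
    finally:
        # Release the server-side scroll context when done (or on error).
        client.clear_scroll(scroll_id=scroll_id)
```

With a real client you would iterate over something like scroll_hits(client, "my_index", {"match_all": {}}) and apply the nested-field update to each hit. Updates made during the loop do not change which documents the scroll returns. Depending on the client version, the query may need to be passed as query= instead of body=.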

Note: when you specify the size of your scroll scan, the size you specify is per shard. So if you want chunks of 10,000, you need to specify size = 10,000 / number of shards.
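The per-shard arithmetic can be sketched as a small helper. The function name per_shard_size is hypothetical, and the rounding down is an assumption (so the combined batch never exceeds the target). It only applies where size really is per shard, as with the scan search type described above.

```python
def per_shard_size(batch_size, num_shards):
    """Return the per-shard size that yields roughly `batch_size` hits per page."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Round down so the combined page never exceeds batch_size,
    # but never ask for fewer than one document per shard.
    return max(1, batch_size // num_shards)
```

For example, a 10,000-document batch on an index with 5 shards gives per_shard_size(10000, 5) == 2000.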


 