
Elasticsearch returning same results at different FROM values

Original post: 2015-06-29 22:52:50. Tags: node.js, elasticsearch

I'm currently looping through 29,000 documents; for each document I add a sub-document to a nested field and then update it. To manage the amount of data I'm dealing with, I process the documents in batches of 10,000 and use the ES size and from options to control where each batch starts. Once the first 10,000 are updated, I run another query to fetch the next 10,000, and so on. The problem is that the second batch always contains a handful of docs that were already processed in the first 10,000, and the third batch consists entirely of documents that have already been processed, when it should be fetching docs in the 20,000 to 29,000 range.

It seems like I'm hitting some sort of race condition, since sorting or querying by version number makes no difference. I've also tried flushing and refreshing between queries, still with no luck.

Has anyone had a similar issue?

1 Answer

In Elasticsearch there is a lag of up to 1 second (the default refresh interval) between when something is written and when it shows up in search results. You can easily write a test to verify this: index a record with id 1, immediately search for it, and the search will come back empty.

What you want to do is use a scroll with the scan search type ("SCROLL SCAN") in ES. A scroll works against a snapshot of the index taken when the scroll is opened and keeps track of which records it has already given you, so when you request the next 10,000 you're guaranteed not to get any duplicates.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html#scroll-scan
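For illustration, here is a minimal sketch of what a scroll loop could look like in Node.js. It assumes a client object shaped like the legacy `elasticsearch` npm client, whose `search()` and `scroll()` return promises resolving to the response body; `scrollAll`, `onBatch`, `my_index` and `updateBatch` are hypothetical names, not part of any library. Treat it as a starting point under those assumptions rather than a definitive implementation.

```javascript
// Sketch: drain a scroll cursor batch by batch.
// Assumption: client.search() and client.scroll() return promises that
// resolve to the response body ({ _scroll_id, hits: { hits: [...] } }).
async function scrollAll(client, searchParams, onBatch, keepAlive = '1m') {
  // Open the scroll. With searchType 'scan' the first response carries
  // only a scroll id and no hits; a plain scroll returns the first page.
  let response = await client.search({ ...searchParams, scroll: keepAlive });
  let total = 0;

  if (response.hits.hits.length > 0) {
    await onBatch(response.hits.hits);
    total += response.hits.hits.length;
  }

  // Keep pulling until a scroll call comes back empty.
  while (true) {
    response = await client.scroll({
      scrollId: response._scroll_id, // always pass the most recent id
      scroll: keepAlive,
    });
    const hits = response.hits.hits;
    if (hits.length === 0) break;
    await onBatch(hits);
    total += hits.length;
  }
  return total;
}

// Hypothetical usage with a real client (index name and callback are made up):
// scrollAll(client, { index: 'my_index', searchType: 'scan', size: 2000,
//                     body: { query: { match_all: {} } } }, updateBatch);
```

Because every call after the first goes through the scroll cursor instead of a fresh from/size query, updates made by `onBatch` while the loop runs do not shift which documents the later batches contain.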

Note: when you specify the size of your scroll scan, that size is per shard. So if you want chunks of 10,000, you need to specify size = 10,000 / number of shards.
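The arithmetic in the note can be sketched as a small helper. `perShardSize` is a hypothetical name, and the shard count of 5 in the example is only an assumption; use the actual number of primary shards of your index.

```javascript
// Per-shard `size` for a scan-type scroll so that each batch holds
// roughly `batchSize` documents in total.
function perShardSize(batchSize, numberOfShards) {
  if (!Number.isInteger(numberOfShards) || numberOfShards < 1) {
    throw new RangeError('numberOfShards must be a positive integer');
  }
  // Round up so that size * shards covers the requested batch size.
  return Math.ceil(batchSize / numberOfShards);
}

console.log(perShardSize(10000, 5)); // 2000
```

When the batch size does not divide evenly (for example 10,000 over 3 shards), rounding up yields slightly more than 10,000 documents per batch rather than fewer.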