
Elasticsearch Reindexing race condition

Hello Elasticsearch users/experts,

I am having a bit of trouble understanding the race condition problem with the reindex API of Elasticsearch, and I would like to hear whether anyone has found a solution to it.

I have searched in a lot of places and could not find any clear solution (most of the solutions date back to before the reindex API existed).

As you might know, the (now) standard way of reindexing documents (after changing the mapping, for example) is to use an alias. Suppose the alias points to "old_index". We create a new index called "new_index" with the new mapping, call the reindex API to reindex the documents from "old_index" to "new_index", and then switch the alias to point to "new_index" (removing the alias pointer to "old_index"). This seems to be the standard way of reindexing, and it is what I have seen on almost all recent websites I visited.
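For reference, I understand the switch itself can be done atomically with the _aliases API, so searches never hit a moment where the alias points to nothing ("my_alias" below is just a placeholder for whatever the alias is actually called):

POST _aliases
{
  "actions": [
    { "remove": { "index": "old_index", "alias": "my_alias" } },
    { "add": { "index": "new_index", "alias": "my_alias" } }
  ]
}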

My questions are the following, assuming I use this method, do not want any downtime (so users should still be able to search documents), and still want to be able to index documents into Elasticsearch while the reindexing process is happening:

  1. If documents keep arriving while the reindexing process is running (which will probably take a long time), how does the reindexing process ensure that such a document is ingested into the old index (so it can be searched while reindexing is in progress) but is still correctly reindexed into the new index?
  2. If a document in the old index is modified after it has already been reindexed (copied to the new index), while the reindexing process is still running, how does Elasticsearch ensure that this modification is also reflected in the new index?
  3. (Similar to 2.) If a document in the old index is deleted after it has already been reindexed (copied to the new index), while the reindexing process is still running, how does Elasticsearch ensure that this deletion is also reflected in the new index?

Basically, in a scenario where it is not acceptable to lose or mis-index a single document, how would one proceed to make sure the reindexing completes without any of the above problems?

Does anyone have any ideas? And if there is no solution without downtime, how would we proceed with the least amount of downtime in that case?

Thanks in advance!

Apologies if it's too verbose, but here are my two cents:

If documents keep arriving while the reindexing process is running (which will probably take a long time), how does the reindexing process ensure that such a document is ingested into the old index (so it can be searched while reindexing is in progress) but is still correctly reindexed into the new index?

While a reindex is running from source to destination, the alias would still be (and must still be) pointing at source_index. All modifications to this index happen independently of the reindex, and those updates/deletes take effect there immediately.

Let's say the state of source_index changes from time t to time t+1.

If you started a reindexing job into dest_index at time t, it will still consume the snapshot of source_index taken at t. You need to run the reindexing job again to get the latest data of source_index, i.e. the data at t+1, into dest_index.

Ingestion into source_index and the copy from source_index to dest_index are two independent processes.
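To make that concrete, the copy can be started as a background task so that normal indexing through the alias keeps going while it runs. This is just a minimal sketch using the reindex and tasks APIs, with <task_id> standing in for the task id returned by the first call:

POST _reindex?wait_for_completion=false
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index"
  }
}

GET _tasks/<task_id>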

A reindexing job alone will never guarantee consistency between source_index and dest_index.

If a document in the old index is modified after it has already been reindexed (copied to the new index), while the reindexing process is still running, how does Elasticsearch ensure that this modification is also reflected in the new index?

It won't be reflected in the new index, because the reindex works off the snapshot of source_index taken at time t.

You would need to perform the reindexing again. The general approach for this is to have a scheduler that re-runs the reindexing process every few hours.

You can propagate updates/deletes from source_index every few minutes (if you are using a scheduler) or in real time (if you are using an event-based approach).

However, schedule a full reindex (from source_index to dest_index) only once or twice a day, as it is an expensive process.
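As a sketch of what such a scheduled incremental pass could look like: assuming the documents carry a timestamp field that is set on every write (last_updated here is a hypothetical field name), each run can reindex only what changed recently instead of copying everything:

POST _reindex
{
  "source": {
    "index": "source_index",
    "query": {
      "range": {
        "last_updated": {
          "gte": "now-1d/d"
        }
      }
    }
  },
  "dest": {
    "index": "dest_index"
  }
}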

(Similar to 2.) If a document in the old index is deleted after it has already been reindexed (copied to the new index), while the reindexing process is still running, how does Elasticsearch ensure that this deletion is also reflected in the new index?

Again, you would need to run a new reindexing job. Note, though, that reindexing only copies documents; it does not remove documents that already exist in dest_index, so a deletion in source_index is not propagated just by reindexing again. To propagate deletes you would typically rebuild dest_index from scratch or delete those documents in dest_index explicitly.

version_type: external

Just as a side note, one interesting thing you can do during reindexing is to use version_type: external, which ensures that only documents from source_index that are missing in dest_index, or that have a newer version than the copy in dest_index, are written to dest_index.

You can refer to this LINK for more info on this.

POST _reindex
{
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index",
    "version_type": "external"
  }
}
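One practical detail: with version_type: external, a document that already exists in dest_index with the same or a higher version causes a version conflict, and by default a single conflict aborts the whole reindex request. If such documents should simply be skipped, conflicts: proceed can be added on top of the request above:

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "source_index"
  },
  "dest": {
    "index": "dest_index",
    "version_type": "external"
  }
}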
