
How to check properties before updating a document in Elasticsearch?

I've already read the official documentation and found no way to do this.

My data arrives in ES from Kafka, and the messages can sometimes be out of order. Until now, each Kafka message was parsed and used directly to insert or update the ES doc with a specific ID. To keep older data from overwriting newer data, I have to check whether the doc with that ID already exists and whether some of its properties meet certain conditions. Only then do I perform the UPDATE (or INSERT).

What I'm doing now is 'search before update'.

Before updating a doc, I search ES for the specific ID (included in the Kafka message). Then I check whether the doc meets the conditions (for example, is its update_time older?). Lastly, I update the doc, with refresh set to true so the index is updated immediately.

What am I worried about?

The check-then-update sequence seems to need transactional guarantees.

  1. If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?

  2. If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?

If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?

That is a possibility, since indexes are refreshed once every second by default. Reducing this value is not recommended, and it is not guaranteed to give you the desired result, because Elasticsearch is NOT designed for this. Note that the refresh interval governs what search requests see; a GET by document ID is real-time by default.

If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?

You can use a script if the number of fields being updated is very limited. Personally, I have found scripts best suited to single-field updates, and even then only for corner cases; they should not be used as a general practice. Go beyond that and you run the same risk as with stored procedures in the RDBMS world: data management becomes fragile, and the system becomes harder to maintain and extend in the long run.
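As a rough illustration of the script approach, the sketch below only builds the JSON body for a scripted update (POST /<index>/_update/<id>); it does not talk to a cluster. It assumes update_time is a numeric timestamp present in every message, and the field values and the helper name build_guarded_update are hypothetical. The no-op value is written as 'noop' here; older documentation shows 'none', so check it against your Elasticsearch version.

```python
import json

# Painless source: apply the incoming fields only if they are newer than
# what is stored; otherwise turn the update into a no-op.
# Assumption: update_time is numeric (e.g. epoch millis) in every document.
SCRIPT_SOURCE = (
    "if (ctx._source.update_time == null "
    "|| ctx._source.update_time < params.doc.update_time) "
    "{ ctx._source.putAll(params.doc) } "
    "else { ctx.op = 'noop' }"
)


def build_guarded_update(doc):
    """Return the request body for a scripted update with an upsert fallback."""
    return {
        "script": {
            "lang": "painless",
            "source": SCRIPT_SOURCE,
            "params": {"doc": doc},
        },
        # Used as the initial document when the ID does not exist yet.
        "upsert": doc,
    }


# Hypothetical message payload.
body = build_guarded_update({"update_time": 1700000000000, "status": "shipped"})
print(json.dumps(body, indent=2))
```

Because the comparison and the write happen inside a single update request on the shard, no separate search is needed before the update.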

Your use case is best suited to the optimistic locking support that Elasticsearch provides out of the box. Take a look at Elasticsearch Versioning Support for full details.

You can use the built-in doc version if concurrency is the only problem you need to solve. If, however, you need more than concurrency control (out-of-order message delivery and the corresponding ES updates), you should use an application- or domain-specific field, because the built-in version will not work as-is.

You can use any application-specific numeric field as a version field and use it for optimistic locking during document updates. If you take this approach, pay special attention to all insert, update and delete operations on that index. Quoting as-is from the versioning support article: "when using external versioning, make sure you always add the current version (and version_type) to any index, update or delete calls. If you forget, Elasticsearch will use its internal system to process that request, which will cause the version to be incremented erroneously."

I recommend evaluating the built-in version first and using it if it fulfills your needs; it will make the overall design much simpler. Consider the application-specific version as a second option if the built-in version does not meet your requirements.
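To make the external-version behaviour concrete, here is a toy in-memory model, not a real Elasticsearch client. It assumes update_time from the Kafka message is used as the external version and that each message carries the full document, so the index API can be used. Against a real cluster the equivalent call would be an index request with the version and version_type=external parameters, and a stale write would come back as an HTTP 409 conflict. The names FakeIndex, VersionConflict and handle_message are hypothetical.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 version conflict Elasticsearch returns."""


class FakeIndex:
    """Toy in-memory model of version_type=external; not a real client."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, source, version):
        current = self._docs.get(doc_id)
        # External versioning accepts a write only when the supplied
        # version is strictly greater than the stored one.
        if current is not None and version <= current[0]:
            raise VersionConflict(f"{doc_id}: {version} <= {current[0]}")
        self._docs[doc_id] = (version, source)

    def get(self, doc_id):
        return self._docs[doc_id][1]


def handle_message(index, msg):
    """Apply one message; return False if it was stale and dropped."""
    try:
        index.index(msg["id"], msg["doc"], version=msg["update_time"])
        return True
    except VersionConflict:
        return False


idx = FakeIndex()
results = [
    # The newer message arrives first, the older one second.
    handle_message(idx, {"id": "42", "update_time": 200, "doc": {"status": "shipped"}}),
    handle_message(idx, {"id": "42", "update_time": 100, "doc": {"status": "created"}}),
]
print(results, idx.get("42"))
```

With this scheme the consumer can treat a conflict as "a newer version is already stored" and skip the message, which removes the search-before-update step.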

  1. If there is only one thread executing synchronously, is it possible that when I process the next message, the doc updated while processing the previous message has not yet been refreshed in ES?

Ad 1. Yes. It is possible to save data in Elasticsearch and, shortly afterwards, receive a stale result (before the index has been refreshed).

  2. If I have several threads consuming Kafka messages, how do I check before updating? Can I use a script to solve this problem?

Ad 2. If you process Kafka messages in several threads, it is best to use business data (e.g. some business IDs) as partition keys in Kafka, so that all messages for the same key are processed in order. Let Kafka distribute the partitions across several consumers (for example, one consumer per thread); do not consume with a single consumer and then fan out to multiple threads, because that loses the per-key ordering.

It seems best to ensure the data is processed in order and then drop the check in Elasticsearch, since that check is not guaranteed to give valid results.
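The reason keys preserve ordering can be sketched as follows: a partition is chosen as a deterministic function of the key, so every message for the same business ID lands in the same partition and is read in order by one consumer. This is only an illustration. Kafka's default partitioner hashes keys with murmur2, while CRC32 is used here as a stand-in, and the partition count and key names are hypothetical.

```python
import zlib

NUM_PARTITIONS = 6  # hypothetical partition count for the topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    # What matters is that the mapping is deterministic per key.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# Hypothetical messages: (business ID used as the key, event).
messages = [("order-1", "created"), ("order-2", "created"), ("order-1", "shipped")]
assignments = [(key, event, partition_for(key)) for key, event in messages]
for row in assignments:
    print(row)
```

Both "order-1" events map to the same partition, so their relative order is kept; events for different keys may be processed in parallel.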


 