I'm trying to flush out some issues with the following data pipeline, and was hoping to get some opinions on any vulnerabilities in this design (which utilizes Filebeat, Kafka, Logstash, and Elasticsearch).
Goal: find the most recent location for a given user, with a maximum of 45 seconds of lag time.
We have a Python application that continuously logs out the latest location for a user.
# log.json
{"user_id": 1, "location": "San Francisco, CA"}
{"user_id": 1, "location": "New York City, NY"}
{"user_id": 2, "location": "Chicago, IL"}
{"user_id": 1, "location": "Portland, OR"}
The idea is to write this data to Elasticsearch (a datastore we have good support for within our company), and use the "user_id" as the document ID so that if I perform these 2 inserts back to back:
{"user_id": 1, "location": "San Francisco, CA"}
{"user_id": 1, "location": "New York City, NY"}
Then querying Elasticsearch for "user_id" == 1 will return the latest location.
Filebeat -> Kafka -> Consumer (business logic) -> Kafka -> Logstash -> Elasticsearch
{"user_id": 1, "document_version": 1, "location": "San Francisco, CA"}
{"user_id": 1, "document_version": 2, "location": "New York City, NY"}
Is there any situation that can arise where the writes happen out of order?
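To make the question concrete, here is a minimal sketch of what an out-of-order arrival would do, assuming "document_version" is passed to Elasticsearch as an external version (version_type=external), under which a write is accepted only if its version is strictly greater than the stored one. The post does not say the field is used this way, and `VersionedStore` and `apply_writes` are hypothetical in-memory stand-ins, not Elasticsearch client calls.

```python
class VersionConflict(Exception):
    """Raised when a write carries a version that is not newer than the stored one."""


class VersionedStore:
    """In-memory stand-in for an index keyed by document ID (hypothetical helper)."""

    def __init__(self):
        self._docs = {}  # doc_id -> (version, source)

    def index(self, doc_id, version, source):
        current = self._docs.get(doc_id)
        # Accept only strictly newer versions, mirroring external versioning.
        if current is not None and version <= current[0]:
            raise VersionConflict(f"doc {doc_id}: {version} <= {current[0]}")
        self._docs[doc_id] = (version, source)

    def get(self, doc_id):
        return self._docs[doc_id][1]


def apply_writes(store, events):
    """Apply events in arrival order; return the events rejected as stale."""
    rejected = []
    for event in events:
        try:
            store.index(event["user_id"], event["document_version"], event)
        except VersionConflict:
            rejected.append(event)
    return rejected


# Version 2 arrives before version 1 (out-of-order delivery).
events = [
    {"user_id": 1, "document_version": 2, "location": "New York City, NY"},
    {"user_id": 1, "document_version": 1, "location": "San Francisco, CA"},
]
store = VersionedStore()
rejected = apply_writes(store, events)
print(store.get(1)["location"])  # New York City, NY
```

Under that assumption, a late write with an older version is rejected rather than overwriting the newer location, so the remaining risk is in how the version numbers themselves get assigned.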
Assuming you control the logging code, you could have the application log directly to Kafka. Then, using KSQL or Kafka Streams, you can find your data within a 45-second time window, write the results back to another Kafka topic, and finally use Kafka Connect's Elasticsearch sink connector (or Logstash) to write to Elasticsearch. I don't know how flexible the Filebeat Kafka output is, but I think you need a "raw" topic, then subscribe to it, "repartition" it into another topic, and do your output processing after that.
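As a rough illustration of the windowing idea, here is a plain-Python sketch (not KSQL or Kafka Streams code) that keeps the newest event per user within each 45-second tumbling window. The "ts" field (event time in seconds) is an assumption, since the sample log lines carry no timestamp, and `latest_per_window` is a hypothetical helper name.

```python
WINDOW_SECONDS = 45


def latest_per_window(events, window_seconds=WINDOW_SECONDS):
    """Return {(window_start, user_id): event}, keeping the newest event per user per window."""
    latest = {}
    for event in events:
        # Align the event time to the start of its tumbling window.
        window_start = (event["ts"] // window_seconds) * window_seconds
        key = (window_start, event["user_id"])
        kept = latest.get(key)
        if kept is None or event["ts"] >= kept["ts"]:
            latest[key] = event
    return latest


# "ts" values are made up for illustration.
events = [
    {"user_id": 1, "ts": 0, "location": "San Francisco, CA"},
    {"user_id": 1, "ts": 10, "location": "New York City, NY"},
    {"user_id": 2, "ts": 20, "location": "Chicago, IL"},
    {"user_id": 1, "ts": 50, "location": "Portland, OR"},
]
result = latest_per_window(events)
for (window_start, user_id), event in sorted(result.items()):
    print(window_start, user_id, event["location"])
```

Each window emits at most one record per user, which is what would be written to the second topic before going on to Elasticsearch.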
Kafka keeps events in order within a partition, so you control ordering through your choice of key. For example, key by user ID, and all events for any given user end up in the same partition, in order.
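A small simulation of keyed partitioning, as a sketch only: `zlib.crc32` stands in for Kafka's actual partitioner hash, and the partition count and the helper names `partition_for` and `produce` are made up for illustration. The point is just that a stable hash of the key sends every event for a user to one partition, where append order is preserved.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a stable hash (stand-in for Kafka's partitioner)."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def produce(events, num_partitions=NUM_PARTITIONS):
    """Append each event to the partition chosen by its user_id key."""
    partitions = defaultdict(list)
    for event in events:
        partitions[partition_for(event["user_id"], num_partitions)].append(event)
    return partitions


events = [
    {"user_id": 1, "location": "San Francisco, CA"},
    {"user_id": 1, "location": "New York City, NY"},
    {"user_id": 2, "location": "Chicago, IL"},
    {"user_id": 1, "location": "Portland, OR"},
]
partitions = produce(events)

# All of user 1's events sit in a single partition, in the order they were produced.
user1 = [e["location"] for p in partitions.values() for e in p if e["user_id"] == 1]
print(user1)  # ['San Francisco, CA', 'New York City, NY', 'Portland, OR']
```

Note that ordering is only guaranteed per key; events for different users may be consumed in any relative order.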