
Data pipeline design considerations (Filebeat, Kafka, Logstash, Elasticsearch)

I'm trying to flush out issues with the following data pipeline and was hoping to get some opinions on any weak points in this design (which uses Filebeat, Kafka, Logstash, and Elasticsearch).

Goal

Find the most recent location for a given user, with a maximum of 45 seconds of lag time.

Idea

We have a Python application that continuously logs the latest location for each user.

# log.json
{"user_id": 1, "location": "San Francisco, CA"}
{"user_id": 1, "location": "New York City, NY"}
{"user_id": 2, "location": "Chicago, IL"}
{"user_id": 1, "location": "Portland, OR"}

The idea is to write this data to Elasticsearch (a datastore we have good support for within our company), and use the "user_id" as the document ID so that if I perform these 2 inserts back to back:

{"user_id": 1, "location": "San Francisco, CA"}
{"user_id": 1, "location": "New York City, NY"}

Then querying Elasticsearch for "user_id" == 1 will return the latest location.
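
For illustration, here is a minimal sketch of that upsert-by-ID behaviour using the elasticsearch-py client; the index name "user_locations" and the local cluster URL are assumptions, not part of the actual setup.

# upsert_by_id.py - sketch only (assumes elasticsearch-py 8.x and a local cluster)
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

events = [
    {"user_id": 1, "location": "San Francisco, CA"},
    {"user_id": 1, "location": "New York City, NY"},
]

# Each write uses user_id as the document _id, so the second write replaces
# the first document instead of creating a new one.
for event in events:
    es.index(index="user_locations", id=str(event["user_id"]), document=event)

# A GET by _id is realtime and now returns only the latest location for user 1.
print(es.get(index="user_locations", id="1")["_source"])  # New York City, NY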

Current Pipeline

Filebeat -> Kafka -> Consumer (business logic)-> Kafka -> Logstash -> Elasticsearch

Known limitations:

  • Message order must be preserved through the entire pipeline (this means Filebeat must run with a single harvester)
  • Sensitive to lag during multiple parts of the pipeline

Questions:

  • Are there additional limitations to the above design that I haven't considered?
  • Since we're explicitly setting a document_id (the "user_id" of each record), all writes for a given user should be routed to the same Elasticsearch shard. But even if these records arrive at that shard in the following order, with explicit document versions and version_type=external_gte specified (note: Logstash uses the bulk API):

{"user_id": 1, "document_version": 1, "location": "San Francisco, CA"}
{"user_id": 1, "document_version": 2, "location": "New York City, NY"}

Is there any situation that can arise where the writes happen out of order?
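
To make the external_gte semantics concrete, here is a hedged sketch with the elasticsearch-py client showing what happens when those two writes arrive out of order; the index name and client version are assumptions.

# external_gte_sketch.py - sketch only (assumes elasticsearch-py 8.x and a local cluster)
from elasticsearch import Elasticsearch
from elasticsearch.exceptions import ConflictError

es = Elasticsearch("http://localhost:9200")

# The newer record (version 2) happens to be indexed first.
es.index(index="user_locations", id="1",
         document={"user_id": 1, "location": "New York City, NY"},
         version=2, version_type="external_gte")

# The older record (version 1) arrives late. With external_gte, a version lower
# than the stored one is rejected with a 409 version conflict, so the stale
# location never overwrites the newer one.
try:
    es.index(index="user_locations", id="1",
             document={"user_id": 1, "location": "San Francisco, CA"},
             version=1, version_type="external_gte")
except ConflictError:
    print("stale write rejected; document still holds version 2")

When the same writes go through the bulk API (as Logstash's output does), the conflict shows up as a 409 error on the individual bulk item rather than as an exception.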

Assuming you control the logging code, you could look at having the applications log directly into Kafka. Then, using KSQL or Kafka Streams, you can compute the latest location per user within a 45-second time window, write that data back to another Kafka topic, and finally use Kafka Connect's Elasticsearch sink connector (or Logstash) to write to Elasticsearch. I don't know how flexible the Filebeat Kafka output is, but I think you need a "raw" topic, subscribe to that one, "repartition" it into another topic, and then do your output processing after that.
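
As a rough sketch of the "log directly into Kafka" idea, here is what the producing side could look like with the kafka-python client; the broker address and the topic name "locations_raw" are assumptions, and the windowing/repartitioning step described above would still be done in KSQL or Kafka Streams.

# produce_locations.py - sketch only (kafka-python; broker and topic names are illustrative)
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: str(k).encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def log_location(user_id, location):
    # Keying by user_id means every event for a given user lands in the same
    # partition, so Kafka preserves per-user ordering.
    producer.send("locations_raw", key=user_id,
                  value={"user_id": user_id, "location": location})

log_location(1, "San Francisco, CA")
log_location(1, "New York City, NY")
log_location(2, "Chicago, IL")
producer.flush()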

You make events within a Kafka partition ordered by choosing your keys. For example, key by user ID; then all events for any given user end up in the same partition and stay ordered in Kafka.
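
If the events still arrive via Filebeat (and therefore land in a "raw" topic without a meaningful key), a small repartitioning consumer can re-key them by user ID, which is roughly the step KSQL or Kafka Streams would handle for you. A hedged sketch with kafka-python follows; the topic names, the broker address, and the assumption that each message value is the plain JSON log record are all illustrative.

# repartition_by_user.py - sketch only (kafka-python; topic names and broker are illustrative)
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "locations_raw",
    bootstrap_servers="localhost:9092",
    group_id="location-repartitioner",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    key_serializer=lambda k: str(k).encode("utf-8"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    # Same key -> same partition -> per-user ordering on the keyed topic.
    producer.send("locations_by_user", key=event["user_id"], value=event)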
