Data pipeline design considerations (Filebeat, Kafka, Logstash, Elasticsearch)
I'm trying to flush out some issues with the following data pipeline, and was hoping to get some opinions on any vulnerabilities in this design, which uses Filebeat, Kafka, Logstash, and Elasticsearch.
Find the most recent location for a given user, with a maximum of 45 seconds of lag time.
We have a Python application that continuously logs the latest location for each user.
# log.json
{"user_id": 1, "location": "San Francisco, CA"}
{"user_id": 1, "location": "New York City, NY"}
{"user_id": 2, "location": "Chicago, IL"}
{"user_id": 1, "location": "Portland, OR"}
The idea is to write this data to Elasticsearch (a datastore we have good support for within our company), and use the "user_id" as the document ID, so that if I perform these 2 inserts back to back:
{"user_id": 1, "location": "San Francisco, CA"}
{"user_id": 1, "location": "New York City, NY"}
Then querying Elasticsearch for "user_id" == 1 will return the latest location.
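To illustrate the last-write-wins semantics of indexing by document ID, here is a toy in-memory model (a plain dict standing in for the index, not an actual Elasticsearch client call):

```python
# Toy model of Elasticsearch's index-by-document-ID semantics:
# indexing a second document with the same _id overwrites the first,
# so a lookup by user_id always returns the most recent write.
index = {}

def index_doc(doc):
    index[doc["user_id"]] = doc  # same _id -> overwrite

index_doc({"user_id": 1, "location": "San Francisco, CA"})
index_doc({"user_id": 1, "location": "New York City, NY"})

print(index[1]["location"])  # -> New York City, NY
```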
Filebeat -> Kafka -> Consumer (business logic) -> Kafka -> Logstash -> Elasticsearch
Each event also carries a document_version:
{"user_id": 1, "document_version": 1, "location": "San Francisco, CA"}
{"user_id": 1, "document_version": 2, "location": "New York City, NY"}
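One way the document_version field can protect against reordering is optimistic version checking: apply a write only if its version is higher than what is already stored. Elasticsearch supports this natively via version_type=external; below is a toy sketch of the idea, again with a dict standing in for the index:

```python
# Sketch of "external version" guarding: a write is applied only if its
# document_version is higher than the stored one, so a late-arriving
# stale write is rejected rather than overwriting newer data.
store = {}

def upsert(doc):
    current = store.get(doc["user_id"])
    if current is None or doc["document_version"] > current["document_version"]:
        store[doc["user_id"]] = doc

# Deliveries arrive out of order:
upsert({"user_id": 1, "document_version": 2, "location": "New York City, NY"})
upsert({"user_id": 1, "document_version": 1, "location": "San Francisco, CA"})

print(store[1]["location"])  # the stale version-1 write was rejected
```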
Is there any situation that could arise where the writes happen out of order?
Assuming you control the logging code, you could look at having the applications log directly into Kafka. Then, using KSQL or Kafka Streams, you can find your data using a 45-second time window, write the results back to another Kafka topic, and finally use Kafka Connect's Elasticsearch sink connector (or Logstash) to write to Elasticsearch.

I don't know how flexible the Filebeat Kafka output is, but I think you need a "raw" topic, then subscribe to that one, "repartition" it into another one, and do your output processing after that.
You make events within a Kafka partition ordered by choosing your keys. For example, key by user ID, and then all events for any given user end up ordered in Kafka.
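A toy illustration of key-based partitioning (Kafka's default partitioner actually uses murmur2 over the serialized key bytes; MD5 here is just to show that any deterministic hash routes the same key to the same partition, preserving per-user ordering):

```python
import hashlib

NUM_PARTITIONS = 6

def partition_for(key: str) -> int:
    # Deterministic hash of the key, mapped onto the partition count.
    # Same key -> same hash -> same partition -> events stay ordered
    # for that key.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

assert partition_for("user-1") == partition_for("user-1")
```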