
Data pipeline design considerations (Filebeat, Kafka, Logstash, Elasticsearch)

I'm trying to flush out some issues with the following data pipeline, and was hoping to get some opinions on any vulnerabilities in this design (which utilizes Filebeat, Kafka, Logstash, and Elasticsearch).

Goal

Find the most recent location for a given user, with a maximum of 45 seconds of lag time.

Idea

We have a Python application that continuously logs the latest location for each user.

# log.json
{"user_id": 1, "location": "San Francisco, CA"}
{"user_id": 1, "location": "New York City, NY"}
{"user_id": 2, "location": "Chicago, IL"}
{"user_id": 1, "location": "Portland, OR"}

The idea is to write this data to Elasticsearch (a datastore we have good support for within our company), and use the "user_id" as the document ID so that if I perform these 2 inserts back to back:

{"user_id": 1, "location": "San Francisco, CA"}
{"user_id": 1, "location": "New York City, NY"}

Then querying Elasticsearch for "user_id" == 1 will return the latest location.
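
For illustration, here is a minimal sketch of those upserts with the Python Elasticsearch client (elasticsearch-py 8.x assumed); the index name "user_locations" and the local cluster address are assumptions for the example, not part of the original pipeline:

# Sketch (not from the original post): index each record with user_id as the
# document ID, so a later write for the same user replaces the earlier one.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

records = [
    {"user_id": 1, "location": "San Francisco, CA"},
    {"user_id": 1, "location": "New York City, NY"},
]

for record in records:
    es.index(index="user_locations", id=record["user_id"], document=record)

# The document for user 1 now holds the most recent location.
print(es.get(index="user_locations", id=1)["_source"])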

Current Pipeline

Filebeat -> Kafka -> Consumer (business logic) -> Kafka -> Logstash -> Elasticsearch

Known limitations:

  • Message order must be preserved through the entire pipeline (this means Filebeat must run with a single harvester)
  • The pipeline is sensitive to lag at multiple stages

Questions:

  • Are there additional limitations to the above design that I haven't considered?
  • Since we're explicitly using a document_id (set to the "user_id" of each record), writes should be sent to the same Elasticsearch shard. But even if these records are sent to the same ES shard in the following order, with explicit document versions and external_gte specified (note: Logstash uses the bulk API):


{"user_id": 1, "document_version": 1, "location": "San Francisco, CA"}
{"user_id": 1, "document_version": 2, "location": "New York City, NY"}

Is there any situation where the writes could still be applied out of order?
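
As a rough illustration of how external versioning behaves (a sketch using single index calls rather than Logstash's bulk requests; the index name and broker-less local setup are assumptions), a stale write carrying a lower external version is rejected with a version conflict rather than overwriting the newer document:

# Sketch: external_gte versioning. A write whose version is lower than the
# stored version fails with a 409 conflict, so an out-of-order write cannot
# clobber the newer location. Index name is made up for the example.
from elasticsearch import Elasticsearch, ConflictError

es = Elasticsearch("http://localhost:9200")

# The newer event happens to be applied first.
es.index(index="user_locations", id=1, version=2, version_type="external_gte",
         document={"user_id": 1, "location": "New York City, NY"})

# The older event arrives afterwards with a lower external version.
try:
    es.index(index="user_locations", id=1, version=1, version_type="external_gte",
             document={"user_id": 1, "location": "San Francisco, CA"})
except ConflictError:
    print("stale write rejected; latest location preserved")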

Assuming you control the logging code, you could look at having the applications log directly into Kafka. Then, using KSQL or Kafka Streams, you can process your data with a 45-second time window, write the results back to another Kafka topic, and finally use Kafka Connect's Elasticsearch sink connector (or Logstash) to write to Elasticsearch. I don't know how flexible the Filebeat Kafka output is, but I think you need a "raw" topic, then subscribe to that one, "repartition" it into another one, and do your output processing after that.

You keep events in order within a Kafka partition by choosing the record keys appropriately. For example, if you key by user ID, all events for any given user land in the same partition and therefore stay ordered in Kafka.
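
As a rough sketch of that keying idea (assuming the confluent-kafka Python client, a local broker, and a topic name "user-locations-raw", none of which come from the original answer), producing with user_id as the key routes all of a user's events to the same partition:

# Sketch: produce location events directly to Kafka, keyed by user_id so that
# each user's events land in one partition and stay in order. Topic name and
# broker address are assumptions for illustration.
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

events = [
    {"user_id": 1, "location": "San Francisco, CA"},
    {"user_id": 1, "location": "New York City, NY"},
    {"user_id": 2, "location": "Chicago, IL"},
]

for event in events:
    producer.produce(
        topic="user-locations-raw",
        key=str(event["user_id"]),  # same key -> same partition -> ordered
        value=json.dumps(event).encode("utf-8"),
    )

producer.flush()  # wait for delivery before exiting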
