

Reliable fire-n-forget Kafka producer implementation strategy

I'm in the middle of a first-mile problem with Kafka. Everybody deals with partitioning, etc., but how do you handle the first mile?

My system consists of many applications producing events, distributed across nodes. I need to deliver these events to a set of applications acting as consumers in a reliable/fail-safe way. The messaging system of choice is Kafka (due to its log nature), but it's not set in stone.

The events should be propagated in a decoupled, fire-and-forget manner as much as possible. This means the producer should be fully responsible for reliably delivering its messages, and the apps producing events shouldn't have to worry about event delivery at all.

The producer's reliability scheme has to account for:

  • box connection outage - during an outage the producer can't access the network at all; the Kafka cluster is thus not reachable
  • box restart - both the producer and the event-producing app restart (independently); the producer should persist in-flight messages (during retrying, batching, etc.)
  • internal Kafka exceptions - message size too large, serialization exception, etc.

No library I've examined so far covers these cases. Is there a suggested strategy for solving this?

I know there are retriable and non-retriable errors during the producer's send(). For the retriable ones, the library usually handles everything internally. However, a non-retriable error ends with an exception in the async callback...
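For context, this is roughly how the two classes of errors surface in the Java client's async callback - a minimal sketch only; the events topic and the println calls are placeholders for real handling logic:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RetriableException;

import java.util.Properties;

public class SendErrorDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("events", "some-key", "some-value"); // placeholder topic
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    System.out.println("ACKed at offset " + metadata.offset());
                } else if (exception instanceof RetriableException) {
                    // transient failure (e.g. not enough replicas, broker briefly unreachable);
                    // the client has already retried internally before giving up here
                    System.err.println("Retriable failure: " + exception);
                } else {
                    // permanent failure (e.g. RecordTooLargeException); retrying will not help
                    System.err.println("Non-retriable failure: " + exception);
                }
            });
            producer.flush();
        }
    }
}
```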

Should I blindly replay these to infinity? For network outages that should work, but what about internal Kafka errors - say, a message that is too large? There might be a DeadLetterQueue-like mechanism + replay. However, how do I deal with the message count...
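One possible shape of that DeadLetterQueue-like idea is bounded retries plus a parking topic for permanently failed records. In the sketch below the events.dlq topic, the x-attempts header and MAX_ATTEMPTS are purely illustrative, and a DLQ hosted in the same cluster only helps with per-message failures, not a full network outage:

```java
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.errors.RetriableException;

import java.nio.charset.StandardCharsets;

// Sketch: bounded replay for retriable errors, dead-letter routing for the rest.
public class DeadLetterPolicy {
    private static final String DLQ_TOPIC = "events.dlq";     // illustrative name
    private static final String ATTEMPTS_HEADER = "x-attempts"; // illustrative name
    private static final int MAX_ATTEMPTS = 5;                  // illustrative bound

    private final Producer<String, String> producer;

    public DeadLetterPolicy(Producer<String, String> producer) {
        this.producer = producer;
    }

    public void sendWithPolicy(ProducerRecord<String, String> record, int attempt) {
        producer.send(record, (metadata, exception) -> {
            if (exception == null) {
                return; // delivered, nothing to do
            }
            if (exception instanceof RetriableException && attempt < MAX_ATTEMPTS) {
                // replay a bounded number of times instead of "to infinity"
                sendWithPolicy(record, attempt + 1);
            } else {
                // park the message for offline inspection / manual replay
                ProducerRecord<String, String> dead =
                        new ProducerRecord<>(DLQ_TOPIC, record.key(), record.value());
                dead.headers().add(ATTEMPTS_HEADER,
                        Integer.toString(attempt).getBytes(StandardCharsets.UTF_8));
                producer.send(dead);
            }
        });
    }
}
```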

About persistence - a lightweight DB backend should solve this: just create a persistent queue and then remove the entries that have already been sent/ACKed. However, I'm afraid that if it were this simple it would have been implemented in the standard Kafka libraries long ago. Performance would probably suffer.
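A very rough sketch of that "persist, send, delete on ACK" idea follows. The in-memory map only stands in for whatever durable local store (SQLite, RocksDB, plain files) would actually be used so that messages survive a box restart, and the class and method names are made up for illustration:

```java
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

public class PersistentSendQueue {
    private final Producer<String, String> producer;
    // Placeholder for a durable store; in-memory only to keep the sketch self-contained.
    private final ConcurrentMap<String, ProducerRecord<String, String>> store = new ConcurrentHashMap<>();

    public PersistentSendQueue(Producer<String, String> producer) {
        this.producer = producer;
    }

    public void enqueue(ProducerRecord<String, String> record) {
        String id = UUID.randomUUID().toString();
        store.put(id, record);          // 1. persist before handing off to Kafka
        producer.send(record, (metadata, exception) -> {
            if (exception == null) {
                store.remove(id);       // 2. drop the entry once the broker has ACKed
            }
            // 3. on failure the entry stays in the store; a background task can
            //    re-send everything still present (e.g. after a network outage)
        });
    }

    public void replayPending() {
        store.forEach((id, record) -> producer.send(record, (metadata, exception) -> {
            if (exception == null) {
                store.remove(id);
            }
        }));
    }
}
```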

Seeing things like KAFKA-3686 or KAFKA-1955 makes me a bit worried.

Thanks in advance.

We have a production system whose primary use case is reliable message delivery. I can't go into much detail, but I can share a high-level design of how we achieve this. Note that this system guarantees "at least once delivery" messaging semantics.

(high-level architecture diagram)

Source

  • First we designed a message schema, and every message sent to this system must follow it.
  • Then we write the message to a MySQL message table, sharded by date, with a field marking whether the message has been delivered or not.
  • We have an app constantly polling the DB for rows marked un-delivered; it picks up a row, constructs the message and sends it to the load balancer. This is a blocking call, and the row is only updated to delivered when the call returns 200. In case of a 5xx, the app retries the message with a sleep back-off. You can also make the number of retries configurable as per your needs.

Each source system maintains its own polling app and DB. A minimal sketch of the polling loop is shown below.
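The following is only an illustration of what such a loop might look like; the table name, columns, JDBC URL and load-balancer URL are assumptions (the answer only states that the table is sharded by date and carries a delivered flag), and the MySQL JDBC driver is assumed to be on the classpath:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Assumed schema (illustrative):
//   CREATE TABLE message_20240101 (
//     id BIGINT PRIMARY KEY, payload TEXT, delivered TINYINT DEFAULT 0
//   );
public class OutboxPoller {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        try (Connection db = DriverManager.getConnection("jdbc:mysql://localhost/outbox", "user", "pass")) {
            while (true) {
                try (PreparedStatement select = db.prepareStatement(
                        "SELECT id, payload FROM message_20240101 WHERE delivered = 0 LIMIT 100");
                     ResultSet rows = select.executeQuery()) {
                    while (rows.next()) {
                        long id = rows.getLong("id");
                        HttpRequest request = HttpRequest.newBuilder()
                                .uri(URI.create("http://producer-lb.internal/produce")) // placeholder LB URL
                                .POST(HttpRequest.BodyPublishers.ofString(rows.getString("payload")))
                                .build();
                        // blocking call; only a 2xx marks the row as delivered
                        HttpResponse<String> response =
                                http.send(request, HttpResponse.BodyHandlers.ofString());
                        if (response.statusCode() / 100 == 2) {
                            try (PreparedStatement update = db.prepareStatement(
                                    "UPDATE message_20240101 SET delivered = 1 WHERE id = ?")) {
                                update.setLong(1, id);
                                update.executeUpdate();
                            }
                        } else {
                            Thread.sleep(1_000); // naive back-off; the row stays un-delivered and is retried
                        }
                    }
                }
                Thread.sleep(500); // poll interval
            }
        }
    }
}
```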

Producer Array

  • This is basically an array of machines behind a load balancer, waiting for incoming messages and producing them to the Kafka cluster.
  • We maintain 3 replicas of each topic, and in the producer config we keep acks = -1, which is very important for your fire-and-forget requirement (see the config sketch after this list). As per the docs:

acks=all This means the leader will wait for the full set of in-sync replicas to acknowledge the record. This guarantees that the record will not be lost as long as at least one in-sync replica remains alive. This is the strongest available guarantee. This is equivalent to the acks=-1 setting.

  • As I said, producing is a blocking call, and it will return 2xx if the message is produced successfully across all 3 replicas; 4xx if the message doesn't meet the schema requirements; 5xx if the Kafka broker threw some exception.
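For illustration, a blocking acks=all produce with the Java client could look like the sketch below; the broker addresses and topic name are placeholders, and the HTTP front end that translates the outcome into 2xx/4xx/5xx is omitted:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.Properties;

public class BlockingAckProducer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092,broker2:9092"); // placeholders
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringSerializer");
        // acks=all is equivalent to acks=-1; with replication factor 3 and an appropriate
        // min.insync.replicas on the topic, the record sits on multiple brokers before the send completes
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (Producer<String, String> producer = new KafkaProducer<>(props)) {
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("events", "key", "value"); // placeholder topic
            // get() makes the send blocking: it returns once the in-sync replicas have
            // acknowledged, or throws, which the HTTP layer would map to a 5xx
            RecordMetadata metadata = producer.send(record).get();
            System.out.println("Stored at " + metadata.topic() + "-" + metadata.partition()
                    + "@" + metadata.offset());
        }
    }
}
```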

Consumer Array

  • This is a normal array of machines running Kafka high-level consumers for the topic's consumer groups.

We are currently running this setup, with a few additional components for some other functional flows, in production, and it is basically fire-and-forget from the source's point of view.

This system addresses all of your concerns.

  1. box connection outage : Unless the source polling app gets a 2xx, it will produce the message again and again, which may lead to duplicates.

  2. box restart : Due to the retry mechanism of the source, this shouldn't be a problem either.

  3. internal Kafka exceptions : Taken care of by the polling app; the producer array will reply with a 5xx (unable to produce) and the message will be retried further.

Acks = -1 also ensures that all the replicas are in sync and have a copy of the message, so a broker going down will not be an issue either.
