
How to handle Kafka publishing failure in a robust way

I'm using Kafka, and we have a use case to build a fault-tolerant system where not even a single message should be missed. The problem: if publishing to Kafka fails for any reason (ZooKeeper down, Kafka broker down, etc.), how can we robustly handle those messages and replay them once things are back up again? As I said, we cannot afford to lose even a single message. A second requirement is that, at any given point in time, we need to know how many messages failed to publish to Kafka for any reason (i.e., something like counter functionality), so that those messages can then be re-published.

One solution is to push those failed messages to a database that can handle that kind of write load and also gives us an accurate counter facility (for example Cassandra, where writes are very fast; but we also need counter functionality, and I gather Cassandra's counter functionality is not that great, so we don't want to use it).

This question is more about the architecture, and only then about which technology to use to make it happen.

PS: We handle somewhere around 3000 TPS, so when the system starts failing, the backlog of failed messages can grow very fast in a very short time. We're using Java-based frameworks.

Thanks for your help!

The reason Kafka was built in a distributed, fault-tolerant way is to handle problems exactly like yours: multiple failures of core components should not cause service interruptions. To avoid losing ZooKeeper, deploy at least 3 ZooKeeper instances (if this is in AWS, deploy them across availability zones). To avoid broker failures, deploy multiple brokers, and make sure you specify several of them in your producer's bootstrap.servers property. To ensure that the Kafka cluster has written your message in a durable manner, set the acks=all property in the producer; the client write is then acknowledged only once all in-sync replicas have received the message (at the expense of throughput). You can also set queuing limits so that if writes to the broker start backing up, you can catch the exception, handle it, and possibly retry.
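To illustrate that last point, here is a minimal sketch of catching a failed publish and retrying with exponential backoff. The class and method names (RetryingPublisher, callWithRetries) are made up for this example and not part of any library; a real producer would pass its send logic in as the action:

```java
import java.util.function.Supplier;

/** Illustrative sketch: bounded retries with exponential backoff around a publish call. */
public class RetryingPublisher {

  public static <T> T callWithRetries(Supplier<T> action, int maxAttempts, long initialBackoffMs) {
    long backoff = initialBackoffMs;
    RuntimeException last = null;
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      try {
        return action.get();
      } catch (RuntimeException e) {
        last = e;  // remember the failure and back off before the next attempt
        if (attempt < maxAttempts) {
          try {
            Thread.sleep(backoff);
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            throw last;
          }
          backoff *= 2;  // exponential backoff between attempts
        }
      }
    }
    throw last;  // all attempts exhausted; caller decides what to do with the message
  }
}
```

If the call still fails after the last attempt, the exception propagates, which is the point at which you would divert the message to whatever replay mechanism you choose.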

Using Cassandra (another well-thought-out distributed, fault-tolerant system) to "stage" your writes doesn't seem to add any reliability to your architecture, but it does increase complexity. Cassandra wasn't written to be a message queue for a message queue, so I would avoid this.

Properly configured, Kafka should be able to handle all your message writes and provide suitable guarantees.

I am super late to the party, but I see something missing in the above answers :)

The strategy of choosing a distributed system like Cassandra is a decent idea: once Kafka is back up and healthy, you can retry all the messages that were written into it.

I would like to answer the part about "knowing how many messages failed to publish at a given time".

From the tags, I see that you are using apache-kafka and kafka-consumer-api. You can write a custom callback for your producer, and this callback can tell you whether the message failed or was published successfully. On failure, log the metadata for the message.

Now you can use log-analyzing tools to analyze your failures. One such decent tool is Splunk.

Below is a small code snippet that explains the callback I was talking about:

import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ProduceToKafka {

  private static final Logger log = LoggerFactory.getLogger(ProduceToKafka.class);

  // TracerBulletProducer class holds the producer properties
  private final KafkaProducer<String, String> myProducer = TracerBulletProducer
      .createProducer();

  public void publishMessage(String payload) {

    ProducerRecord<String, String> message = new ProducerRecord<>(
        "topicName", payload);

    myProducer.send(message, new MyCallback(message.key(), message.value()));
  }

  class MyCallback implements Callback {

    private final String key;
    private final String value;

    public MyCallback(String key, String value) {
      this.key = key;
      this.value = value;
    }

    @Override
    public void onCompletion(RecordMetadata metadata, Exception exception) {
      if (exception == null) {
        log.info("--------> All good !!");
      } else {
        // Log enough metadata to identify and replay the failed message later
        log.info("--------> not so good !! key={}, value={}", key, value);
        if (metadata != null) {  // metadata can be null when the send failed
          log.info(metadata.toString());
        }
        log.info(exception.getMessage());
      }
    }
  }
}

If you analyze the number of "--------> not so good !!" logs per time unit, you can get the required insights.
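If log analysis is heavier than you need, the same callback can also feed an in-process counter. A minimal sketch, assuming a class of your own (the name FailureCounter is illustrative) whose recordSuccess/recordFailure methods you would invoke from onCompletion:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Thread-safe success/failure tally, safe to update from producer callback threads. */
public class FailureCounter {

  private final AtomicLong failed = new AtomicLong();
  private final AtomicLong succeeded = new AtomicLong();

  public void recordSuccess() { succeeded.incrementAndGet(); }

  public void recordFailure() { failed.incrementAndGet(); }

  public long failedCount() { return failed.get(); }

  public long succeededCount() { return succeeded.get(); }
}
```

This answers the "how many messages failed at a given point in time" requirement directly, at the cost of the count living only in that process; for counts across restarts or multiple producers you would still need to persist or aggregate it somewhere.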

Godspeed!

Chris already explained how to keep the system fault-tolerant.

Kafka supports at-least-once message delivery semantics: if something goes wrong while sending a message, the producer can try to resend it.

When you create the Kafka producer properties, you can configure this by setting the retries option to a value greater than 0.

 Properties props = new Properties();
 props.put("bootstrap.servers", "localhost:4242");
 props.put("acks", "all");
 props.put("retries", 3);  // retry failed sends up to 3 times instead of giving up
 props.put("batch.size", 16384);
 props.put("linger.ms", 1);
 props.put("buffer.memory", 33554432);
 props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
 props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

 Producer<String, String> producer = new KafkaProducer<>(props);

For more info, check the official Kafka producer configuration documentation.
