
Kafka Producer TimeoutException

I am running a Samza stream job that writes data to a Kafka topic. Kafka is running as a 3-node cluster, and the Samza job is deployed on YARN. We are seeing a lot of these exceptions in the container logs:

 INFO [2018-10-16 11:14:19,410] [U:2,151,F:455,T:2,606,M:2,658] samza.container.ContainerHeartbeatMonitor:[ContainerHeartbeatMonitor:stop:61] - [main] - Stopping ContainerHeartbeatMonitor
ERROR [2018-10-16 11:14:19,410] [U:2,151,F:455,T:2,606,M:2,658] samza.runtime.LocalContainerRunner:[LocalContainerRunner:run:107] - [main] - Container stopped with Exception. Exiting process now.
org.apache.samza.SamzaException: org.apache.samza.SamzaException: Unable to send message from TaskName-Partition 15 to system kafka.
        at org.apache.samza.task.AsyncRunLoop.run(AsyncRunLoop.java:147)
        at org.apache.samza.container.SamzaContainer.run(SamzaContainer.scala:694)
        at org.apache.samza.runtime.LocalContainerRunner.run(LocalContainerRunner.java:104)
        at org.apache.samza.runtime.LocalContainerRunner.main(LocalContainerRunner.java:149)
Caused by: org.apache.samza.SamzaException: Unable to send message from TaskName-Partition 15 to system kafka.
        at org.apache.samza.system.kafka.KafkaSystemProducer$$anon$1.onCompletion(KafkaSystemProducer.scala:181)
        at org.apache.kafka.clients.producer.internals.RecordBatch.done(RecordBatch.java:109)
        at org.apache.kafka.clients.producer.internals.RecordBatch.maybeExpire(RecordBatch.java:160)
        at org.apache.kafka.clients.producer.internals.RecordAccumulator.abortExpiredBatches(RecordAccumulator.java:245)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:212)
        at org.apache.kafka.clients.producer.internals.Sender.run(Sender.java:135)
        at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.kafka.common.errors.TimeoutException: Expiring 5 record(s) for Topic3-16 due to 30332 ms has passed since last attempt plus backoff time

These 3 types of exceptions come up a lot:

59088 org.apache.kafka.common.errors.TimeoutException: Expiring 115 record(s) for Topic3-1 due to 30028 ms has passed since last attempt plus backoff time

61015 org.apache.kafka.common.errors.TimeoutException: Expiring 60 record(s) for Topic3-1 due to 74949 ms has passed since batch creation plus linger time

62275 org.apache.kafka.common.errors.TimeoutException: Expiring 176 record(s) for Topic3-4 due to 74917 ms has passed since last append

Please help me understand what the issue is here. Whenever it happens, the Samza container gets restarted.

The error indicates that records are being put into the queue faster than they can be sent from the client.

When your producer sends messages, they are stored in a buffer (before being sent to the target broker), and records are grouped together into batches in order to increase throughput. When a new record is added to a batch, it must be sent within a configurable time window, which is controlled by request.timeout.ms (the default is 30 seconds). If the batch sits in the queue for longer than that, a TimeoutException is thrown, and the batch's records are then removed from the queue and will not be delivered to the broker.

Increasing the value of request.timeout.ms should do the trick for you.

If this does not work, you can also try decreasing batch.size so that batches are sent more often (but each one will include fewer messages), and make sure that linger.ms is set to 0 (which is the default value).
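The settings above are all plain producer properties, so they can be sketched as a `java.util.Properties` object passed to the producer. The broker addresses and the exact values below are illustrative assumptions, not values from the original post:

```java
import java.util.Properties;

public class ProducerTimeoutConfig {

    // Builds producer properties with a longer request timeout and a
    // smaller batch size, as suggested above. Broker addresses and the
    // numeric values are placeholders for illustration.
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092,broker2:9092,broker3:9092");
        props.put("request.timeout.ms", "120000"); // raised from the 30 s default
        props.put("batch.size", "8192");           // smaller batches, sent more often
        props.put("linger.ms", "0");               // send as soon as records arrive
        return props;
    }

    public static void main(String[] args) {
        Properties p = build();
        System.out.println("request.timeout.ms = " + p.getProperty("request.timeout.ms"));
    }
}
```

In a Samza job these would typically be supplied through the job config rather than built in code, but the property names are the same.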

Note that these are client-side settings, so you need to restart your producer application (not the Kafka brokers) for them to take effect.

If you still get the error, I assume something is wrong with your network. Have you enabled SSL?

We tried everything, but no luck.

  1. Decreased the producer batch size and increased request.timeout.ms.
  2. Restarted the target Kafka cluster; still no luck.
  3. Checked replication on the target Kafka cluster; that was working fine as well.
  4. Added retries and retry.backoff.ms to the producer properties.
  5. Added linger.ms to the Kafka producer properties as well.
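For reference, steps 1, 4 and 5 above amount to a producer configuration along these lines. The post does not give the actual values used, so the numbers here are hypothetical:

```java
import java.util.Properties;

public class RetryingProducerConfig {

    // Producer properties reflecting steps 1, 4 and 5 above.
    // All values are illustrative assumptions, not the ones from the post.
    public static Properties build() {
        Properties props = new Properties();
        props.put("batch.size", "4096");          // step 1: smaller batches
        props.put("request.timeout.ms", "60000"); // step 1: longer request timeout
        props.put("retries", "5");                // step 4: retry failed sends
        props.put("retry.backoff.ms", "500");     // step 4: wait between retries
        props.put("linger.ms", "5");              // step 5: small linger window
        return props;
    }
}
```

Note that retries only help with transient send failures; as the next paragraph shows, they cannot fix a cluster that is itself unable to serve metadata.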

In the end, in our case there was an issue with the Kafka cluster itself: we were unable to fetch metadata from 2 of the servers.

When we changed the target Kafka cluster to our dev box, it worked fine.
