
Guidelines to handle Timeout exception for Kafka Producer?

I often get timeout exceptions in my Kafka producer, for various reasons. I am currently using the default values for the producer config.

I have seen the following timeout exceptions:

org.apache.kafka.common.errors.TimeoutException: Failed to update metadata after 60000 ms.

org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for topic-1-0: 30001 ms has passed since last append

I have the following questions:

  1. What are the general causes of these Timeout exceptions?

    1. Temporary network issue?
    2. Server issue? If yes, what kind of server issue?
  2. What are the general guidelines for handling the Timeout exception?

    1. Set the 'retries' config so that the Kafka API does the retries?
    2. Increase 'request.timeout.ms' or 'max.block.ms'?
    3. Catch the exception and have the application layer retry sending the message? This seems hard with async send, as messages would then be sent out of order.
  3. Are Timeout exceptions retriable, and is it safe to retry them?

I am using Kafka v2.1.0 and Java 11.

Thanks in advance.

"What are the general causes of these Timeout exceptions?"

  1. The most common cause I have seen is stale metadata: a broker went down, and the topic partitions on that broker were failed over to other brokers. However, the topic metadata was not updated properly, so the client still tries to talk to the failed broker, either to fetch metadata or to publish the message. That causes a timeout exception.

  2. Network connectivity issues. These can be easily diagnosed with telnet broker_host broker_port

  3. The broker is overloaded. This can happen if the broker is saturated with a high workload, or hosts too many topic partitions.

To handle the timeout exceptions, the general practice is:

  1. Rule out broker-side issues: make sure that the topic partitions are fully replicated and that the brokers are not overloaded.

  2. Fix host name resolution or network connectivity issues, if there are any.

  3. Tune parameters such as request.timeout.ms, delivery.timeout.ms, etc. My past experience is that the default values work fine in most cases.
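As a sketch of what that tuning looks like in code: the keys below are standard Kafka producer configs, and the values shown are the client defaults (as of Kafka 2.1) spelled out explicitly; the broker address is a made-up placeholder.

```java
import java.util.Properties;

public class ProducerTimeoutTuning {
    // Timeout-related producer settings, written out with their default values.
    static Properties timeoutConfig() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // hypothetical broker address
        // Max time to wait for the broker's response to a single produce request.
        props.put("request.timeout.ms", "30000");
        // Upper bound on the whole send: batching + retries + in-flight time.
        props.put("delivery.timeout.ms", "120000");
        // How long send()/partitionsFor() may block while fetching metadata;
        // this is the 60000 ms in the "Failed to update metadata" message.
        props.put("max.block.ms", "60000");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(timeoutConfig());
    }
}
```

Raising these values only hides a slow or unreachable broker for longer; they are worth touching mainly when legitimate produce latency (large batches, slow disks) exceeds the defaults.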

The default Kafka config values, for both producers and brokers, are conservative enough that, under normal circumstances, you shouldn't run into any timeouts. Those problems typically point to a flaky/lossy network between the producer and the brokers.

The exception you're getting, Failed to update metadata, usually means one of the brokers is not reachable by the producer, and the effect is that it cannot get the metadata.

For your second question, Kafka will automatically retry sending messages that were not fully acked by the brokers. It's up to you whether to catch the exception and retry on the application side, but if you're hitting 1+ minute timeouts, retrying is probably not going to make much of a difference. You're going to have to figure out the underlying network/reachability problems with the brokers anyway.
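On the ordering concern from question 2.3: letting the client retry can reorder messages only when more than one request is in flight per connection. A sketch of the relevant settings (real producer config keys; the values and broker address are illustrative assumptions):

```java
import java.util.Properties;

public class OrderedRetryConfig {
    // Delegate retries to the Kafka client while keeping per-partition order.
    static Properties orderedRetries() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker-1:9092"); // hypothetical address
        // Let the client retry transient errors (timeouts included);
        // delivery.timeout.ms still bounds the total time spent.
        props.put("retries", "2147483647");
        // With >1 in-flight request, a retried batch can land behind a
        // newer one; limiting to 1 preserves per-partition ordering.
        props.put("max.in.flight.requests.per.connection", "1");
        // Alternatively, idempotence (brokers/clients >= 0.11) deduplicates
        // retried batches so retries are safe from the duplicate side too.
        props.put("enable.idempotence", "true");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(orderedRetries());
    }
}
```

This only helps for failures the client can retry internally; once send() has already failed with a TimeoutException, any application-level retry is a new send and may still reorder.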

In my experience, the network problems are usually:

  • Port 9092 is blocked by a firewall, either on the producer side, on the broker side, or somewhere in the middle (try nc -z broker-ip 9092 from the server running the producer)
  • DNS resolution is broken, so even though the port is open, the producer cannot resolve the broker to an IP address.

I suggest using the following properties when constructing the producer config.

Acks needed from the partition leader:

kafka.acks=1

Maximum number of retries the Kafka producer will make to send the message and receive acks from the leader:

kafka.retries=3

Request timeout for each individual request:

timeout.ms=200

Wait before sending the next request again; this avoids sending requests in a tight loop:

retry.backoff.ms=50

Upper bound to finish all the retries:

dataLogger.kafka.delivery.timeout.ms=1200

producer.send(record, new Callback {
  override def onCompletion(recordMetadata: RecordMetadata, e: Exception): Unit = {
    // e is non-null only when the send failed
    if (e == null) {
      logger.debug(s"KafkaLogger : Message Sent $record to Topic ${recordMetadata.topic()}, Partition ${recordMetadata.partition()}, Offset ${recordMetadata.offset()}")
    } else {
      logger.error(s"Exception while sending message $record to Error topic: $e")
    }
  }
})

Close the producer with a timeout:

producer.close(1000, TimeUnit.MILLISECONDS)

The Timeout exception can also happen if the value of advertised.listeners (protocol://host:port) is not reachable by the producer or consumer.

Check the configuration of the advertised.listeners property with the following command:

cat $KAFKA_HOME/config/server.properties
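For example, pulling just that line out of the file and then probing the advertised host:port from the client machine (the file contents and hostname below are a made-up sample; run the same grep against your own server.properties):

```shell
# Hypothetical sample server.properties, for illustration only
cat > /tmp/server.properties <<'EOF'
broker.id=0
advertised.listeners=PLAINTEXT://kafka-1.example.com:9092
EOF

# Extract the listener address that clients will actually be told to connect to
grep '^advertised.listeners' /tmp/server.properties

# Then verify that host:port is reachable from the producer's machine, e.g.:
#   nc -z kafka-1.example.com 9092
```

If the advertised host resolves to an address the producer cannot reach (a common issue with Docker or cloud-internal hostnames), the initial bootstrap connection may succeed while every subsequent metadata update or produce request times out.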
