Kafka producer fails to send messages with NOT_LEADER_FOR_PARTITION exception

Question

We're using spring-cloud-stream-binder-kafka (3.0.3.RELEASE) to send messages to our Kafka cluster (2.4.1). Every now and then one of the producer threads receives NOT_LEADER_FOR_PARTITION exceptions and even exhausts its retries (currently set to 12, activated by the spring-retry dependency). We've restricted the retries because we're sending about 1k msg/s (per producer instance) and were worried about the size of the buffer. As a result we regularly lose messages, which is bad for downstream consumers, because we can't simply reproduce the incoming traffic.

The error message is:


[Producer clientId=producer-5] Received invalid metadata error in produce request on partition topic-21 due to org.apache.kafka.common.errors.NotLeaderForPartitionException: This server is not the leader for that topic-partition.. Going to request metadata update now
[Producer clientId=producer-5] Got error produce response with correlation id 974706 on topic-partition topic-21, retrying (8 attempts left). Error: NOT_LEADER_FOR_PARTITION
[Producer clientId=producer-5] Got error produce response with correlation id 974707 on topic-partition topic-21, retrying (1 attempts left). Error: NOT_LEADER_FOR_PARTITION

Is there any known way to avoid this? Should we go back to the default of MAX_INT retries? Why does it keep sending to the same broker, even though it responded with NOT_LEADER_FOR_PARTITION?

Any hints are welcome.

EDIT: We just noticed that the broker metric kafka_network_requestmetrics_responsequeuetimems goes up around that time, but the max we've seen is around 2.5s.

Answer 1

Both Produce and Fetch requests are sent to the leader replica of the partition. NotLeaderForPartitionException is thrown when a request is sent to a broker that is not currently the leader replica of the partition.

The client maintains the information regarding the leader of each partition as a cache. The complete process of cache management is shown below.

The client refreshes this information periodically; the interval is controlled by metadata.max.age.ms in the producer configuration. The default value of this setting is 300000 ms (5 minutes).
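To make the caching behavior concrete, here is a minimal sketch of a leader cache that is refreshed either when it is older than a maximum age or when a "not leader" error has been reported. It is a toy model, not Kafka's actual implementation: the class `LeaderCacheSketch`, its methods, and the `lookup` function (a stand-in for a metadata request to the cluster) are all hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Toy model of a client-side partition-leader cache (NOT the real Kafka code).
public class LeaderCacheSketch {
    private final Map<Integer, String> leaders = new HashMap<>();
    private final Function<Integer, String> lookup; // hypothetical stand-in for a metadata request
    private final long maxAgeMs;                     // plays the role of metadata.max.age.ms
    private long lastRefreshMs = 0L;
    private boolean updateRequested = true;          // force a fetch on first use

    public LeaderCacheSketch(Function<Integer, String> lookup, long maxAgeMs) {
        this.lookup = lookup;
        this.maxAgeMs = maxAgeMs;
    }

    // Returns the cached leader, discarding the cache first if it is stale
    // or if an update was requested after an error.
    public String leaderFor(int partition, long nowMs) {
        if (updateRequested || nowMs - lastRefreshMs >= maxAgeMs) {
            leaders.clear();
            lastRefreshMs = nowMs;
            updateRequested = false;
        }
        return leaders.computeIfAbsent(partition, lookup);
    }

    // Called when a broker answers with a "not leader" style error.
    public void onNotLeaderError() {
        updateRequested = true;
    }

    public static void main(String[] args) {
        String[] currentLeader = {"broker-1"};
        LeaderCacheSketch cache = new LeaderCacheSketch(p -> currentLeader[0], 300_000L);
        System.out.println(cache.leaderFor(21, 0L));     // fetched from lookup
        currentLeader[0] = "broker-2";                   // leadership moves
        System.out.println(cache.leaderFor(21, 1_000L)); // still the cached, now stale value
        cache.onNotLeaderError();
        System.out.println(cache.leaderFor(21, 2_000L)); // re-fetched after the error
    }
}
```

In this model, a stale entry keeps being used until either the age limit passes or an error triggers a refresh, which is the general reason a client can briefly keep addressing a former leader.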

You can go through the Apache Kafka documentation:

https://kafka.apache.org/documentation/

Please also go through the Sender.java code:

https://github.com/a0x8o/kafka/blob/master/clients/src/main/java/org/apache/kafka/clients/producer/internals/Sender.java

You will find both error messages in the Sender code. The default value of metadata.max.age.ms is 300000 ms (5 minutes). I think you should reduce this value and then observe the behavior.
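As a rough sketch of what such a change could look like, the following builds a set of producer properties with a shorter metadata age, the default retry count restored, and the total send time bounded by delivery.timeout.ms instead. The property keys are standard Kafka producer settings; the class name `ProducerSettingsSketch` and the concrete values are assumptions chosen for illustration, not tested recommendations for this cluster.

```java
import java.util.Properties;

// Illustrative only: plain string keys are used so the sketch has no Kafka dependency.
public class ProducerSettingsSketch {

    public static Properties build(long metadataMaxAgeMs, long deliveryTimeoutMs, long retryBackoffMs) {
        Properties props = new Properties();
        // Refresh cached partition leaders more often than the 300000 ms default.
        props.setProperty("metadata.max.age.ms", Long.toString(metadataMaxAgeMs));
        // Back to the default retry count; delivery.timeout.ms then bounds the total time per record.
        props.setProperty("retries", Integer.toString(Integer.MAX_VALUE));
        props.setProperty("delivery.timeout.ms", Long.toString(deliveryTimeoutMs));
        // Wait between retries so a leader election has time to complete.
        props.setProperty("retry.backoff.ms", Long.toString(retryBackoffMs));
        return props;
    }

    public static void main(String[] args) {
        // Example values (assumed): 30 s metadata age, 120 s delivery timeout, 500 ms backoff.
        Properties props = build(30_000L, 120_000L, 500L);
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

With the Kafka binder, properties like these are usually supplied through the binder's producer properties rather than built in code; check the binder documentation for the exact property prefix in your version.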

Answer 2

You need to configure the listeners properly.

I'm using docker-compose like this:

services:
  zookeeper:
    container_name: zookeeper
    ports:
      - "2181:2181"
    ...
  broker-1:
    hostname: "broker-1.mydomain.com"
    ports:
      - "29091:29091"
    ...
  broker-2:
    hostname: "broker-2.mydomain.com"
    container_name: broker-2
    ports:
      - "29092:29092"

Edit server.properties for each broker:

broker-1

listeners: PRIVATE_HOSTNAME://broker-1.mydomain.com:9092,PUBLIC_HOSTNAME://broker-1.mydomain.com:29091
advertised.listeners: PRIVATE_HOSTNAME://broker-1.mydomain.com:9092,PUBLIC_HOSTNAME://broker-1.mydomain.com:29091
listener.security.protocol.map: PUBLIC_HOSTNAME:PLAINTEXT,PRIVATE_HOSTNAME:PLAINTEXT
inter.broker.listener.name: PRIVATE_HOSTNAME

broker-2

listeners: PRIVATE_HOSTNAME://broker-2.mydomain.com:9092, PUBLIC_HOSTNAME://broker-2.mydomain.com:29092
advertised.listeners: PRIVATE_HOSTNAME://broker-2.mydomain.com:9092, PUBLIC_HOSTNAME://broker-2.mydomain.com:29092
listener.security.protocol.map: PUBLIC_HOSTNAME:PLAINTEXT, PRIVATE_HOSTNAME:PLAINTEXT
inter.broker.listener.name: PRIVATE_HOSTNAME
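To check which address each named listener will hand out to clients, the advertised.listeners value can be split into listener name and host:port pairs. The helper below is a small sketch for inspecting such a value; `AdvertisedListenersSketch` and `parse` are hypothetical names, and this is not how the broker itself parses its configuration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class AdvertisedListenersSketch {

    // Splits "NAME://host:port,NAME2://host:port" into listener name -> "host:port".
    public static Map<String, String> parse(String advertisedListeners) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String entry : advertisedListeners.trim().split("\\s*,\\s*")) {
            String[] parts = entry.split("://", 2);
            if (parts.length != 2) {
                throw new IllegalArgumentException("Malformed listener entry: " + entry);
            }
            result.put(parts[0], parts[1]);
        }
        return result;
    }

    public static void main(String[] args) {
        String value = "PRIVATE_HOSTNAME://broker-2.mydomain.com:9092, "
                + "PUBLIC_HOSTNAME://broker-2.mydomain.com:29092";
        // Prints one line per listener, e.g. the address external clients are told to use.
        parse(value).forEach((name, address) -> System.out.println(name + " -> " + address));
    }
}
```

A client can only reach a broker if the host:port printed for the listener it connects through is resolvable and reachable from the client's side.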

IMPORTANT: note that I'm using the same hostname for the private and public networks, because the consumer/producer can only access Kafka by the registered name, like this:

    [BrokerToControllerChannelManager broker=1 name=forwarding]: Recorded new controller, from now on will use broker broker-1.mydomain.com:9092
...
    [BrokerToControllerChannelManager broker=2 name=forwarding]: Recorded new controller, from now on will use broker broker-2.mydomain.com:9092

Edit your host's /etc/hosts:

127.0.0.1   broker-1.mydomain.com
127.0.0.1   broker-2.mydomain.com

Answer 1
3 2020-05-14 16:35:21

Answer 2
0 2022-09-12 13:18:31
