How to handle kafka leader failover with two nodes for a topic with 2 partitions and 2 replicas?

I'm playing with Kafka in a multi-node environment to test how failover works. Actually, I have 2 VMs with 1 Kafka node inside each VM, and only 1 ZooKeeper inside one of the two VMs. I know this is not an optimal production configuration, but it's just to train myself and understand things better.

Here is my configuration: VM1 IP 192.168.64.2 (with only one broker, broker.id=2); VM2 IP 192.168.64.3 (ZooKeeper runs here, plus a broker with broker.id=1).

I start Kafka through podman (this is not a problem with podman; everything is configured correctly).

On VM1:

podman run -e KAFKA_BROKER_ID=2 -e KAFKA_ZOOKEEPER_CONNECT=192.168.64.3:2181 -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9093,PLAINTEXT_HOST://192.168.64.2:29092 -e KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=2 -e KAFKA_INTER_BROKER_LISTENER_NAME=PLAINTEXT_HOST -e UNCLEAN_LEADER_ELECTION_ENABLE=true --pod zookeeper-kafka confluentinc/cp-kafka:latest

On VM2:

podman run -e KAFKA_BROKER_ID=1 -e KAFKA_ZOOKEEPER_CONNECT=localhost:2181 -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9092,PLAINTEXT_HOST://192.168.64.3:29092 -e KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=2 -e KAFKA_INTER_BROKER_LISTENER_NAME=PLAINTEXT_HOST -e UNCLEAN_LEADER_ELECTION_ENABLE=true  --pod zookeeper-kafka confluentinc/cp-kafka:latest

Now I create a topic "orders":

./kafka-topics --create --bootstrap-server 192.168.64.2:29092,192.168.64.3:29092 --replication-factor 2 --partitions 2 --topic orders

Then I create a producer:

./kafka-console-producer --broker-list 192.168.64.2:29092,192.168.64.3:29092 --topic orders

And a consumer:

./kafka-console-consumer --bootstrap-server 192.168.64.2:29092,192.168.64.3:29092 --topic orders

Here is what I am trying to do:

  1. Start ZooKeeper and the 2 Kafka nodes, create the "orders" topic, the producer and the consumer (OK, everything works well)
  2. Send a message from my producer and check that the consumer receives it (OK)
  3. Kill the Kafka node on VM2 (OK)
  4. Send another message from my producer and check that the consumer receives it (OK, the broker on VM1 can deliver the message)
  5. Restart the killed Kafka node on VM2 (OK. After that I can see that the 2 partitions have VM1 as the leader)
  6. Send another message from my producer and check that the consumer receives it (OK)
  7. Kill the Kafka node on VM1, which is now the leader of the 2 partitions (OK)
  8. Send another message from my producer and check that the consumer receives it (OK, the broker on VM2 can deliver the message)
  9. Restart the killed Kafka node on VM1 (OK. After that I can see that the 2 partitions have VM2 as the leader)
  10. Send another message from my producer and check that the consumer receives it (OK)
  11. Kill the Kafka node on VM2 again (OK)
  12. Send another message from my producer and check that the consumer receives it (NOT OK): here the producer can't send the message and my consumer never receives it. After a while I get an error in my producer:
ERROR Error when sending message to topic orders with key: null, value: 9 bytes with error: (org.apache.kafka.clients.producer.internals.ErrorLoggingCallback)
org.apache.kafka.common.errors.TimeoutException: Expiring 1 record(s) for orders-0:120000 ms has passed since batch creation

I really don't understand what is happening here. It works well at the beginning, but after a few start/stop/start cycles of the brokers it starts to fail. I should point out that I never kill the 2 brokers at the same time.

Could you please explain what I am missing here?

Thank you all :)


EDIT

To complete the comments below:

@OneCricketeer, I put the answer to your comment here.

At startup, when all is fine:

Topic: orders   TopicId: I3hMNln9TpSuo76xHSpMXQ PartitionCount: 2   ReplicationFactor: 2    Configs:
    Topic: orders   Partition: 0    Leader: 2   Replicas: 2,1   Isr: 2,1
    Topic: orders   Partition: 1    Leader: 1   Replicas: 1,2   Isr: 1,2

After killing VM2:

Topic: orders   TopicId: I3hMNln9TpSuo76xHSpMXQ PartitionCount: 2   ReplicationFactor: 2    Configs:
    Topic: orders   Partition: 0    Leader: 2   Replicas: 2,1   Isr: 2
    Topic: orders   Partition: 1    Leader: 2   Replicas: 1,2   Isr: 2

After killing VM1:

Topic: orders   TopicId: I3hMNln9TpSuo76xHSpMXQ PartitionCount: 2   ReplicationFactor: 2    Configs:
    Topic: orders   Partition: 0    Leader: 1   Replicas: 2,1   Isr: 1
    Topic: orders   Partition: 1    Leader: 1   Replicas: 1,2   Isr: 1

After killing VM2:

Topic: orders   TopicId: I3hMNln9TpSuo76xHSpMXQ PartitionCount: 2   ReplicationFactor: 2    Configs:
    Topic: orders   Partition: 0    Leader: 2   Replicas: 2,1   Isr: 2
    Topic: orders   Partition: 1    Leader: 2   Replicas: 1,2   Isr: 2

(From here on, the producer can't publish messages anymore)
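
For reference, the partition listings above come from the topic describe tool; presumably something like the following command, assuming the same bootstrap servers as the producer and consumer:

./kafka-topics --describe --bootstrap-server 192.168.64.2:29092,192.168.64.3:29092 --topic orders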

After a long time of reading and investigating things about Kafka, I finally found the answer to my problem.

With only 2 brokers, I need the following configuration:

KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=2
KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS=1

The problem was the default number of partitions for the offsets topic (it was 49 or 50, if I remember well).
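
As a sanity check, the internal offsets topic can be inspected like any other topic; a describe call such as the one below (same script and bootstrap servers as above are assumed) shows its partition count and replication factor:

./kafka-topics --describe --bootstrap-server 192.168.64.2:29092,192.168.64.3:29092 --topic __consumer_offsets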

Now, with only one partition and 2 replicas for the offsets topic, everything works well and I can start/stop/start/stop/... my brokers as many times as I want; the other broker takes the lead and continues to handle my messages.
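
For completeness, here is a sketch of the VM1 startup command with both offsets-topic settings, assuming the rest of the original command stays unchanged:

# VM1 (broker.id=2): same command as before, plus KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS=1
podman run \
  -e KAFKA_BROKER_ID=2 \
  -e KAFKA_ZOOKEEPER_CONNECT=192.168.64.3:2181 \
  -e KAFKA_ADVERTISED_LISTENERS=PLAINTEXT://localhost:9093,PLAINTEXT_HOST://192.168.64.2:29092 \
  -e KAFKA_LISTENER_SECURITY_PROTOCOL_MAP=PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT \
  -e KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR=2 \
  -e KAFKA_OFFSETS_TOPIC_NUM_PARTITIONS=1 \
  -e KAFKA_INTER_BROKER_LISTENER_NAME=PLAINTEXT_HOST \
  -e UNCLEAN_LEADER_ELECTION_ENABLE=true \
  --pod zookeeper-kafka confluentinc/cp-kafka:latest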

I hope this can help someone in the future.
