
Kafka stops after a certain period of time; brokers are lost

I have 4 Kafka brokers running with Debezium. After some days of running well, three of the Kafka machines dropped off the network for a period of time, and the connectDistributed.out log file filled up with messages showing the following error:

[2020-05-04 13:27:02,526] WARN [Consumer clientId=connector-consumer-sink-warehouse-6, groupId=connect-sink-warehouse] 133 partitions have leader brokers without a matching listener, including [SCT010-2, SC2010-2, SC1010-0, SC1010-1, SF4010-0, SUB010-0, SUB010-1, SWP010-0, SWP010-1, ACO010-2] (org.apache.kafka.clients.NetworkClient:1044)

I have 4 Kafka machines, with broker IDs 0 to 3:

192.168.240.70 - Broker 0
192.168.240.71 - Broker 1
192.168.240.72 - Broker 2
192.168.240.73 - Broker 3

Zookeeper:

192.168.240.70

Here is my server.properties. The files are identical on all machines, except for listeners and advertised.listeners, which point to the IP of the machine Kafka is installed on, and broker.id, which must be unique (0 to 3):

broker.id=0
listeners=CONTROLLER://192.168.240.70:9091,INTERNAL://192.168.240.70:9092
advertised.listeners=CONTROLLER://192.168.240.70:9091,INTERNAL://192.168.240.70:9092
listener.security.protocol.map=CONTROLLER:PLAINTEXT,INTERNAL:PLAINTEXT
control.plane.listener.name=CONTROLLER
inter.broker.listener.name=INTERNAL
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/home/william/kafka/data/kafka/
num.partitions=3
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=1
log.retention.hours=150
log.retention.bytes=200000000000
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

zookeeper.connect=192.168.240.70:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=3
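
Before looking at the topics, it can help to verify which brokers Zookeeper currently knows about. A quick sketch using the zookeeper-shell tool that ships with Kafka (paths assume a plain tarball install under the Kafka home directory):

# List the broker IDs currently registered in Zookeeper;
# with all four brokers healthy this should print [0, 1, 2, 3].
bin/zookeeper-shell.sh 192.168.240.70:2181 ls /brokers/ids

# Inspect one broker's registration, including its advertised listeners.
bin/zookeeper-shell.sh 192.168.240.70:2181 get /brokers/ids/0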

The Kafka Connect internal topics (configs, offsets and status) show replication problems. Could this be related to the listeners config?

[Screenshot: Kafka Manager showing the replication info]

About the connectors' health:

[Screenshot: connector health status in Kafka Connect]

And in Kafka Connect, only one broker is present:

[Screenshot: Kafka brokers]
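
To see the replication state directly, you can describe the Connect internal topics with kafka-topics.sh. A sketch, assuming the commonly used topic names connect-configs, connect-offsets and connect-status (they are set in connect-distributed.properties and may differ in your setup):

# Show the leader and ISR for each partition of a Connect internal topic.
bin/kafka-topics.sh --bootstrap-server 192.168.240.70:9092 --describe --topic connect-offsets

# List only the partitions whose leader is currently unavailable.
bin/kafka-topics.sh --bootstrap-server 192.168.240.70:9092 --describe --unavailable-partitions

On older Kafka versions, replace --bootstrap-server 192.168.240.70:9092 with --zookeeper 192.168.240.70:2181.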

How can I fix this error? It seems to be related to leader election, or to finding the leader again after a long period without broker access.

After some research I found the problem. So, to help here, this is the concept behind it:

When we create a distributed Kafka system, we distribute the topic partitions across the brokers and elect a leader for each of them. In my case, I have 4 brokers, and Zookeeper chooses on the fly which one will be the leader of a given topic.
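
Incidentally, the broker that performs this leader election (the controller) is itself registered in Zookeeper, so you can check which broker currently holds that role. A sketch with the same zookeeper-shell tool:

# Print the id of the broker currently acting as controller.
bin/zookeeper-shell.sh 192.168.240.70:2181 get /controller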

As three of the four Kafka servers were unreachable for more than an hour, Zookeeper kept trying to reach the leaders and could not. And because my config requires each topic to be replicated to three brokers, Zookeeper could not keep the topics healthy.

We also have the rebalance config group.initial.rebalance.delay.ms=3, which delays the initial consumer group rebalance by only 3 milliseconds (note the unit: milliseconds, not seconds), and there is a limited number of reconnect attempts before a Kafka broker is considered down. Those attempts were exhausted, and Zookeeper could no longer reach the lost brokers.

In other words, the brokers were not actually down; they simply could not reach Zookeeper because of network problems. So, after a while, Zookeeper's attempts stopped, and although the Kafka brokers later became reachable again, the attempts to rebalance had already ceased.
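
The broker-side knobs behind this behaviour are the Zookeeper session settings: if a broker cannot heartbeat within the session timeout, Zookeeper expires its registration and its partitions need a new leader. A sketch of those settings in server.properties (the values shown are old defaults, not a recommendation):

# How long a broker may go without reaching Zookeeper before its
# session expires and it is treated as gone.
zookeeper.session.timeout.ms=6000

# How long the broker waits when first connecting to Zookeeper.
zookeeper.connection.timeout.ms=6000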

Simply restarting my brokers so that they reconnected to Zookeeper solved my problem: on restart, each Kafka broker tells Zookeeper "I'm here, waiting for your instructions", and Zookeeper, recognizing the lost leaders, reconnects everything in the right place.
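
For completeness, the restart itself can be done with the stock scripts on each broker (paths assume a tarball install; the systemd unit name kafka below is only an assumption):

# On each broker, stop Kafka and start it again in the background.
bin/kafka-server-stop.sh
bin/kafka-server-start.sh -daemon config/server.properties

# Or, if Kafka runs under systemd (unit name is an assumption):
sudo systemctl restart kafka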

I hope this helps.
