
Kafka stops after a certain period of time; brokers are lost

I have 4 Kafka brokers running with Debezium. After some days of running well, three of the Kafka machines dropped off the network for a period of time, and the connectDistributed.out log file filled up with messages showing the following error:

[2020-05-04 13:27:02,526] WARN [Consumer clientId=connector-consumer-sink-warehouse-6, groupId=connect-sink-warehouse] 133 partitions have leader brokers without a matching listener, including [SCT010-2, SC2010-2, SC1010-0, SC1010-1, SF4010-0, SUB010-0, SUB010-1, SWP010-0, SWP010-1, ACO010-2] (org.apache.kafka.clients.NetworkClient:1044)

I have 4 Kafka machines, with broker IDs 0 to 3:

192.168.240.70 - Broker 0
192.168.240.71 - Broker 1
192.168.240.72 - Broker 2
192.168.240.73 - Broker 3

Zookeeper:

192.168.240.70

Here is my server.properties. The files are identical on all machines, except for listeners and advertised.listeners, which point to the IP of the machine Kafka is installed on, and broker.id, which must be unique (0 to 3):

broker.id=0
listeners=CONTROLLER://192.168.240.70:9091,INTERNAL://192.168.240.70:9092
advertised.listeners=CONTROLLER://192.168.240.70:9091,INTERNAL://192.168.240.70:9092
listener.security.protocol.map=CONTROLLER:PLAINTEXT,INTERNAL:PLAINTEXT
control.plane.listener.name=CONTROLLER
inter.broker.listener.name=INTERNAL
num.network.threads=3
num.io.threads=8
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
log.dirs=/home/william/kafka/data/kafka/
num.partitions=3
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=1
log.retention.hours=150
log.retention.bytes=200000000000
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

zookeeper.connect=192.168.240.70:2181
zookeeper.connection.timeout.ms=6000
group.initial.rebalance.delay.ms=3
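
Before looking at the topics, it can help to verify which brokers Zookeeper currently knows about. A quick sketch using the zookeeper-shell tool that ships with Kafka (paths assume a plain tarball install under the Kafka home directory):

# List the broker IDs currently registered in Zookeeper;
# with all four brokers healthy this should print [0, 1, 2, 3].
bin/zookeeper-shell.sh 192.168.240.70:2181 ls /brokers/ids

# Inspect one broker's registration, including its advertised listeners.
bin/zookeeper-shell.sh 192.168.240.70:2181 get /brokers/ids/0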

The Kafka Connect internal topics (configs, offsets and status) show replication problems. Could this be related to the listeners config?

[Screenshot: Kafka Manager showing the replication info]

About the connectors' health:

[Screenshot: connector health status in Kafka Connect]

And in Kafka Connect, only one broker is present:

[Screenshot: Kafka brokers]
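
To see the replication state directly, you can describe the Connect internal topics with kafka-topics.sh. A sketch, assuming the commonly used topic names connect-configs, connect-offsets and connect-status (they are set in connect-distributed.properties and may differ in your setup):

# Show the leader and ISR for each partition of a Connect internal topic.
bin/kafka-topics.sh --bootstrap-server 192.168.240.70:9092 --describe --topic connect-offsets

# List only the partitions whose leader is currently unavailable.
bin/kafka-topics.sh --bootstrap-server 192.168.240.70:9092 --describe --unavailable-partitions

On older Kafka versions, replace --bootstrap-server 192.168.240.70:9092 with --zookeeper 192.168.240.70:2181.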

How can I fix this error? It seems to be related to leader election, or to finding the leader again after a long period without broker access.

After some research I found the problem. So, to help here, this is the concept behind it:

When we create a distributed Kafka system, we distribute the topic partitions across the brokers and elect a leader for each of them. In my case, I have 4 brokers, and Zookeeper chooses on the fly which one will be the leader of a given topic.
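
Incidentally, the broker that performs this leader election (the controller) is itself registered in Zookeeper, so you can check which broker currently holds that role. A sketch with the same zookeeper-shell tool:

# Print the id of the broker currently acting as controller.
bin/zookeeper-shell.sh 192.168.240.70:2181 get /controller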

As three of the four Kafka servers were unreachable for more than an hour, Zookeeper kept trying to reach the leaders and could not. And because my config requires each topic to be replicated to three brokers, Zookeeper could not keep the topics healthy.

We also have the rebalance config group.initial.rebalance.delay.ms=3, which delays the initial consumer group rebalance by only 3 milliseconds (note the unit: milliseconds, not seconds), and there is a limited number of reconnect attempts before a Kafka broker is considered down. Those attempts were exhausted, and Zookeeper could no longer reach the lost brokers.

In other words, the brokers were not actually down; they simply could not reach Zookeeper because of network problems. So, after a while, Zookeeper's attempts stopped, and although the Kafka brokers later became reachable again, the attempts to rebalance had already ceased.
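
The broker-side knobs behind this behaviour are the Zookeeper session settings: if a broker cannot heartbeat within the session timeout, Zookeeper expires its registration and its partitions need a new leader. A sketch of those settings in server.properties (the values shown are old defaults, not a recommendation):

# How long a broker may go without reaching Zookeeper before its
# session expires and it is treated as gone.
zookeeper.session.timeout.ms=6000

# How long the broker waits when first connecting to Zookeeper.
zookeeper.connection.timeout.ms=6000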

Simply restarting my brokers so that they reconnected to Zookeeper solved my problem: on restart, each Kafka broker tells Zookeeper "I'm here, waiting for your instructions", and Zookeeper, recognizing the lost leaders, reconnects everything in the right place.
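
For completeness, the restart itself can be done with the stock scripts on each broker (paths assume a tarball install; the systemd unit name kafka below is only an assumption):

# On each broker, stop Kafka and start it again in the background.
bin/kafka-server-stop.sh
bin/kafka-server-start.sh -daemon config/server.properties

# Or, if Kafka runs under systemd (unit name is an assumption):
sudo systemctl restart kafka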

I hope this helps.
