Kafka broker not attempting to restart after session timeout
I have 3 broker containers and 1 ZooKeeper container running in a Docker stack; both the ZooKeeper container and the broker containers keep stopping after a few days (less than a week) of running in an idle state.
This is one of the broker logs where I see an error, but I cannot identify how to handle it:
[2022-07-13 02:22:30,109] INFO [UnifiedLog partition=messages-processed-0, dir=/bitnami/kafka/data] Truncating to 6 has no effect as the largest offset in the log is 5 (kafka.log.UnifiedLog)
[2022-07-13 02:23:33,766] INFO [Controller id=1] Newly added brokers: , deleted brokers: 2, bounced brokers: , all live brokers: 1,3 (kafka.controller.KafkaController)
[2022-07-13 02:23:33,766] INFO [RequestSendThread controllerId=1] Shutting down (kafka.controller.RequestSendThread)
[2022-07-13 02:23:33,853] INFO [RequestSendThread controllerId=1] Stopped (kafka.controller.RequestSendThread)
[2022-07-13 02:23:33,853] INFO [RequestSendThread controllerId=1] Shutdown completed (kafka.controller.RequestSendThread)
[2022-07-13 02:23:34,226] INFO [Controller id=1] Broker failure callback for 2 (kafka.controller.KafkaController)
[2022-07-13 02:23:34,227] INFO [Controller id=1 epoch=3] Sending UpdateMetadata request to brokers Set() for 0 partitions (state.change.logger)
[2022-07-13 02:23:36,414] ERROR [Controller id=1 epoch=3] Controller 1 epoch 3 failed to change state for partition __consumer_offsets-30 from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-30 under strategy OfflinePartitionLeaderElectionStrategy(false)
at kafka.controller.ZkPartitionStateMachine.$anonfun$doElectLeaderForPartitions$7(PartitionStateMachine.scala:424)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:421)
at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:332)
at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:238)
at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:158)
at kafka.controller.PartitionStateMachine.triggerOnlineStateChangeForPartitions(PartitionStateMachine.scala:74)
at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:59)
at kafka.controller.KafkaController.onReplicasBecomeOffline(KafkaController.scala:627)
at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:597)
at kafka.controller.KafkaController.processBrokerChange(KafkaController.scala:1621)
at kafka.controller.KafkaController.process(KafkaController.scala:2495)
at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:52)
at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:130)
at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:133)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:133)
at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)
This is part of the ZooKeeper log from the same time:
2022-07-13 02:14:45,002 [myid:] - INFO [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba7760000, timeout of 18000ms exceeded
2022-07-13 02:15:13,001 [myid:] - INFO [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba7760007, timeout of 18000ms exceeded
2022-07-13 02:15:29,832 [myid:] - INFO [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760006 for client /172.18.0.5:55356, probably expired
2022-07-13 02:15:42,419 [myid:] - INFO [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760000 for client /172.18.0.4:59474, probably expired
2022-07-13 02:15:52,350 [myid:] - INFO [NIOWorkerThread-2:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760007 for client /172.18.0.6:34406, probably expired
2022-07-13 02:16:49,001 [myid:] - INFO [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba776000b, timeout of 18000ms exceeded
2022-07-13 02:17:12,434 [myid:] - INFO [NIOWorkerThread-2:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba776000b for client /172.18.0.5:56264, probably expired
2022-07-13 02:19:17,067 [myid:] - WARN [NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@371] - Unexpected exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.18.0.4:60150, session = 0x10000fba776000d
at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:170)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:333)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
2022-07-13 02:23:29,002 [myid:] - INFO [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba776000e, timeout of 18000ms exceeded
2022-07-13 02:24:05,059 [myid:] - INFO [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba776000e for client /172.18.0.5:32886, probably expired
2022-07-13 03:48:55,209 [myid:] - WARN [NIOWorkerThread-2:o.a.z.s.NIOServerCnxn@371] - Unexpected exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.18.0.4:33508, session = 0x10000fba776000d
at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:170)
at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:333)
at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
I've seen that a maxSessionTimeout of 40000ms is set in my zoo.cfg (ZooKeeper side) and a timeout of 18000ms is set in server.properties on the broker side. Should I increase one of those?
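For reference, this is roughly what those settings look like (a sketch; the broker-side key names are my assumption based on the standard Kafka configuration, since I only see the 18000ms value):

# zoo.cfg (ZooKeeper side): upper bound ZooKeeper will grant for any client session
maxSessionTimeout=40000

# server.properties (broker side): session timeout the broker negotiates with ZooKeeper
zookeeper.session.timeout.ms=18000
zookeeper.connection.timeout.ms=18000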
**kafka-topics --describe** output for one of the topics that went down: https://prnt.sc/r3hU5wv3jK-h
In your uploaded image, it appears brokerId=1 has crashed on its own, and that is probably not related to the logs you've shown.
The logs that you did show ("Failed to elect leader") are happening on the other brokers because ReplicationFactor: 1 is set on __consumer_offsets. That is why you see Leader: none, Replicas: 1. In a healthy Kafka cluster, no partition's leader should be none.
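You can check this yourself by describing the internal offsets topic (a sketch; the exact script name and bootstrap address depend on your setup, kafka-topics.sh inside the Bitnami containers):

kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic __consumer_offsets
# A healthy 3-broker cluster shows ReplicationFactor: 3 and a numeric Leader for every
# partition; Leader: none means no in-sync replica was available to take over.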
This can be fixed for an existing Kafka cluster using kafka-reassign-partitions, but since you are using Docker, the best way to fix it would be to first clear all the data, stop the containers, then add your missing KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=3 environment variable and restart them all.
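A minimal sketch of what that looks like for one of the three Bitnami-style broker services in your stack file (the service name, image tag, and other settings here are assumptions; only the last variable is the actual fix):

kafka-1:
  image: bitnami/kafka:latest
  environment:
    - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
    # internal topics such as __consumer_offsets get created with 3 replicas,
    # so any single broker can fail without partitions losing their leader
    - KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=3

Note that this setting only applies when the offsets topic is first created, which is why the data has to be cleared before restarting.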
Or, as mentioned in the comments, don't use 3 brokers on a single host. And if you are using a single host, timeouts aren't really relevant because all network requests are local.
You can try the following:
docker volume rm sentry-kafka
docker volume rm sentry-zookeeper
docker volume rm sentry_onpremise_sentry-kafka-log
docker volume rm sentry_onpremise_sentry-zookeeper-log
./install.sh  # to create Kafka partitions
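Once the containers are back up, you can verify that no partition is left without a leader (a sketch; assumes a broker is reachable on localhost:9092):

kafka-topics.sh --bootstrap-server localhost:9092 --describe --unavailable-partitions
# No output means every partition currently has a live leader.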