

Kafka broker not attempting to restart after session timeout

I have 3 broker containers and 1 zookeeper container running in a docker stack. Both the zookeeper container and the broker containers keep stopping after a few days (less than a week) of running in an idle state.

This is one of the broker logs where I see an error, but I cannot figure out how to handle it:

[2022-07-13 02:22:30,109] INFO [UnifiedLog partition=messages-processed-0, dir=/bitnami/kafka/data] Truncating to 6 has no effect as the largest offset in the log is 5 (kafka.log.UnifiedLog)
[2022-07-13 02:23:33,766] INFO [Controller id=1] Newly added brokers: , deleted brokers: 2, bounced brokers: , all live brokers: 1,3 (kafka.controller.KafkaController)
[2022-07-13 02:23:33,766] INFO [RequestSendThread controllerId=1] Shutting down (kafka.controller.RequestSendThread)
[2022-07-13 02:23:33,853] INFO [RequestSendThread controllerId=1] Stopped (kafka.controller.RequestSendThread)
[2022-07-13 02:23:33,853] INFO [RequestSendThread controllerId=1] Shutdown completed (kafka.controller.RequestSendThread)
[2022-07-13 02:23:34,226] INFO [Controller id=1] Broker failure callback for 2 (kafka.controller.KafkaController)
[2022-07-13 02:23:34,227] INFO [Controller id=1 epoch=3] Sending UpdateMetadata request to brokers Set() for 0 partitions (state.change.logger)
[2022-07-13 02:23:36,414] ERROR [Controller id=1 epoch=3] Controller 1 epoch 3 failed to change state for partition __consumer_offsets-30 from OfflinePartition to OnlinePartition (state.change.logger)
kafka.common.StateChangeFailedException: Failed to elect leader for partition __consumer_offsets-30 under strategy OfflinePartitionLeaderElectionStrategy(false)
    at kafka.controller.ZkPartitionStateMachine.$anonfun$doElectLeaderForPartitions$7(PartitionStateMachine.scala:424)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
    at kafka.controller.ZkPartitionStateMachine.doElectLeaderForPartitions(PartitionStateMachine.scala:421)
    at kafka.controller.ZkPartitionStateMachine.electLeaderForPartitions(PartitionStateMachine.scala:332)
    at kafka.controller.ZkPartitionStateMachine.doHandleStateChanges(PartitionStateMachine.scala:238)
    at kafka.controller.ZkPartitionStateMachine.handleStateChanges(PartitionStateMachine.scala:158)
    at kafka.controller.PartitionStateMachine.triggerOnlineStateChangeForPartitions(PartitionStateMachine.scala:74)
    at kafka.controller.PartitionStateMachine.triggerOnlinePartitionStateChange(PartitionStateMachine.scala:59)
    at kafka.controller.KafkaController.onReplicasBecomeOffline(KafkaController.scala:627)
    at kafka.controller.KafkaController.onBrokerFailure(KafkaController.scala:597)
    at kafka.controller.KafkaController.processBrokerChange(KafkaController.scala:1621)
    at kafka.controller.KafkaController.process(KafkaController.scala:2495)
    at kafka.controller.QueuedEvent.process(ControllerEventManager.scala:52)
    at kafka.controller.ControllerEventManager$ControllerEventThread.process$1(ControllerEventManager.scala:130)
    at kafka.controller.ControllerEventManager$ControllerEventThread.$anonfun$doWork$1(ControllerEventManager.scala:133)
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
    at kafka.metrics.KafkaTimer.time(KafkaTimer.scala:31)
    at kafka.controller.ControllerEventManager$ControllerEventThread.doWork(ControllerEventManager.scala:133)
    at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:96)

This is part of the zookeeper log from around the same time:

2022-07-13 02:14:45,002 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba7760000, timeout of 18000ms exceeded
2022-07-13 02:15:13,001 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba7760007, timeout of 18000ms exceeded
2022-07-13 02:15:29,832 [myid:] - INFO  [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760006 for client /172.18.0.5:55356, probably expired
2022-07-13 02:15:42,419 [myid:] - INFO  [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760000 for client /172.18.0.4:59474, probably expired
2022-07-13 02:15:52,350 [myid:] - INFO  [NIOWorkerThread-2:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba7760007 for client /172.18.0.6:34406, probably expired
2022-07-13 02:16:49,001 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba776000b, timeout of 18000ms exceeded
2022-07-13 02:17:12,434 [myid:] - INFO  [NIOWorkerThread-2:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba776000b for client /172.18.0.5:56264, probably expired
2022-07-13 02:19:17,067 [myid:] - WARN  [NIOWorkerThread-1:o.a.z.s.NIOServerCnxn@371] - Unexpected exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.18.0.4:60150, session = 0x10000fba776000d
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:170)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:333)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
2022-07-13 02:23:29,002 [myid:] - INFO  [SessionTracker:o.a.z.s.ZooKeeperServer@632] - Expiring session 0x10000fba776000e, timeout of 18000ms exceeded
2022-07-13 02:24:05,059 [myid:] - INFO  [NIOWorkerThread-1:o.a.z.s.ZooKeeperServer@1087] - Invalid session 0x10000fba776000e for client /172.18.0.5:32886, probably expired
2022-07-13 03:48:55,209 [myid:] - WARN  [NIOWorkerThread-2:o.a.z.s.NIOServerCnxn@371] - Unexpected exception
org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /172.18.0.4:33508, session = 0x10000fba776000d
    at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:170)
    at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:333)
    at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:508)
    at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:153)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)


I've seen that a maxSessionTimeout of 40000ms is set in my zoo.cfg (zookeeper side) and a timeout of 18000ms is set in server.properties on the broker side. Should I increase one of those?
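For reference, these are the settings I'm referring to (I'm assuming the broker-side 18000ms is zookeeper.session.timeout.ms, since it matches the "timeout of 18000ms exceeded" messages in the ZooKeeper log above):

# zoo.cfg (ZooKeeper side)
maxSessionTimeout=40000

# server.properties (broker side)
zookeeper.session.timeout.ms=18000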

**kafka-topics --describe** output for one of the topics that went down: https://prnt.sc/r3hU5wv3jK-h
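Roughly the command used to produce that output (the bootstrap address is just an example for my setup, messages-processed is one of the topics from the broker log above, and in the bitnami image the script is kafka-topics.sh):

kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic messages-processed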

Image used for the broker container:

In your uploaded image, it appears brokerId=1 has crashed on its own, which is probably not related to the logs you've shown.

The logs you did show ("Failed to elect leader") are happening on the other brokers because __consumer_offsets has ReplicationFactor: 1, which is why you see Leader: none, Replicas: 1. In a healthy Kafka cluster, no partition's leader should be none.

This can be fixed on an existing Kafka cluster using kafka-reassign-partitions, but since you are using Docker, the best way to fix it would be to first clear all the data, stop the containers, then add the missing KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=3 environment variable, and restart them all.
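If you go the Docker route, here is a minimal sketch of what a broker service could look like in the compose/stack file once the variable is added (service names, image tag, and the zookeeper address are placeholders, not taken from your setup):

version: "3"
services:
  kafka-1:
    image: bitnami/kafka:latest
    environment:
      # replicate the internal offsets topic across all 3 brokers
      - KAFKA_CFG_OFFSETS_TOPIC_REPLICATION_FACTOR=3
      - KAFKA_CFG_ZOOKEEPER_CONNECT=zookeeper:2181
  # kafka-2 and kafka-3 would carry the same environment entries

Keep in mind that offsets.topic.replication.factor only takes effect when __consumer_offsets is first auto-created, which is why the data has to be wiped before restarting; for an existing cluster, kafka-reassign-partitions with a JSON file listing the new replica assignments is the way to raise the replication factor in place.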

Or, as mentioned in the comments, don't use 3 brokers on a single host. If you are using a single host, timeouts aren't really relevant anyway, because all network requests are local.

You can try the following:

docker volume rm sentry-kafka
docker volume rm sentry-zookeeper
docker volume rm sentry_onpremise_sentry-kafka-log
docker volume rm sentry_onpremise_sentry-zookeeper-log


./install.sh  # to create Kafka partitions
