
Kafka broker node goes down with "Too many open files" error

We have a 3-node Kafka cluster with 35 topics of 50 partitions each, all configured with a replication factor of 2. We are seeing a very strange problem: intermittently, a Kafka node stops responding with the following error:

ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
  at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
  at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
  at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
  at kafka.network.Acceptor.accept(SocketServer.scala:460)
  at kafka.network.Acceptor.run(SocketServer.scala:403)
  at java.lang.Thread.run(Thread.java:745)

We have deployed the latest Kafka version and are using spring-kafka as the client:

kafka_2.12-2.1.0 (CentOS Linux release 7.6.1810 (Core))

  • There are three observations:
    1. If we run lsof -p <kafka_pid> | wc -l, we get a total of only around 7,000 open descriptors.
    2. If we run lsof | grep kafka | wc -l, we get around 1.5 million open FDs. We have checked that all of them belong to the Kafka process.
    3. If we downgrade the system to CentOS 6, the output of lsof | grep kafka | wc -l comes back down to around 7,000.
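The discrepancy between the two counts can be reproduced with the commands below (a sketch; KAFKA_PID stands for the broker's real PID, and $$ is substituted only so the sketch runs against the current shell). On CentOS 7, plain lsof emits one line per task (thread) per descriptor, so a broker with hundreds of threads multiplies its true FD count hundreds of times over, while lsof -p and procfs report each descriptor once:

```shell
#!/bin/sh
# Substitute the broker's real PID; $$ (this shell) keeps the sketch runnable.
KAFKA_PID=${KAFKA_PID:-$$}

if command -v lsof >/dev/null; then
    # One line per descriptor actually held by the process:
    lsof -p "$KAFKA_PID" | wc -l

    # Plain lsof on CentOS 7 lists every thread separately, repeating each
    # descriptor once per thread, which is how ~7000 FDs become ~1.5 million:
    lsof 2>/dev/null | grep -c kafka
fi

# Thread-independent count straight from procfs (no per-thread inflation):
ls "/proc/$KAFKA_PID/fd" | wc -l
```

If the procfs count agrees with lsof -p, the 1.5 million figure is a reporting artifact rather than real descriptor usage.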

We have tried setting the file limits very high, but we still hit this issue. The following limits are set for the Kafka process:

cat /proc/<kafka_pid>/limits
    Limit                     Soft Limit           Hard Limit           Units
    Max cpu time              unlimited            unlimited            seconds
    Max file size             unlimited            unlimited            bytes
    Max data size             unlimited            unlimited            bytes
    Max stack size            8388608              unlimited            bytes
    Max core file size        0                    unlimited            bytes
    Max resident set          unlimited            unlimited            bytes
    Max processes             513395               513395               processes
    Max open files            500000               500000               files
    Max locked memory         65536                65536                bytes
    Max address space         unlimited            unlimited            bytes
    Max file locks            unlimited            unlimited            locks
    Max pending signals       513395               513395               signals
    Max msgqueue size         819200               819200               bytes
    Max nice priority         0                    0
    Max realtime priority     0                    0
    Max realtime timeout      unlimited            unlimited            us
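As a sanity check (a sketch; paths assume a Linux broker host), the limit the running process actually inherited and its live descriptor usage can be compared directly from procfs. This matters because a ulimit raised in a login shell does not necessarily reach a service started by an init system:

```shell
#!/bin/sh
# Substitute the broker's real PID; $$ keeps the sketch runnable as-is.
KAFKA_PID=${KAFKA_PID:-$$}

# The open-files limit the *running* process inherited:
grep 'Max open files' "/proc/$KAFKA_PID/limits"

# Live descriptor usage, counted without lsof:
ls "/proc/$KAFKA_PID/fd" | wc -l

# If the broker runs as a systemd service, the limit must be raised in the
# unit file itself (value shown matches the limits above):
#   [Service]
#   LimitNOFILE=500000
```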

We have a few questions here:

  • Why does the broker go down intermittently when we have already configured such large process limits? Does Kafka require even more available file descriptors?
  • Why do the outputs of lsof and lsof -p differ between CentOS 6 and CentOS 7?
  • Are 3 broker nodes too few? With a replication factor of 2, each topic has around 100 partition replicas distributed among 3 nodes, i.e. around 33 replicas per node per topic.
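For the third question, the per-broker replica count can be worked out from the figures in the question (a quick sketch):

```shell
#!/bin/sh
topics=35
partitions_per_topic=50
replication_factor=2
brokers=3

replicas_total=$(( topics * partitions_per_topic * replication_factor ))
replicas_per_broker=$(( replicas_total / brokers ))

echo "total replicas in cluster: $replicas_total"
echo "replicas per broker:       $replicas_per_broker"
```

Note that "33 per node" holds per topic; across all 35 topics each broker hosts roughly 1,166 partition replicas, and since each replica keeps at least a log segment plus index files open, a steady-state count of several thousand descriptors, like the ~7,000 reported by lsof -p, is within the expected range.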

Edit 1: It seems we are hitting this Kafka issue: https://issues.apache.org/jira/browse/KAFKA-7697

We plan to downgrade the Kafka version to 2.0.1.

Based on the asker's earlier update, he found that he was hitting https://issues.apache.org/jira/browse/KAFKA-7697

A quick check now shows that the issue is resolved; based on the JIRA, the fix for this problem is to use Kafka 2.1.1 or above.
