Kafka broker node goes down with "Too many open files" error
We have a 3-node Kafka cluster deployment, with a total of 35 topics of 50 partitions each, and we have configured a replication factor of 2. We are seeing a very strange problem: intermittently, a Kafka node stops responding with the error:
ERROR Error while accepting connection (kafka.network.Acceptor)
java.io.IOException: Too many open files
at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
at kafka.network.Acceptor.accept(SocketServer.scala:460)
at kafka.network.Acceptor.run(SocketServer.scala:403)
at java.lang.Thread.run(Thread.java:745)
We have deployed the latest Kafka version and are using spring-kafka as the client:
kafka_2.12-2.1.0 (CentOS Linux release 7.6.1810 (Core))
When we run `lsof -p <kafka_pid>|wc -l`, we get a total of only around 7000 open descriptors. But when we run `lsof|grep kafka|wc -l`, we get around 1.5 million open FDs, and we have checked that they all belong to the Kafka process. After the broker is restarted, `lsof|grep kafka|wc -l` comes back to around 7000.
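The gap between the two counts may itself be a measurement artifact rather than a real leak: on some lsof builds (including the one shipped with CentOS 7), plain `lsof` emits one row per task (thread) and also lists memory-mapped files that do not consume an fd slot, so a heavily threaded JVM like Kafka gets counted many times over. The kernel's own view in `/proc/<pid>/fd` is authoritative. A minimal sketch (here using the script's own shell as a stand-in for `<kafka_pid>`):

```shell
#!/bin/sh
# Count the file descriptors a process actually holds.
# /proc/<pid>/fd has exactly one entry per open descriptor,
# unlike plain lsof, which can repeat entries per thread and
# include mem-mapped files on some distributions.

pid=$$   # stand-in for <kafka_pid>

true_count=$(ls /proc/"$pid"/fd | wc -l)
echo "process $pid holds $true_count file descriptors"

# The per-process limit the kernel actually enforces:
grep 'Max open files' /proc/"$pid"/limits
```

Comparing this number against the `Max open files` line from the same `/proc` tree tells you how close the broker really is to the limit.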
We have tried setting the file limits very high, but we still hit this issue. Following are the limits set for the Kafka process:
cat /proc/<kafka_pid>/limits
Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            8388608      unlimited    bytes
Max core file size        0            unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             513395       513395       processes
Max open files            500000       500000       files
Max locked memory         65536        65536        bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       513395       513395       signals
Max msgqueue size         819200       819200       bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us
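One general pitfall worth noting when raising this limit: if the broker is started via systemd, values in /etc/security/limits.conf do not apply (those only affect PAM login sessions); the unit file's LimitNOFILE is what the process actually receives, which is what /proc/<kafka_pid>/limits reports above. A minimal drop-in sketch, assuming a unit named kafka.service (both the unit name and the path are assumptions about the install):

```
# /etc/systemd/system/kafka.service.d/override.conf  (assumed path)
[Service]
LimitNOFILE=500000
```

After `systemctl daemon-reload` and a broker restart, /proc/<kafka_pid>/limits should show the new Max open files value.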
We have a few questions here, in particular: is there a difference in the output of `lsof` and `lsof -p` between CentOS 6 and CentOS 7?

Edit 1: It seems we are hitting this Kafka issue: https://issues.apache.org/jira/browse/KAFKA-7697
We plan to downgrade the Kafka version to 2.0.1.
Based on the asker's earlier update, he found that he was hitting https://issues.apache.org/jira/browse/KAFKA-7697. A quick check now shows that the issue is resolved, and based on the JIRA, the solution is to use Kafka 2.1.1 or above.
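KAFKA-7697 is a reported broker deadlock in 2.1.0: request handler threads block, connections pile up, and the broker eventually exhausts its file descriptors, so upgrading past the fix is cleaner than downgrading. A small sketch for checking a broker version string against the fix version (the version string itself would come from e.g. the kafka_2.12-<version>.jar name in the broker's libs directory; the comparison relies on GNU `sort -V`):

```shell
#!/bin/sh
# Compare a broker version against the KAFKA-7697 fix version.
version="2.1.0"   # example value; substitute your broker's version
fixed="2.1.1"

# sort -V orders version strings numerically; if the broker version
# sorts first and differs from the fix version, it predates the fix.
lowest=$(printf '%s\n%s\n' "$version" "$fixed" | sort -V | head -n1)
if [ "$version" != "$fixed" ] && [ "$lowest" = "$version" ]; then
  echo "affected: broker $version predates the $fixed fix"
else
  echo "ok: broker $version includes the KAFKA-7697 fix"
fi
# prints "affected: broker 2.1.0 predates the 2.1.1 fix"
```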