
kafka Synchronization: "java.io.IOException: Too many open files"

We are having a problem with Kafka. Sometimes, suddenly and without warning, we go out of synchronization and start to get exceptions when emitting events.

The exception we are getting is:

java.io.IOException: Too many open files

It seems this is a generic exception thrown by Kafka in many cases. We investigated it a little and we think the root cause is that when we try to emit events to some topic, the emit fails because Kafka doesn't have a leader partition for that topic.

Can someone help?

I assume that you are on Linux. If that is the case, then what's happening is that you are running out of open file descriptors. The real question is why this is happening.

Linux by default generally keeps this number fairly low. You can check the actual value via ulimit:

ulimit -a | grep "open files"

You can then raise that value, again via ulimit (note that ulimit is a shell builtin, so it only affects the current shell and must be run in the session that starts the broker rather than through sudo):

ulimit -n 4096

That said, unless the Kafka host in question has lots of topics / partitions, it is unusual to hit that limit. What's probably happening is that some other process is keeping files or connections open. In order to figure out which process, you're going to have to do some detective work with lsof.
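For example, a quick way to see which processes hold the most file descriptors is to count lsof output per process; a rough sketch (the "kafka" process pattern below is an assumption, adjust it for your setup):

# count open file descriptors per process name, highest first
lsof | awk '{print $1}' | sort | uniq -c | sort -rn | head

# or count only the descriptors held by the Kafka broker process
lsof -p "$(pgrep -f kafka | head -n 1)" | wc -l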

One case where this happens is when you have a large number of partitions, because each partition maps to a directory in the broker's file system that contains two files: one for the index and another for the data. The broker opens both files, so the more partitions there are, the more open files there are. As Doomy said, you can increase the open files limit on Linux, but this setting is not permanent: when you close the session it disappears, and on the next login, if you check with this command

ulimit -a | grep "open files"

you will see the old number again. But with the following configuration you can make it permanent:

Open this file:

sudo nano /etc/pam.d/common-session

and add this line:

session required pam_limits.so

After that you can set the limit in limits.conf like this:

sudo nano /etc/security/limits.conf

and then you can set the limit in this file, for example:

* soft nofile 80000

or any hard limit you need. After that, close your session and check the open files limit again.
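One way to verify after logging back in is to compare the soft and hard limits, and optionally the limit the running broker actually inherited; a minimal sketch (the "kafka" pgrep pattern is an assumption):

# soft and hard limits for the current shell
ulimit -Sn
ulimit -Hn

# limit inherited by the running broker process
grep "open files" "/proc/$(pgrep -f kafka | head -n 1)/limits"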

I had a similar "java.io.IOException: Too many open files" issue on Linux/CentOS. In my case, after checking open fds with lsof, it was kafka-web-console which was opening too many connections. Stopping it solved my problem.

In our case our Kafka topics were accidentally configured with "segment.ms" = 20000 and were generating new log segments every 20 seconds, when the default is 604800000 (1 week).

We are using Amazon's MSK, so we didn't have the ability to run the commands ourselves, but Amazon support was able to monitor it for us. That caused this issue, but then some of the nodes were not recovering.
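For reference, on a cluster where you can run the admin tools yourself, the per-topic segment.ms override could be inspected and removed with kafka-configs.sh; a rough sketch with a placeholder topic name and bootstrap address:

# show any per-topic overrides, including segment.ms
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic --describe

# drop the override so the topic falls back to the broker default (1 week)
kafka-configs.sh --bootstrap-server localhost:9092 --entity-type topics --entity-name my-topic --alter --delete-config segment.ms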

We took two steps:

1) Force Compaction

We set the retention and the cleanable dirty ratio low to clean up:

"delete.retention.ms" = 100
"min.cleanable.dirty.ratio" = "0.01"

One of the nodes was able to recover... but another did not seem to recover to the point where Kafka would actually run compaction; it seemed to be the "leader" for one of the largest topics.

2) Free up space

We decided to destroy the large topic in the hope it would unblock the node. Eventually the compaction seemed to run on all the nodes.

Later we restored the topic that we had destroyed, with the new segment settings, and it has been running fine since.
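If a topic is recreated this way, the corrected segment settings can be passed at creation time; a minimal sketch with placeholder topic name, partition/replica counts, and broker address:

kafka-topics.sh --bootstrap-server localhost:9092 --create --topic my-topic --partitions 6 --replication-factor 3 --config segment.ms=604800000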
