
Why does Spark job fail with "too many open files"?

I get "too many open files" during the shuffle phase of my Spark job. Why is my job opening so many files? What steps can I take to try to make my job succeed?

This has been answered on the Spark user list:

The best way is definitely just to increase the ulimit if possible, this is sort of an assumption we make in Spark that clusters will be able to move it around.

You might be able to hack around this by decreasing the number of reducers [or cores used by each node] but this could have some performance implications for your job.

In general if a node in your cluster has C assigned cores and you run a job with X reducers then Spark will open C*X files in parallel and start writing. Shuffle consolidation will help decrease the total number of files created but the number of file handles open at any time doesn't change so it won't help the ulimit problem.

-Patrick Wendell
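To get a feel for the C*X figure, here is a quick back-of-the-envelope calculation; the core and reducer counts are hypothetical, not taken from any particular cluster:

```python
# Rough estimate of simultaneous shuffle file handles per node.
cores_per_node = 16   # C: cores assigned to the node (hypothetical)
num_reducers = 200    # X: reducers / shuffle partitions (hypothetical)

open_files = cores_per_node * num_reducers
print(open_files)  # 3200 -- well above the default ulimit of 1024
```

Even a modest 16-core node with the common default of 200 shuffle partitions already needs more than three times the default limit of 1024 open files.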

The default ulimit is 1024, which is ridiculously low for large-scale applications. HBase recommends up to 64K; modern Linux systems don't seem to have trouble with this many open files.

use

ulimit -a

to see your current limits, including the maximum number of open files

ulimit -n

can temporarily change the number of open files; you need to update the system configuration files and per-user limits to make this permanent. On CentOS and RedHat systems, that can be found in

/etc/sysctl.conf
/etc/security/limits.conf
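You can also inspect, and within the hard limit raise, the soft limit from inside a Python process using the standard-library `resource` module. A minimal sketch (the 4096 target is just an illustrative value, and `resource` is POSIX-only):

```python
import resource

# Current soft/hard limits on open file descriptors for this process.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Raise the soft limit toward 4096, but never past the hard limit.
target = 4096 if hard == resource.RLIM_INFINITY else min(hard, 4096)
if soft < target:
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
```

Note this only affects the current process and its children; for a cluster-wide, permanent change you still need the configuration files above.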

Another solution for this error is reducing the number of partitions.

Check to see if you've got a lot of partitions with:

someBigSDF.rdd.getNumPartitions()

Out[]: 200

# if you need to persist the repartition, do it like this
someBigSDF = someBigSDF.repartition(20)

# if you just need it for one transformation/action,
# you can do the repartition inline like this
# (count here is the aggregate function from pyspark.sql.functions)
from pyspark.sql.functions import count
someBigSDF.repartition(20).groupBy("SomeDt").agg(count("SomeQty")).orderBy("SomeDt").show()


 