Standalone PySpark error: Too many open files

Question

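I have ~40 GB of data (~80M records, only 2 columns, text) and I am running a distinct count on it. It runs successfully on an r5a.4xlarge instance on AWS and takes approx. 3 minutes to return a result. However, when I switch to a larger instance, r5a.12xlarge, the same code fails with a "Too many open files" error. I tried several different configurations for the Spark session, but none of them worked. I also increased the Linux open-files limit to 4096, with no change. Below are the code and the first part of the error.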

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName('Project_name')
    .config('spark.executor.memory', "42G")  # Tried 19G to 60G
    .config('spark.executor.instances', "4")  # Tried 1 to 5
    .config('spark.executor.cores', "4")  # Tried 1 to 5
    .config("spark.dynamicAllocation.enabled", "true")  # Also tried without dynamic allocation
    .config("spark.dynamicAllocation.minExecutors", "1")
    .config("spark.dynamicAllocation.maxExecutors", "5")
    .config('spark.driver.memory', "42G")  # Tried 19G to 60G
    .config('spark.driver.maxResultSize', '10G')  # Tried 1G to 10G
    .config('spark.worker.cleanup.enabled', 'True')
    .config("spark.local.dir", "/tmp/spark-temp")
    .getOrCreate())

Error:

>>> data.select(f.countDistinct("column_name")).show()

Py4JJavaError: An error occurred while calling o315.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 20 in stage 5.0 failed 1 times, most recent failure: Lost task 20.0 in stage 5.0 (TID 64, localhost, executor driver): java.io.FileNotFoundException: /tmp/spark-temp/blockmgr-c2f18891-a868-42ba-9075-dc145faaa4c4/16/temp_shuffle_f9c96d48-336d-423a-9edd-dcb9af5705a7 (Too many open files)

Any ideas?
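Since the question mentions raising the Linux open-files limit to 4096, it may be worth confirming which limit the Python process that starts Spark actually sees, because the limit is per process and a change made in another shell or config file does not always reach it. A minimal sketch using only the standard-library `resource` module (Unix only); the function names are made up for illustration, and it assumes the Spark JVM is launched from this Python process and so inherits its limits:

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Try to raise the soft limit to `target`, capped at the hard limit.

    Hypothetical helper: only the soft limit is changed, which an
    unprivileged process is allowed to do up to the hard limit.
    """
    soft, hard = open_file_limits()
    if hard != resource.RLIM_INFINITY:
        target = min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return open_file_limits()


soft, hard = open_file_limits()
print(f"soft limit: {soft}, hard limit: {hard}")
```

If the soft limit printed here is lower than expected, calling `raise_soft_limit(...)` before creating the SparkSession is one way to test whether the limit is the bottleneck.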

Answer 1

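Since it is a huge file, when Spark reads it, it creates 292 partitions for it (292 * 128 MB ~ 40 GB). By default, Spark has spark.sql.shuffle.partitions=200. So you just need to increase this number to something higher than the number of partitions. You can also cache the file in memory for better performance.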

from pyspark.sql import SparkSession

spark = (SparkSession
    .builder
    .appName('Project_name')
    .config('spark.executor.memory', "20G")
    .config('spark.driver.memory', "20G")
    .config('spark.driver.maxResultSize', '10G')
    .config('spark.sql.shuffle.partitions', 300)  # Increasing SQL shuffle partitions above the 292 input partitions
    .config('spark.worker.cleanup.enabled', 'True')
    .config("spark.local.dir", "/tmp/spark-temp")
    .getOrCreate())

>>> data.select(f.countDistinct("column_name")).show() # Result in ~2min
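To make the 292 figure above easier to reproduce, here is a rough sketch of the arithmetic in plain Python. It assumes a splittable input and a 128 MB split size, which is the default of `spark.sql.files.maxPartitionBytes`; Spark's real split logic also takes other settings into account, so treat the result as an estimate. The function name is hypothetical.

```python
# Assumption: 128 MB split size (default spark.sql.files.maxPartitionBytes).
DEFAULT_MAX_PARTITION_BYTES = 128 * 1024 * 1024


def estimate_input_partitions(total_bytes, max_partition_bytes=DEFAULT_MAX_PARTITION_BYTES):
    """Estimate how many input partitions a splittable file of `total_bytes` yields."""
    # Integer ceiling division: a partial last chunk still needs its own partition.
    return -(-total_bytes // max_partition_bytes)


# The size implied by the answer: 292 chunks of 128 MB each.
file_size = 292 * DEFAULT_MAX_PARTITION_BYTES
partitions = estimate_input_partitions(file_size)
print(f"estimated input partitions: {partitions}")
```

On a live session you can check the actual numbers instead of estimating: `data.rdd.getNumPartitions()` reports the input partition count and `spark.conf.get("spark.sql.shuffle.partitions")` shows the current shuffle setting.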
