Facing large data spills for small datasets on Spark

I am trying to run some Spark SQL on the NOAA datasets available here:

https://www.ncei.noaa.gov/data/global-summary-of-the-day/access/2021/

I am trying to run a query that involves grouping and sorting:

df
      .groupBy("COUNTRY_FULL")                                       // group rows by country
      .agg(max("rank"), last("consecutive").as("consecutive"))       // max rank and last consecutive value per country
      .withColumn("maxDays", maxDaysTornodoUdf(col("consecutive")))  // derive max consecutive days via UDF
      .sort(col("maxDays").desc)                                     // order by maxDays, descending
      .limit(1)
      .show()

The input is just 50 MB of zipped CSVs and I am running this locally (4 cores). These are the settings I use (a config sketch follows the list):

  • spark.driver.memory: 14g
  • spark.sql.windowExec.buffer.in.memory.threshold: 20000
  • spark.sql.windowExec.buffer.spill.threshold: 20000
  • spark.sql.shuffle.partitions: 400
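
For reference, a minimal sketch of how these settings might be applied when building the local session (the app name and master string are illustrative assumptions, not from the question):

import org.apache.spark.sql.SparkSession

// Local session with the settings listed above (illustrative only).
// Note: spark.driver.memory normally has to be set before the driver JVM
// starts (e.g. via spark-submit --driver-memory 14g); it is included here
// only to mirror the configuration quoted in the question.
val spark = SparkSession.builder()
  .appName("gsod-analysis")   // hypothetical app name
  .master("local[4]")         // 4 cores, as described in the question
  .config("spark.driver.memory", "14g")
  .config("spark.sql.windowExec.buffer.in.memory.threshold", "20000")
  .config("spark.sql.windowExec.buffer.spill.threshold", "20000")
  .config("spark.sql.shuffle.partitions", "400")
  .getOrCreate()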

I see too many disk spills for such a small dataset:

21/08/16 10:23:13 INFO UnsafeExternalSorter: Thread 54 spilling sort data of 416.0 MB to disk (371  times so far)
21/08/16 10:23:14 INFO UnsafeExternalSorter: Thread 79 spilling sort data of 416.0 MB to disk (130  times so far)
21/08/16 10:23:14 INFO UnsafeExternalSorter: Thread 53 spilling sort data of 400.0 MB to disk (240  times so far)
21/08/16 10:23:14 INFO UnsafeExternalSorter: Thread 69 spilling sort data of 400.0 MB to disk (24  times so far)
21/08/16 10:23:16 INFO UnsafeExternalSorter: Thread 54 spilling sort data of 416.0 MB to disk (372  times so far)
21/08/16 10:23:16 INFO UnsafeExternalSorter: Thread 79 spilling sort data of 416.0 MB to disk (131  times so far)

However, when I check the Spark UI, the spill does not seem to be that large:

[screenshot of the Spark UI]

Eventually the Spark job terminates with the error "Not Enough memory". I do not understand what is happening.

You are using 400 for spark.sql.shuffle.partitions, which is far too many for the data size you are dealing with.

Having more shuffle partitions for a small amount of data creates more tasks than necessary, and the per-task overhead reduces performance. Read the best practices for configuring shuffle partitions here.

Try reducing the number of shuffle partitions. You may try setting it to spark.sparkContext.defaultParallelism.
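
A minimal sketch of that suggestion, assuming a SparkSession named spark is in scope and that df and maxDaysTornodoUdf are the DataFrame and UDF from the question:

import org.apache.spark.sql.functions.{col, last, max}

// Align the number of shuffle partitions with the available parallelism
// (4 on a local[4] run) instead of the 400 configured in the question.
spark.conf.set("spark.sql.shuffle.partitions",
  spark.sparkContext.defaultParallelism.toString)

// Re-run the aggregation from the question; the shuffle now produces
// only a few partitions instead of 400 mostly-empty ones.
df
  .groupBy("COUNTRY_FULL")
  .agg(max("rank"), last("consecutive").as("consecutive"))
  .withColumn("maxDays", maxDaysTornodoUdf(col("consecutive")))
  .sort(col("maxDays").desc)
  .limit(1)
  .show()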
