
Issue of container OOM when writing Dataframe to parquet files in Spark Job

I'm using Machine Learning Workspace in Cloudera Data Platform (CDP). I created a session with 4 vCPU / 16 GiB memory and enabled Spark 3.2.0.

I'm using Spark to load one month of data (the whole month is around 12 GB), do some transformations, and then write the data as parquet files to AWS S3.

My Spark session configuration looks like this:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName(appName)
    .config("spark.driver.memory", "8G")
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "4")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8G")
    .config("spark.sql.shuffle.partitions", 500)
    # ......
    .getOrCreate()
)

Before the data are written to parquet files, they are repartitioned. As @Koedlt suggested, I corrected the "salt" column:

old:

df.withColumn("salt", lit(random.randrange(100)))
.repartition("date_year", "date_month", "date_day", "salt")
.drop("salt").write.partitionBy("date_year", "date_month")
.mode("overwrite").parquet(SOME__PATH)

new:

df.withColumn("salt", floor(rand() * 100))
.repartition("date_year", "date_month", "date_day", "salt")
.drop("salt").write.partitionBy("date_year", "date_month")
.mode("overwrite").parquet(SOME__PATH)

The data transformation with Spark runs successfully, but the Spark job always fails at the last step, when writing the data to parquet files.

Below is an example of the error message:

23/01/15 21:10:59 678 ERROR TaskSchedulerImpl: Lost executor 2 on 100.100.18.155: 
The executor with id 2 exited with exit code -1(unexpected).
The API gave the following brief reason: Evicted
The API gave the following message: Pod ephemeral local storage usage exceeds the total limit of containers 10Gi. 

I think there is no problem with my Spark configuration. The problem is the Kubernetes ephemeral local storage size limit, which I do not have the rights to change.
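For reference, below is a rough sketch of the Spark-side settings that are supposed to let executors keep shuffle/spill data on a mounted volume instead of pod ephemeral storage (Spark on Kubernetes uses volumes whose names start with spark-local-dir- for local data). The volume name, storage class and size are placeholders I made up, and I have not verified whether the workspace allows requesting volumes at all:

from pyspark.sql import SparkSession

# Sketch only: mount an on-demand PVC as the executors' local directory so that
# shuffle/spill data does not count against the pod's ephemeral storage limit.
# "spill", "gp2" and "50Gi" are placeholder values, not tested settings.
prefix = "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-spill"

spark = (
    SparkSession.builder
    .config(f"{prefix}.options.claimName", "OnDemand")
    .config(f"{prefix}.options.storageClass", "gp2")
    .config(f"{prefix}.options.sizeLimit", "50Gi")
    .config(f"{prefix}.mount.path", "/spill")
    .config(f"{prefix}.mount.readOnly", "false")
    .getOrCreate()
)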

Can someone explain why this happened, and what a possible solution for it would be?

I see an issue in this line:

df.withColumn("salt", lit(random.randrange(100)))

What happens when you do this is that random.randrange(100) is evaluated once, on the driver. You then create a literal column with that single value repeated in every row. So you're essentially not salting at all, keeping your original data skew problems. These are possibly at the root of your ephemeral local storage issue.

You need to use the pyspark.sql.functions.rand function to properly create random columns and salt the data.

Let's look at a small example, with the following simple input data:

import random

from pyspark.sql.functions import floor, lit, rand  # imports used in the snippets below

df = spark.createDataFrame(
    [
        (1, 1, "ABC"),
        (1, 2, "BCD"),
        (1, 3, "DEF"),
        (2, 1, "EFG"),
        (2, 2, "GHI"),
        (2, 3, "HIJ"),
        (3, 1, "EFG"),
        (3, 2, "BCD"),
        (3, 3, "HIJ"),
    ],
    ["KEY", "ORDER", "RESP"]
)

Doing what you were doing:

df.withColumn("salt", lit(random.randrange(100))).show()
+---+-----+----+----+
|KEY|ORDER|RESP|salt|
+---+-----+----+----+
|  1|    1| ABC|  86|
|  1|    2| BCD|  86|
|  1|    3| DEF|  86|
|  2|    1| EFG|  86|
|  2|    2| GHI|  86|
|  2|    3| HIJ|  86|
|  3|    1| EFG|  86|
|  3|    2| BCD|  86|
|  3|    3| HIJ|  86|
+---+-----+----+----+

Whereas using the proper pyspark functions:

df.withColumn("salt", floor(rand() * 100)).show()
+---+-----+----+----+
|KEY|ORDER|RESP|salt|
+---+-----+----+----+
|  1|    1| ABC|  66|
|  1|    2| BCD|  40|
|  1|    3| DEF|  99|
|  2|    1| EFG|  55|
|  2|    2| GHI|  23|
|  2|    3| HIJ|  41|
|  3|    1| EFG|  61|
|  3|    2| BCD|   0|
|  3|    3| HIJ|  33|
+---+-----+----+----+
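To see that the salt actually changes how the rows are partitioned (which is what matters when writing), here is a small check you can run on the toy df above; KEY stands in for your date columns, and the idea is simply to compare row counts per partition with and without the salt:

# Rows per partition when repartitioning by KEY alone: every row with the same
# KEY lands in the same partition, so a skewed KEY means one oversized partition.
unsalted = df.repartition("KEY").rdd.glom().map(len).collect()

# With the salt, rows sharing a KEY are spread over up to 100 (KEY, salt)
# combinations before the salt is dropped again.
salted = (
    df.withColumn("salt", floor(rand() * 100))
      .repartition("KEY", "salt")
      .rdd.glom().map(len).collect()
)

print(max(unsalted), max(salted))  # size of the largest partition in each case

On nine rows the numbers are of course trivial, but the same check on your real DataFrame shows whether a single task still has to materialize a whole day's worth of rows, which is the kind of partition that blows past the ephemeral local storage limit.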
