
Spark Out of Memory exception

I receive a 10 GB file every day containing employee details. I need to select the latest records from the previous day's and the current day's files. For example, the 6th August and 7th August files need to be compared on the time-Stamp column and the latest record selected for each employee:

  • 6th August File

    emp-id  name   dept        phone-No   time-Stamp
    1       Jhon   Sales       817234518  12-6-2019
    2       Marry  Production  927234565  4-3-2019
    3       James  Marketing   625234522  21-1-2019
  • 7th August File

    emp-id  name   dept        phone-No   time-Stamp
    1       Jhon   Sales       817234518  12-7-2019
    4       Jerry  Sales       653214442  12-7-2019
    3       James  Marketing   625234522  2-6-2019
  • Expected output

    emp-id  name   dept        phone-No   time-Stamp
    1       Jhon   Sales       817234518  12-7-2019
    2       Marry  Production  927234565  4-3-2019
    3       James  Marketing   625234522  2-6-2019
    4       Jerry  Sales       653214442  12-7-2019
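For context, here is a minimal sketch of how the two DataFrames referenced in the solution below might be created. The file paths, CSV format, and header option are assumptions for illustration; the question does not specify how the daily files are stored.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("latest-employee-records").getOrCreate()

// Hypothetical paths and format: CSV with a header row is assumed here
val previousDayDF = spark.read.option("header", "true").csv("/data/employees/2019-08-06.csv")
val currentDayDF  = spark.read.option("header", "true").csv("/data/employees/2019-08-07.csv")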

I tried the solution below and got the expected result.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// Combine both days' records into one DataFrame
val mergedDF = currentDayDF.union(previousDayDF)
mergedDF.show(false)

// Rank each employee's records by timestamp, newest first,
// and keep only the most recent record per emp-id
val windowSpec = Window.partitionBy("emp-id").orderBy(col("time-Stamp").desc)
val latestForEachKey = mergedDF.withColumn("rank", rank().over(windowSpec))
                               .filter(col("rank") === 1)
                               .drop("rank")
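One caveat worth noting: rank() keeps every record that ties for the newest timestamp, so an employee with two records bearing the same time-Stamp would appear twice in the output. If exactly one row per emp-id is required, row_number() is a drop-in alternative; a sketch reusing the same window:

import org.apache.spark.sql.functions.row_number

// row_number() assigns a unique 1..N within each partition, so exactly
// one row survives the filter even when timestamps tie
val onePerKey = mergedDF.withColumn("rn", row_number().over(windowSpec))
                        .filter(col("rn") === 1)
                        .drop("rn")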

Questions

  1. Each day's input file is 10 GB. If the cluster memory (total executor memory) is less than the 20 GB needed to load both data sets (previous day and current day), will Spark throw an Out of Memory exception?

    I thought Spark divides a large file into partitions for processing, so at any point only a few partitions are loaded into executor memory; transformations are applied, intermediate results are saved to secondary storage, and processing continues with the remaining partitions. But since partitionBy is a wide transformation that requires all partitions of the data, I suspect my guess is wrong. So, will Spark throw an OOM exception?

Partitions are used for parallel execution. Spark will try to load all 20 GB of data simultaneously across the available partitions. If the combined memory of all executors where the partitions are created is less than 20 GB, it will throw an out-of-memory error.
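If memory is in fact the constraint, one common mitigation is to let Spark spill the merged data to disk rather than hold it entirely in executor memory, and to shrink the per-task memory footprint by using more shuffle partitions. A minimal sketch using the standard persist and configuration APIs follows; whether this avoids OOM for a given job depends on the shuffle behaviour and cluster sizing, so treat it as a starting point rather than a guarantee.

import org.apache.spark.storage.StorageLevel

// MEMORY_AND_DISK keeps partitions in memory when they fit and spills
// the rest to local disk instead of failing outright
mergedDF.persist(StorageLevel.MEMORY_AND_DISK)

// More shuffle partitions mean smaller partitions, reducing peak
// memory per task (the default is 200)
spark.conf.set("spark.sql.shuffle.partitions", "400")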
