
Spark configuration for Out of memory error

Cluster setup -

Driver has 28gb
Workers have 56gb each (8 workers)

Configuration -

spark.memory.offHeap.enabled true
spark.driver.memory 20g
spark.memory.offHeap.size 16gb
spark.executor.memory 40g
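
For reference, one common way to supply settings like these is on the spark-submit command line (the class and jar names below are just placeholders):

spark-submit \
  --conf spark.memory.offHeap.enabled=true \
  --conf spark.memory.offHeap.size=16g \
  --conf spark.driver.memory=20g \
  --conf spark.executor.memory=40g \
  --class MyJob my-job.jar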

My job -

// myFunc just takes a string s and does some transformations on it; they are very small strings, but there are about 10 million of them to process.


// Out of memory failure
data.map(s => myFunc(s)).saveAsTextFile(outFile)

// works fine
data.map(s => myFunc(s))

Also, I de-clustered / removed Spark from my program, and it completed just fine (successfully saved to a file) on a single server with 56gb of RAM. This shows that it is just a Spark configuration issue. I reviewed https://spark.apache.org/docs/latest/configuration.html#memory-management and the configurations I currently have seem to be all that should need changing for my job to work. What else should I be changing?

Update -

Data -

import java.io.{BufferedInputStream, BufferedReader, File, FileInputStream, InputStreamReader}
import org.apache.commons.compress.compressors.{CompressorInputStream, CompressorStreamFactory}
import scala.collection.JavaConverters._

// Decompress the file and read every line into driver memory before parallelizing
val fis: FileInputStream = new FileInputStream(new File(inputFile))
val bis: BufferedInputStream = new BufferedInputStream(fis)
val input: CompressorInputStream = new CompressorStreamFactory().createCompressorInputStream(bis)
val br: BufferedReader = new BufferedReader(new InputStreamReader(input))
val stringArray = br.lines().iterator().asScala.toArray // Array[String], held entirely on the driver
val data = sc.parallelize(stringArray)

Note - this does not cause any memory issues, even though it is incredibly inefficient. I can't read the file using Spark directly because it throws some EOF errors.

As for myFunc, I can't really post the code because it's complex. But basically, the input string is a delimited string; it does some delimiter replacement, date/time normalization, and things like that. The output string will be roughly the same size as the input string.
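
To give a sense of its shape, here is a purely hypothetical stand-in for myFunc (the '|' delimiter and the date pattern are assumptions, not the real code):

import java.time.LocalDate
import java.time.format.DateTimeFormatter

// Hypothetical: swap '|' delimiters for commas and normalize one date field to ISO-8601
def myFunc(s: String): String = {
  val inFormat = DateTimeFormatter.ofPattern("MM/dd/yyyy")
  val fields = s.split('|')
  fields(0) = LocalDate.parse(fields(0), inFormat).toString // e.g. 01/31/2017 -> 2017-01-31
  fields.mkString(",")
}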

Also, it works fine for smaller data sizes, and the output is correct and roughly the same size as the input data file, as it should be.

It would help if you put more details of what is going on in your program before and after the map. The second command (map only) does not do anything unless an action is triggered. Your file is probably not partitioned, so the driver is doing the work. The code below should spread the data evenly across the workers and protect against OOM on a single node. It will cause shuffling of the data, though.

Updating the solution after looking at your code; it will be better if you do this:

val data = sc.parallelize(stringArray).repartition(8)
data.map(s => myFunc(s)).saveAsTextFile(outFile)
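
As noted above, map by itself is lazy; a quick way to confirm the transformation runs without writing any output is to trigger a different action:

// count() is an action, so it forces the otherwise-lazy map to execute
data.map(s => myFunc(s)).count()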

Your current solution does not take advantage of Spark. You are loading the entire file into an array in memory, then using sc.parallelize to distribute it into an RDD. This is hugely wasteful of memory (even without Spark) and will of course cause out-of-memory problems for large files.

Instead, use sc.textFile(filePath) to create your RDD. Spark is then able to read and process the file in chunks, so only a small portion of it needs to be in memory at a time. You are also able to take advantage of parallelism this way, as Spark will read and process the file in parallel across however many executors and cores you have, instead of reading the entire file on a single thread on a single machine.
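
A minimal sketch of that approach, assuming inputFile lives on storage visible to all executors:

// Spark reads the file lazily and partitions it across executors
val data = sc.textFile(inputFile)
data.map(s => myFunc(s)).saveAsTextFile(outFile)

One caveat: non-splittable codecs such as gzip are read as a single partition, so a repartition after sc.textFile may still be needed to get parallelism.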

Assuming that myFunc looks at only a single line at a time, this program should have a very small memory footprint.
