
Spark configuration for Out of memory error

Cluster setup -

Driver has 28gb
Workers have 56gb each (8 workers)

Configuration -

spark.memory.offHeap.enabled true
spark.driver.memory 20g
spark.memory.offHeap.size 16gb
spark.executor.memory 40g
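
(For reference, a sketch of how the same settings could be applied programmatically when building the context; this is illustrative only, and spark.driver.memory is omitted because it must be set before the driver JVM starts, i.e. via spark-submit or spark-defaults.conf rather than from application code.)

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative only: the settings listed above, applied via SparkConf.
val conf = new SparkConf()
  .setAppName("myJob")                              // hypothetical app name
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "16gb")
  .set("spark.executor.memory", "40g")
val sc = new SparkContext(conf)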

My job -

//myFunc just takes a string s and does some transformations on it; the strings are very small, but there are about 10 million of them to process.


//Out of memory failure
data.map(s => myFunc(s)).saveAsTextFile(outFile)

//works fine
data.map(s => myFunc(s))

Also, I de-clustered / removed Spark from my program and it completed just fine (successfully saved to a file) on a single server with 56 GB of RAM. This shows that it is just a Spark configuration issue. I reviewed https://spark.apache.org/docs/latest/configuration.html#memory-management and the configurations I currently have seem to be all that should need to change for my job to work. What else should I be changing?

Update -

Data -

import java.io.{BufferedInputStream, BufferedReader, File, FileInputStream, InputStreamReader}
import org.apache.commons.compress.compressors.{CompressorInputStream, CompressorStreamFactory}

// Decompress the input file and read every line into an in-memory array,
// then hand that array to Spark.
val fis: FileInputStream = new FileInputStream(new File(inputFile))
val bis: BufferedInputStream = new BufferedInputStream(fis)
val input: CompressorInputStream = new CompressorStreamFactory().createCompressorInputStream(bis)
val br = new BufferedReader(new InputStreamReader(input))
val stringArray = br.lines().toArray()
val data = sc.parallelize(stringArray)

Note - this does not cause any memory issues, even though it is incredibly inefficient. I can't read from it using Spark because it throws some EOF errors.

myFunc, I can't really post the code for it because it's complex. But basically, the input string is a delimited string; it does some delimiter replacement, date/time normalizing and things like that. The output string will be roughly the same size as the input string.
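
(A hypothetical sketch of the kind of per-line transformation described above; the real myFunc is not shown in the question, and the delimiters and date format here are made up for illustration.)

// Hypothetical stand-in for the described work: delimiter replacement and
// date/time normalization on one small delimited string at a time.
def myFunc(line: String): String = {
  val fields = line.split('|')                      // assumed input delimiter
  val normalized = fields.map { f =>
    // e.g. rewrite "dd/MM/yyyy" dates to ISO "yyyy-MM-dd" (illustrative only)
    if (f.matches("""\d{2}/\d{2}/\d{4}""")) {
      val Array(d, m, y) = f.split('/')
      s"$y-$m-$d"
    } else f
  }
  normalized.mkString(",")                          // assumed output delimiter
}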

Also, it works fine for smaller data sizes, and the output is correct and roughly the same size as the input data file, as it should be.

It would help if you put more details of what is going on in your program before and after the map. The second command (only the map) does not do anything unless an action is triggered. Your file is probably not partitioned and the driver is doing the work. The code below should force the data to the workers evenly and protect against OOM on a single node. It will cause shuffling of the data, though.
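
(To make the lazy-evaluation point concrete, a minimal sketch reusing data, myFunc and outFile from the question:)

// Transformations are lazy: this only builds an execution plan, no work runs yet.
val mapped = data.map(s => myFunc(s))

// Only an action triggers execution of myFunc on the cluster, for example:
mapped.count()                    // forces execution, discards the results
mapped.saveAsTextFile(outFile)    // forces execution and writes the output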

Updating the solution after looking at your code; it will be better if you do this:

val data = sc.parallelize(stringArray).repartition(8)
data.map(s => myFunc(s)).saveAsTextFile(outFile)

Your current solution does not take advantage of Spark. You are loading the entire file into an array in memory, then using sc.parallelize to distribute it into an RDD. This is hugely wasteful of memory (even without Spark) and will of course cause out of memory problems for large files.

Instead, use sc.textFile(filePath) to create your RDD. Then Spark is able to smartly read and process the file in chunks, so only a small portion of it needs to be in memory at a time. You are also able to take advantage of parallelism this way, as Spark will be able to read and process the file in parallel, with however many executors and cores you have, instead of needing to read the entire file on a single thread on a single machine.

Assuming that myFunc can look at only a single line at a time, then this program should have a very small memory footprint.
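
(A minimal sketch of that suggestion, assuming the input is in a compression format Spark can read natively, e.g. gzip or bzip2, and reusing inputFile, myFunc and outFile from the question:)

// Read lazily, already partitioned across executors, instead of collecting
// every line into a local array first.
val data = sc.textFile(inputFile)

// Transform each line and write the result; only a small portion of the file
// needs to be in memory on any one executor at a time.
data.map(s => myFunc(s)).saveAsTextFile(outFile)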
