
Spark OutOfMemory error on small text data

I am working on implementing an algorithm and testing it on medium-sized data in Spark (the Scala interface) on a local node. I am starting with very simple processing and I'm getting java.lang.OutOfMemoryError: Java heap space even though I'm pretty sure the data isn't big enough for such an error to be reasonable. Here is the minimal breaking code:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf()
  .setMaster("local[4]")                // local mode with 4 worker threads
  .setAppName("AdultProcessing")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

val dataFile = "data/census/processed/census-income.data"
val censusData: RDD[String] = sc.textFile(dataFile, 4)   // 4 partitions
val censusDataPreprocessed = censusData.map { row =>
  val separators: Array[Char] = ":,".toCharArray
  row.split(separators)                 // split each row on ':' and ','
}

val res = censusDataPreprocessed.collect()   // pull all parsed rows back to the driver

The data I'm using is the classic census data, uncompressed. It's 100MB and almost 200k rows. The amount of memory on my machine should be more than sufficient:

nietaki@xebab$ free -tm
             total       used       free     shared    buffers     cached
Mem:         15495      12565       2929          0        645       5608
-/+ buffers/cache:       6311       9183
Swap:         3858          0       3858
Total:       19354      12566       6788

The chunks of the data file are under 30MB for each of the virtual nodes, and the only processing I'm performing is splitting the row strings into arrays of fewer than 50 items. I can't believe this operation alone could use up the memory.

While trying to debug the situation I have found that reducing the number of nodes to 1, or alternatively increasing SparkContext.textFile()'s minPartitions argument from 4 to 8, for example, cures the situation, but it doesn't make me any wiser.
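
For example, this is the textFile change that makes the job complete (everything else left exactly as in the snippet above):

val censusData: RDD[String] = sc.textFile(dataFile, 8)   // 8 smaller partitions instead of 4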

I'm using Spark 1.0.0 and Scala 2.10.4. I am launching the project directly from sbt: sbt run -Xmx2g -Xms2g.

The JVM is memory hungry. Spark runs on the JVM.

I'd recommend inspecting the heap with a profiler to find out the actual memory used by your records. In my case they were 2x their size "at rest", and they were a combination of primitive types and Strings.
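
If a full profiler feels heavyweight, a rough sketch using only the JVM's own Runtime counters can give a ballpark figure. The file path and the ":,"-split below come from your code; the 20k-line sample size is an arbitrary choice, and GC timing makes the result approximate:

val rt = Runtime.getRuntime
def usedMB(): Long = { System.gc(); (rt.totalMemory - rt.freeMemory) / (1024 * 1024) }

val before = usedMB()
// parse a sample of the file locally, the same way the Spark job does
val rows: Array[Array[String]] =
  scala.io.Source.fromFile("data/census/processed/census-income.data")
    .getLines().take(20000).map(_.split(":,".toCharArray)).toArray
val after = usedMB()
println(s"~${after - before} MB for ${rows.length} parsed rows")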

In your case, Strings are particular memory-eaters: "" (the empty string) already occupies ~40 bytes on the heap, and longer strings only amortize the cost of that structure. See [1].

Applying the table in the previous resource to your data:

line: String = 73, Not in universe, 0, 0, High school graduate, 0, Not in universe, Widowed, Not in universe or children, Not in universe, White, All other, Female, Not in universe, Not in universe, Not in labor force, 0, 0, 0, Nonfiler, Not in universe, Not in universe, Other Rel 18+ ever marr not in subfamily, Other relative of householder, 1700.09, ?, ?, ?, Not in universe under 1 year old, ?, 0, Not in universe, United-States, United-States, United-States, Native- Born in the United States, 0, Not in universe, 2, 0, 95, - 50000.

line.size                                    // Int = 523

def size(s: String) = s.size / 4 * 8 + 40    // ~2 bytes per char + ~40 bytes of String overhead

line.split(",.").map(w => size(w)).sum       // Int = 2432

So, thanks to all those small strings, your data in memory is roughly 5x its size 'at rest'. Still, 200k lines of that data adds up to only roughly 500MB. This might indicate that your executor is operating at the default value of 512MB. Try setting 'spark.executor.memory' to a higher value, but also consider a heap size >8GB to comfortably work with Spark.
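
As a concrete sketch (the sizes below are placeholders to tune for your machine), the SparkConf from your snippet could become:

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("AdultProcessing")
  .set("spark.executor.memory", "4g")   // anything well above the 512MB default
val sc = new SparkContext(conf)

And since arguments placed after sbt run are handed to your application rather than to the JVM, the heap itself is easier to raise by forking the run in build.sbt (sbt 0.13 syntax):

fork in run := true
javaOptions in run ++= Seq("-Xmx8g", "-Xms8g")   // in line with the ">8GB" suggestion above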
