Spark OutOfMemory error on small text data

I am working on implementing an algorithm and testing it on medium-sized data in Spark (the Scala interface) on a local node. I am starting with very simple processing, and I'm getting java.lang.OutOfMemoryError: Java heap space even though I'm pretty sure the data isn't big enough for such an error to be reasonable. Here is the minimal breaking code:

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkContext, SparkConf}

val conf = new SparkConf()
  .setMaster("local[4]")
  .setAppName("AdultProcessing")
  .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)

val dataFile = "data/census/processed/census-income.data"
val censusData: RDD[String] = sc.textFile(dataFile, 4)
val censusDataPreprocessed = censusData.map { row =>
  val separators: Array[Char] = ":,".toCharArray
  row.split(separators)
}

val res = censusDataPreprocessed.collect()

The data I'm using is the classic census data, uncompressed. It's 100MB and almost 200k rows. The amount of memory on my machine should be more than sufficient:

nietaki@xebab$ free -tm
             total       used       free     shared    buffers     cached
Mem:         15495      12565       2929          0        645       5608
-/+ buffers/cache:       6311       9183
Swap:         3858          0       3858
Total:       19354      12566       6788

The chunks of the data file are under 30MB for each of the virtual nodes, and the only processing I'm performing is splitting row strings into arrays of under 50 items. I can't believe this operation alone should use up the memory.

While trying to debug the situation I have found that reducing the number of nodes to 1, or, alternatively, increasing the minPartitions argument of SparkContext.textFile() from 4 to 8, for example, cures the situation, but it doesn't make me any wiser.
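For reference, this is the variant that runs without the error (a minimal sketch; it reuses sc and dataFile from the snippet above and only changes the partition count):

// same setup as above, only the minPartitions argument differs
val censusData8: RDD[String] = sc.textFile(dataFile, 8) // 8 partitions instead of 4
val res8 = censusData8
  .map(row => row.split(":,".toCharArray))
  .collect()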

I'm using Spark 1.0.0 and Scala 2.10.4. I am launching the project directly from sbt: sbt run -Xmx2g -Xms2g.

The JVM is memory-hungry. Spark runs on the JVM.

I'd recommend inspecting the heap with a profiler to find out the actual memory used by your records. In my case they were 2x their size "at rest", and they were a combination of primitive types and Strings.
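If a full profiler is not at hand, a crude in-JVM measurement already gives an order of magnitude (a sketch; the file path is the one from your question, the 50,000-line sample size is arbitrary, and System.gc() is only a hint, so treat the number as indicative):

// rough heap delta before/after materializing a sample of split rows
val rt = Runtime.getRuntime
def usedMB(): Long = { System.gc(); (rt.totalMemory - rt.freeMemory) / (1024 * 1024) }

val before = usedMB()
val sample = scala.io.Source.fromFile("data/census/processed/census-income.data")
  .getLines().take(50000)
  .map(_.split(":,".toCharArray))
  .toArray
val after = usedMB()
println(s"~${after - before} MB retained for ${sample.length} split rows")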

In your case, Strings are particular memory-eaters. "" (the empty string) is ~40 bytes long; longer strings offset the cost of the structure. See [1].

Applying the table in the previous resource to your data:

line: String = 73, Not in universe, 0, 0, High school graduate, 0, Not in universe, Widowed, Not in universe or children, Not in universe, White, All other, Female, Not in universe, Not in universe, Not in labor force, 0, 0, 0, Nonfiler, Not in universe, Not in universe, Other Rel 18+ ever marr not in subfamily, Other relative of householder, 1700.09, ?, ?, ?, Not in universe under 1 year old, ?, 0, Not in universe, United-States, United-States, United-States, Native- Born in the United States, 0, Not in universe, 2, 0, 95, - 50000.

line.size
// Int = 523
def size(s: String) = s.size / 4 * 8 + 40
line.split(",.").map(w => size(w)).sum
// Int = 2432
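Scaling that single-line estimate up to the whole file gives the figure used below (a rough sketch; the ~200k line count comes from the question, and per-Array object overhead is ignored):

// aggregate the per-line cost model over the whole file
val bytesPerLine = 2432L            // from the sample line above
val lines = 200000L                 // ~200k rows, per the question
val totalMB = bytesPerLine * lines / (1024 * 1024)
// totalMB: Long = 463  -- i.e. roughly 500MB of Strings alone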

So, thanks to all those small strings, your data in memory is roughly 5x its size "at rest". Still, 200k lines of that data adds up to roughly 500MB. This might indicate that your executor is operating at the default value of 512MB. Try setting 'spark.executor.memory' to a higher value, but also consider a heap size > 8GB to comfortably work with Spark.
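In the setup from the question (local mode launched through sbt run), the -Xmx flags on the sbt command line are passed to the application as program arguments rather than to the JVM, so one way to actually enlarge the heap is to fork the run and set JVM options in the build (a sketch, assuming sbt 0.13-style syntax):

// build.sbt
// In local[4] mode the whole application runs inside the single driver JVM,
// so it is that JVM's -Xmx that matters; spark.executor.memory on its own
// likely won't change the heap available in local mode.
fork in run := true                              // run the app in its own JVM, not inside sbt's
javaOptions in run ++= Seq("-Xms4g", "-Xmx4g")   // heap for that JVM (sizes are just an example)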
