
Spark throwing Out of Memory error

I have a single test node with 8 GB of RAM on which I am loading barely 10 MB of data (from CSV files) into Cassandra (on the same node itself). I'm trying to process this data using Spark (running on the same node).

Please note that for SPARK_MEM I'm allocating 1 GB of RAM, and for SPARK_WORKER_MEMORY I'm allocating the same. Allocating any more memory results in Spark throwing a "Check if all workers are registered and have sufficient memory" error, which is more often than not an indication that Spark is looking for extra memory (as per the SPARK_MEM and SPARK_WORKER_MEMORY properties) and coming up short.
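For context, here is a minimal sketch of how the same 1 GB limit could be expressed through SparkConf instead of the environment variables. The property names assume a Spark 1.x setup with the DataStax spark-cassandra-connector, and the host address is just a placeholder for the single-node setup described above:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: spark.executor.memory plays the role of SPARK_MEM / SPARK_WORKER_MEMORY,
// and 127.0.0.1 stands in for the single test node running both Spark and Cassandra.
val conf = new SparkConf()
  .setAppName("cassandra-load-test")
  .set("spark.executor.memory", "1g")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)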

When I try to load and process all the data in the Cassandra table using the Spark context object, I get an error during processing. So, I'm trying to use a looping mechanism to read chunks of data at a time from one table, process them, and put them in another table.

My source code has the following structure:

// Query and transform one "value" at a time, writing the result back to Cassandra
var data = sc.cassandraTable("keyspacename", "tablename").where("value=?", 1)
data.map(x => tranformFunction(x)).saveToCassandra("keyspacename", "tablename")

for (i <- 2 to 50000) {
    data = sc.cassandraTable("keyspacename", "tablename").where("value=?", i)
    data.map(x => tranformFunction(x)).saveToCassandra("keyspacename", "tablename")
}

Now, this works for a while, for around 200 loops, and then it throws an error: java.lang.OutOfMemoryError: unable to create a new native thread.

I've got two questions:

Is this the right way to deal with data?
How can processing just 10 MB of data do this to a cluster?

You are running a query inside the for loop. If the 'value' column is not a key/indexed column, Spark will load the whole table into memory and then filter on the value. This will certainly cause an OOM.
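A rough sketch of the single-pass alternative this answer implies: scan the table once, filter and transform in Spark, and write the results in one job rather than issuing 50,000 separate queries. The keyspace, table, and tranformFunction names are carried over from the question; reading "value" as an Int and writing to a separate target table are assumptions, not part of the original code.

import com.datastax.spark.connector._

// One full scan instead of one query per value: filter and transform in a single pass
val source = sc.cassandraTable("keyspacename", "tablename")
val transformed = source
  .filter { row =>
    val v = row.getInt("value")      // assumes "value" is an Int column
    v >= 1 && v <= 50000
  }
  .map(x => tranformFunction(x))
transformed.saveToCassandra("keyspacename", "target_tablename")  // hypothetical target table

If only a subset of values is actually needed, a where clause on a properly indexed or clustering column would let the connector push the predicate down to Cassandra instead of filtering the full table in Spark.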
