
Spark throwing Out of Memory error

I have a single test node with 8 GB of RAM on which I am loading barely 10 MB of data (from CSV files) into Cassandra (on the same node itself). I'm trying to process this data using Spark (running on the same node).

Please note that for SPARK_MEM I'm allocating 1 GB of RAM, and for SPARK_WORKER_MEMORY I'm allocating the same. Allocating any more memory results in Spark throwing a "Check if all workers are registered and have sufficient memory" error, which is more often than not an indication that Spark is looking for extra memory (as per the SPARK_MEM and SPARK_WORKER_MEMORY properties) and coming up short.
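For context, here is a minimal sketch of how the same 1 GB limit could be expressed through SparkConf instead of the environment variables. The property names assume a Spark 1.x setup with the DataStax spark-cassandra-connector, and the host address is just a placeholder for the single-node setup described above:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: spark.executor.memory plays the role of SPARK_MEM / SPARK_WORKER_MEMORY,
// and 127.0.0.1 stands in for the single test node running both Spark and Cassandra.
val conf = new SparkConf()
  .setAppName("cassandra-load-test")
  .set("spark.executor.memory", "1g")
  .set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(conf)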

When I try to load and process all the data in the Cassandra table using the Spark context object, I get an error during processing. So, I'm trying to use a looping mechanism to read chunks of data at a time from one table, process them, and put them in another table.

My source code has the following structure:

// Query and transform one "value" at a time, writing the result back to Cassandra
var data = sc.cassandraTable("keyspacename", "tablename").where("value=?", 1)
data.map(x => tranformFunction(x)).saveToCassandra("keyspacename", "tablename")

for (i <- 2 to 50000) {
    data = sc.cassandraTable("keyspacename", "tablename").where("value=?", i)
    data.map(x => tranformFunction(x)).saveToCassandra("keyspacename", "tablename")
}

Now, this works for a while, for around 200 loops, and then it throws an error: java.lang.OutOfMemoryError: unable to create a new native thread.

I've got two questions:

Is this the right way to deal with data?
How can processing just 10 MB of data do this to a cluster?

You are running a query inside the for loop. If the 'value' column is not a key/indexed column, Spark will load the whole table into memory and then filter on the value. This will certainly cause an OOM.
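A rough sketch of the single-pass alternative this answer implies: scan the table once, filter and transform in Spark, and write the results in one job rather than issuing 50,000 separate queries. The keyspace, table, and tranformFunction names are carried over from the question; reading "value" as an Int and writing to a separate target table are assumptions, not part of the original code.

import com.datastax.spark.connector._

// One full scan instead of one query per value: filter and transform in a single pass
val source = sc.cassandraTable("keyspacename", "tablename")
val transformed = source
  .filter { row =>
    val v = row.getInt("value")      // assumes "value" is an Int column
    v >= 1 && v <= 50000
  }
  .map(x => tranformFunction(x))
transformed.saveToCassandra("keyspacename", "target_tablename")  // hypothetical target table

If only a subset of values is actually needed, a where clause on a properly indexed or clustering column would let the connector push the predicate down to Cassandra instead of filtering the full table in Spark.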
