Spark Scala Serialization Error from RDD map
I have an RDD of the form RDD[((Long, Long), (Long, Long))] and I need to convert it to an RDD[((Long, Long), (Long, Long, Long, Long))], where the second part of each pair in the result is computed by a function of the first RDD's values.
I am trying to achieve this with a map, but I think I am doing something wrong here. Please help me solve this issue.
Here is the complete code:
package com.ranker.correlation.listitem
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import scala.collection.Map
class ListItemCorrelation(sc: SparkContext) extends Serializable {

  def up_down(dirX: Long, dirY: Long): (Long, Long, Long, Long) = {
    if (dirX == 1) {
      if (dirY == 1) (1, 0, 0, 0) else (0, 1, 0, 0)
    } else {
      if (dirY == 1) (0, 0, 1, 0) else (0, 0, 0, 1)
    }
  }

  def run(votes: String): RDD[((Long, Long), (Long, Long, Long, Long))] = {
    val userVotes = sc.textFile(votes)
    val userVotesPairs = userVotes.map { t =>
      val p = t.split(",")
      (p(0).toLong, (p(1).toLong, p(2).toLong))
    }
    val jn = userVotesPairs.join(userVotesPairs).values.filter(t => t._1._1 < t._2._1)
    val first = jn.map(t => ((t._1._1, t._2._1), (t._1._2, t._2._2)))
    val second = first.map(t => ((t._1._1, t._2._1), up_down(t._1._2, t._2._2)))
    //More functionality
    result
  }
}
object ListItemCorrelation extends Serializable {
def main(args: Array[String]) {
val votes = args(0)
val conf = new SparkConf().setAppName("SparkJoins").setMaster("local")
val context = new SparkContext(conf)
val job = new ListItemCorrelation(context)
val results = job.run(votes)
val output = args(1)
results.saveAsTextFile(output)
context.stop()
}
}
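For reference (this is my reading of the parsing code in run, not stated explicitly above), each line of the votes file is expected to be a comma-separated triple of user id, item id, and vote direction, where a direction of 1 presumably means an up-vote:

// Hypothetical sample input for sc.textFile(votes):
// userId,itemId,direction
1,100,1
1,101,0
2,100,1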
When I try to run this code, I get the following error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2094)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:370)
    at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:369)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.map(RDD.scala:369)
    at com.ranker.correlation.listitem.ListItemCorrelation.run(ListItemCorrelation.scala:34)
    at com.ranker.correlation.listitem.ListItemCorrelation$.main(ListItemCorrelation.scala:47)
    at com.ranker.correlation.listitem.ListItemCorrelation.main(ListItemCorrelation.scala)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
    - object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@4248e66b)
    - field (class: com.ranker.correlation.listitem.ListItemCorrelation, name: sc, type: class org.apache.spark.SparkContext)
    - object (class com.ranker.correlation.listitem.ListItemCorrelation, com.ranker.correlation.listitem.ListItemCorrelation@270b6b5e)
    - field (class: com.ranker.correlation.listitem.ListItemCorrelation$$anonfun$4, name: $outer, type: class com.ranker.correlation.listitem.ListItemCorrelation)
    - object (class com.ranker.correlation.listitem.ListItemCorrelation$$anonfun$4, <function1>)
    at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:295)
    ... 12 more
The error occurs when executing the following line:
val second = first.map(t => ((t._1._1, t._2._1), up_down(t._1._2, t._2._2)))
I am new to Scala. Please help me find the right way to do this.
Put the up_down method on the companion object. When any member of a class is accessed inside an RDD closure, the whole class (and everything in it, such as the SparkContext) gets serialized along with the closure. Calling an instance method counts as accessing the class here: the closure captures the instance as its $outer field, which is exactly what the serialization stack above shows. Using a static object resolves this:
package com.ranker.correlation.listitem
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.rdd._
import scala.collection.Map
object ListItemCorrelation {

  def up_down(dirX: Long, dirY: Long): (Long, Long, Long, Long) = {
    if (dirX == 1) {
      if (dirY == 1) (1, 0, 0, 0) else (0, 1, 0, 0)
    } else {
      if (dirY == 1) (0, 0, 1, 0) else (0, 0, 0, 1)
    }
  }

  def main(args: Array[String]): Unit = {
    val votes = args(0)
    val conf = new SparkConf().setAppName("SparkJoins").setMaster("local")
    val context = new SparkContext(conf)
    val job = new ListItemCorrelation(context)
    val results = job.run(votes)
    val output = args(1)
    results.saveAsTextFile(output)
    context.stop()
  }
}
class ListItemCorrelation(sc: SparkContext) extends Serializable {

  def run(votes: String): RDD[((Long, Long), (Long, Long, Long, Long))] = {
    val userVotes = sc.textFile(votes)
    val userVotesPairs = userVotes.map { t =>
      val p = t.split(",")
      (p(0).toLong, (p(1).toLong, p(2).toLong))
    }
    val jn = userVotesPairs.join(userVotesPairs).values.filter(t => t._1._1 < t._2._1)
    val first = jn.map(t => ((t._1._1, t._2._1), (t._1._2, t._2._2)))
    val second = first.map(t => ((t._1._1, t._2._1), ListItemCorrelation.up_down(t._1._2, t._2._2)))
    //More functionality
    result
  }
}
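If moving the method to an object is not an option, another common way around this error (a minimal sketch, not part of the original answer) is to define the function as a local function literal inside run, so the map closure only captures that local value and never references the enclosing class:

  def run(votes: String): RDD[((Long, Long), (Long, Long, Long, Long))] = {
    // Local function literal: it references nothing from the enclosing class,
    // so the map closure captures only `upDown`, not `this` (and therefore
    // not the non-serializable SparkContext field).
    val upDown: (Long, Long) => (Long, Long, Long, Long) = (dirX, dirY) =>
      if (dirX == 1) { if (dirY == 1) (1, 0, 0, 0) else (0, 1, 0, 0) }
      else { if (dirY == 1) (0, 0, 1, 0) else (0, 0, 0, 1) }
    // ... build `first` exactly as in the code above, then:
    val second = first.map(t => ((t._1._1, t._2._1), upDown(t._1._2, t._2._2)))
    //More functionality
    result
  }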