[英]Apache Spark - Simple Word Count gets: SparkException: Task not serializable
i'm trying to do some tests on Apache Spark (v1.3.0), i have a simple Java 8 class: 我正在尝试在Apache Spark(v1.3.0)上进行一些测试,我有一个简单的Java 8类:
public class WordCount {
private JavaSparkContext ctx;
private String inputFile, outputFile;
public WordCount(String inputFile, String outputFile) {
this.inputFile = inputFile;
this.outputFile = outputFile;
// Initialize Spark Conf
ctx = new JavaSparkContext("local", "WordCount",
System.getenv("SPARK_HOME"), System.getenv("JARS"));
}
public static void main(String... args) {
String inputFile = "/home/workspace/spark/src/main/resources/inferno.txt";//args[0];
String outputFile = "/home/workspace/spark/src/main/resources/dv";//args[1];
WordCount wc = new WordCount(inputFile, outputFile);
wc.doWordCount();
wc.close();
}
public void doWordCount() {
long start = System.currentTimeMillis();
JavaRDD<String> inputRdd = ctx.textFile(inputFile);
JavaPairRDD<String, Integer> count = inputRdd.flatMapToPair((String s) -> {
List<Tuple2<String, Integer>> list = new ArrayList<>();
Arrays.asList(s.split(" ")).forEach(s1 -> list.add(new Tuple2<String, Integer>(s1, 1)));
return list;
}).reduceByKey((x, y) -> x + y);
List<Tuple2<String, Integer>> list = count.takeOrdered(10,
(o1, o2) -> o2._2() - o1._2());
list.forEach(t2 -> System.out.println(t2._1()));
// count.saveAsTextFile(outputFile);
long end = System.currentTimeMillis();
System.out.println(String.format("Time in ms is: %d", end - start));
}
public void close() {
ctx.stop();
}
}
When i run it get the following exception: 当我运行它时,出现以下异常:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1622)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:635)
at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1231)
at org.apache.spark.api.java.JavaRDDLike$class.takeOrdered(JavaRDDLike.scala:578)
at org.apache.spark.api.java.JavaPairRDD.takeOrdered(JavaPairRDD.scala:45)
at it.conker.spark.base.WordCount.doWordCount2(WordCount.java:65)
at it.conker.spark.base.WordCount.main(WordCount.java:37)
Caused by: java.io.NotSerializableException: it.conker.spark.base.WordCount$$Lambda$10/1541232265
Serialization stack:
- object not serializable (class: it.conker.spark.base.WordCount$$Lambda$10/1541232265, value: it.conker.spark.base.WordCount$$Lambda$10/1541232265@213860b8)
- field (class: scala.math.LowPriorityOrderingImplicits$$anon$7, name: cmp$2, type: interface java.util.Comparator)
- object (class scala.math.LowPriorityOrderingImplicits$$anon$7, scala.math.LowPriorityOrderingImplicits$$anon$7@511505e7)
- field (class: org.apache.spark.rdd.RDD$$anonfun$34, name: ord$1, type: interface scala.math.Ordering)
- object (class org.apache.spark.rdd.RDD$$anonfun$34, <function1>)
- field (class: org.apache.spark.rdd.RDD$$anonfun$14, name: f$3, type: interface scala.Function1)
- object (class org.apache.spark.rdd.RDD$$anonfun$14, <function3>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:38)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 8 more
This is mi pom file: 这是mi pom文件:
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<groupId>it.conker.spark</groupId>
<artifactId>learning-spark-by-example</artifactId>
<modelVersion>4.0.0</modelVersion>
<name>Learning Spark by example</name>
<packaging>jar</packaging>
<version>0.0.1</version>
<dependencies>
<dependency> <!-- Spark dependency -->
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.3.0</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<version>4.11</version>
<scope>test</scope>
</dependency>
</dependencies>
<properties>
<java.version>1.8</java.version>
</properties>
<build>
<pluginManagement>
<plugins>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.1</version>
<configuration>
<source>${java.version}</source>
<target>${java.version}</target>
</configuration>
</plugin>
</plugins>
</pluginManagement>
</build>
</project>
I run the class within eclipse. 我在Eclipse中运行课程。 Can someone tell me where i was wrong?
有人可以告诉我我错了吗?
Edit: As mark91 ask, replacing the the line: 编辑:按mark91的要求,替换行:
List<Tuple2<String, Integer>> list = count.takeOrdered(10,
(o1, o2) -> o2._2() - o1._2());
with: 与:
List<Tuple2<String, Integer>> list = count.takeOrdered(10);
I have got this exception: 我有这个例外:
java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.lang.Comparable
at org.spark-project.guava.collect.NaturalOrdering.compare(NaturalOrdering.java:28)
at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
at org.spark-project.guava.collect.Ordering.leastOf(Ordering.java:672)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at org.apache.spark.rdd.RDD$$anonfun$34.apply(RDD.scala:1234)
at org.apache.spark.rdd.RDD$$anonfun$34.apply(RDD.scala:1231)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/27 17:58:55 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.lang.Comparable
at org.spark-project.guava.collect.NaturalOrdering.compare(NaturalOrdering.java:28)
at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
at org.spark-project.guava.collect.Ordering.leastOf(Ordering.java:672)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at org.apache.spark.rdd.RDD$$anonfun$34.apply(RDD.scala:1234)
at org.apache.spark.rdd.RDD$$anonfun$34.apply(RDD.scala:1231)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
15/03/27 17:58:55 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
15/03/27 17:58:55 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/03/27 17:58:55 INFO TaskSchedulerImpl: Cancelling stage 1
15/03/27 17:58:56 INFO DAGScheduler: Job 0 failed: takeOrdered at WordCount.java:66, took 14.117721 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.lang.ClassCastException: scala.Tuple2 cannot be cast to java.lang.Comparable
at org.spark-project.guava.collect.NaturalOrdering.compare(NaturalOrdering.java:28)
at scala.math.LowPriorityOrderingImplicits$$anon$7.compare(Ordering.scala:153)
at org.apache.spark.util.collection.Utils$$anon$1.compare(Utils.scala:35)
at org.spark-project.guava.collect.Ordering.leastOf(Ordering.java:672)
at org.apache.spark.util.collection.Utils$.takeOrdered(Utils.scala:37)
at org.apache.spark.rdd.RDD$$anonfun$34.apply(RDD.scala:1234)
at org.apache.spark.rdd.RDD$$anonfun$34.apply(RDD.scala:1231)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.RDD$$anonfun$14.apply(RDD.scala:634)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:64)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:203)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1203)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1191)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1191)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Ok, the reason is that all classes you use in your precessing (ie objects stored in your RDD and classes which are Functions to be passed to spark) need to be Serializable
. 好的,原因是您在进动中使用的所有类(即,存储在RDD中的对象和作为传递给spark的Function的类)都必须可
Serializable
。 This means that they need to implement the Serializable
interface or you have to provide another way to serialize them as Kryo. 这意味着他们需要实现
Serializable
接口,或者您必须提供另一种方式将其序列化为Kryo。 Actually I don't know why the lambda function you are defining is not considered serializable, but I believe that this is due to the fact that a Java Comparator
takes a parameter. 实际上,我不知道为什么您定义的lambda函数不被视为可序列化的,但是我相信这是由于Java
Comparator
接受了一个参数。
However, the way to make it working is defining a Serializable
comaprator such as : 但是,使其工作的方法是定义一个可
Serializable
协同映射器,例如:
public class WordCountComparator implements Comparator<Tuple2<String, Integer>>, Serializable {
@Override
public int compare(Tuple2<String, Integer> o1, Tuple2<String, Integer> o2) {
// TODO Auto-generated method stub
return o2._2()-o1._2();
}
}
and then passing an instance of it to your takeOrdered
function as second parameter. 然后将其实例作为第二个参数传递给
takeOrdered
函数。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.