
Spark streaming job aborted due to stage failure when reading from Kafka topic

I'm new to Spark and Kafka, and I'm using Spark Streaming to process data coming from a Kafka topic. For now I just want to print the records to the console. I have a mini cluster with Spark on two nodes (Scala version 2.12.2 and Spark 2.1.1) and one node with Kafka (version kafka_2.11-0.10.2.0). However, when I submit my code I get this error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 1.3.64.64, executor 1): java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:193)
    at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:185)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Does it have something to do with the versions? Or maybe my code is not correct?

Here is my code:

import java.util.UUID
import org.apache.kafka.clients.consumer.ConsumerRecord
import runtime.ScalaRunTime.stringOf
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe


object followProduction {

  def main(args: Array[String]) = {

    val sparkConf = new SparkConf().setMaster("spark://<real address here : 10. ...>:7077").setAppName("followProcess")
    val streamContext = new StreamingContext(sparkConf, Seconds(2))

    streamContext.checkpoint("checkpoint")

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "1.3.64.66:9094",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> s"${UUID.randomUUID().toString}",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("test")
    val stream = KafkaUtils.createDirectStream[String, String](
      streamContext,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.print()

    //stream.map(record => (record.key, record.value)).count().print()

    streamContext.start()
    streamContext.awaitTermination()
  }
}

And here is my build file:

name := "test"
version := "1.0"
scalaVersion := "2.12.2"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "2.1.1" %"provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "2.1.1" %"provided"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.10" % "2.0.0"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}

Any help would be appreciated, and thank you for your time.

Spark 2.1.x is compiled against Scala 2.11, not 2.12.

Try:

scalaVersion := "2.11.11"

Any 2.11.x version should work.

Also, your Kafka streaming dependency refers to the Scala 2.10 artifact, when you need the 2.11 one (the same applies to your spark-core and spark-streaming dependencies):

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.1"
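
For reference, a minimal sketch of a consistent build.sbt, mirroring the question's file with every Spark artifact on the _2.11 suffix and the same Spark version (2.1.1):

name := "test"
version := "1.0"
scalaVersion := "2.11.11"

// All Spark artifacts must use the same Scala binary suffix (_2.11 here)
// and should share the same Spark version (2.1.1 here).
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.1"

Using the %% operator instead (e.g. "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.1") lets sbt append the suffix matching scalaVersion automatically.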

Apart from the version mismatches, I think you are running on a Spark cluster, which means you need to ship all your JARs (libraries) to the Spark worker nodes from the application where the Spark driver runs.

You can pass those jars through SparkConf using the .setJars(libs) method.

Something like this:

lazy val conf: SparkConf = new SparkConf()
    .setMaster(sparkMaster)
    .setAppName(sparkAppName)
    .set("spark.app.id", sparkAppId)
    .set("spark.submit.deployMode", "cluster")
    .setJars(libs) //setting jars for sparkContext

Note: libs: Seq[String], i.e. a sequence of library paths.
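
For illustration only, a minimal sketch of building libs and wiring it into SparkConf; the jar path and master URL below are placeholders, not values from the question:

import org.apache.spark.SparkConf

// Hypothetical jar path for illustration: point this at the fat jar produced by
// `sbt assembly`, plus any dependency jars that are not marked "provided".
val libs: Seq[String] = Seq("/path/to/target/scala-2.11/test-assembly-1.0.jar")

val conf: SparkConf = new SparkConf()
  .setMaster("spark://10.0.0.1:7077") // placeholder master URL
  .setAppName("followProcess")
  .setJars(libs) // these jars are shipped to the executor nodes

Alternatively, if you launch the job with spark-submit and pass the assembled fat jar on the command line, the jar is distributed to the executors for you.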
