
Spark streaming job aborted due to stage failure when reading from Kafka topic

I'm new to Spark and Kafka, and I'm using Spark Streaming to process data coming from a Kafka topic. For now I just want to print the records to the console. I have a mini cluster with Spark on two nodes (Scala version 2.12.2 and Spark 2.1.1) and one node with Kafka (version kafka_2.11-0.10.2.0). However, when I submit my code I get this error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 1.3.64.64, executor 1): java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:193)
    at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:185)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Does it have something to do with the versions? Or maybe my code is not correct?

Here is my code:

import java.util.UUID
import org.apache.kafka.clients.consumer.ConsumerRecord
import runtime.ScalaRunTime.stringOf
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe


object followProduction {

  def main(args: Array[String]) = {

    val sparkConf = new SparkConf().setMaster("spark://<real address here : 10. ...>:7077").setAppName("followProcess")
    val streamContext = new StreamingContext(sparkConf, Seconds(2))

    streamContext.checkpoint("checkpoint")

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "1.3.64.66:9094",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> s"${UUID.randomUUID().toString}",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("test")
    val stream = KafkaUtils.createDirectStream[String, String](
      streamContext,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.print()

    //stream.map(record => (record.key, record.value)).count().print()

    streamContext.start()
    streamContext.awaitTermination()
  }
}

And here is my build file:

name := "test"
version := "1.0"
scalaVersion := "2.12.2"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "2.1.1" %"provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "2.1.1" %"provided"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.10" % "2.0.0"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}

Any help would be appreciated, and thank you for your time.

Spark 2.1.x is compiled against Scala 2.11, not 2.12.

Try:

scalaVersion := "2.11.11"

Any 2.11.x version should work.

Also, your Kafka streaming dependency refers to the Scala 2.10 artifact, when you need the 2.11 one (the same applies to your spark-core and spark-streaming dependencies):

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.1"
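
For reference, a minimal sketch of a consistent build.sbt, mirroring the question's file with every Spark artifact on the _2.11 suffix and the same Spark version (2.1.1):

name := "test"
version := "1.0"
scalaVersion := "2.11.11"

// All Spark artifacts must use the same Scala binary suffix (_2.11 here)
// and should share the same Spark version (2.1.1 here).
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.1.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.1.1" % "provided"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.1"

Using the %% operator instead (e.g. "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.1.1") lets sbt append the suffix matching scalaVersion automatically.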

Apart from the version mismatches, I think you are running on a Spark cluster, which means you need to ship all your JARs (libraries) to the Spark worker nodes from the application where the Spark driver runs.

You can pass those jars through SparkConf using the .setJars(libs) method.

Something like this:

lazy val conf: SparkConf = new SparkConf()
    .setMaster(sparkMaster)
    .setAppName(sparkAppName)
    .set("spark.app.id", sparkAppId)
    .set("spark.submit.deployMode", "cluster")
    .setJars(libs) //setting jars for sparkContext

Note: libs: Seq[String], i.e. a sequence of library paths.
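
For illustration only, a minimal sketch of building libs and wiring it into SparkConf; the jar path and master URL below are placeholders, not values from the question:

import org.apache.spark.SparkConf

// Hypothetical jar path for illustration: point this at the fat jar produced by
// `sbt assembly`, plus any dependency jars that are not marked "provided".
val libs: Seq[String] = Seq("/path/to/target/scala-2.11/test-assembly-1.0.jar")

val conf: SparkConf = new SparkConf()
  .setMaster("spark://10.0.0.1:7077") // placeholder master URL
  .setAppName("followProcess")
  .setJars(libs) // these jars are shipped to the executor nodes

Alternatively, if you launch the job with spark-submit and pass the assembled fat jar on the command line, the jar is distributed to the executors for you.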
