Spark Streaming job aborted due to stage failure when reading from Kafka topic

I'm new to Spark and Kafka, and I'm using Spark Streaming to process data coming from a Kafka topic. For now, I just want to print the records to the console. I have a mini cluster with Spark on two nodes (Scala version 2.12.2 and Spark 2.1.1) and a node with Kafka (version kafka_2.11-0.10.2.0). However, when I submit my code I get this error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 1.3.64.64, executor 1): java.lang.NoClassDefFoundError: scala/collection/GenTraversableOnce$class
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.<init>(KafkaRDD.scala:193)
    at org.apache.spark.streaming.kafka010.KafkaRDD.compute(KafkaRDD.scala:185)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Does it have something to do with the versions? Or maybe my code is not correct?

Here is my code:

import java.util.UUID
import org.apache.kafka.clients.consumer.ConsumerRecord
import runtime.ScalaRunTime.stringOf
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe


object followProduction {

  def main(args: Array[String]) = {

    val sparkConf = new SparkConf().setMaster("spark://<real adress here : 10. ...>:7077").setAppName("followProcess")
    val streamContext = new StreamingContext(sparkConf, Seconds(2))

    streamContext.checkpoint("checkpoint")

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "1.3.64.66:9094",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> s"${UUID.randomUUID().toString}",
      "auto.offset.reset" -> "earliest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    val topics = Array("test")
    val stream = KafkaUtils.createDirectStream[String, String](
      streamContext,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    stream.print()

    //stream.map(record => (record.key, record.value)).count().print()

    streamContext.start()
    streamContext.awaitTermination()
  }
}

And here is my build file:

name := "test"
version := "1.0"
scalaVersion := "2.12.2"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "2.1.1" %"provided"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.10" % "2.1.1" %"provided"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.10" % "2.0.0"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  case x => MergeStrategy.first
}

Any help would be appreciated. Thank you for your time.

Spark 2.1.x is compiled against Scala 2.11, not 2.12.

Try:

scalaVersion := "2.11.11"

Any 2.11.x version should work.

Also, your Kafka streaming dependency refers to Scala 2.10, when you need 2.11:

libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.1"

Apart from the version mismatches: since you are running on a Spark cluster, you also need to ship all your JARs (libraries) from the application where the Spark driver runs to the Spark worker nodes.

You can submit the JARs with SparkConf using the .setJars(libs) method.

Something like this:

lazy val conf: SparkConf = new SparkConf()
    .setMaster(sparkMaster)
    .setAppName(sparkAppName)
    .set("spark.app.id", sparkAppId)
    .set("spark.submit.deployMode", "cluster")
    .setJars(libs) //setting jars for sparkContext

Note: libs is a Seq[String], i.e. a sequence of library (JAR) paths.
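
For example, libs could simply point at the fat JAR produced by sbt assembly (the path below is hypothetical; use the actual output path of your build):

val libs: Seq[String] = Seq("target/scala-2.11/test-assembly-1.0.jar") // hypothetical path to the assembled JAR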
