java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
I have a Scala Spark Streaming application that receives data on the same topic from 3 different Kafka producers.
The Spark Streaming application runs on the machine with host 0.0.0.179, the Kafka server is on the machine with host 0.0.0.178, and the Kafka producers are on the machines 0.0.0.180, 0.0.0.181, and 0.0.0.182.
When I try to run the Spark Streaming application, I get the error below:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 1 times, most recent failure: Lost task 0.0 in stage 19.0 (TID 19, localhost): java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
    at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1625)
    at org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1198)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1204)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
I have read thousands of different posts, but no one seems to have found a solution to this issue. How can I handle this in my application? Do I have to modify some parameters on Kafka (at the moment the num.partition parameter is set to 1)?
Following is the code of my application:
// Imports needed for this snippet (json4s is used for the JSON parsing below)
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse

// Create the context with a 3 second batch interval
val sparkConf = new SparkConf().setAppName("SparkScript").set("spark.driver.allowMultipleContexts", "true").set("spark.streaming.concurrentJobs", "3").setMaster("local[4]")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(3))
case class Thema(name: String, metadata: String)
case class Tempo(unit: String, count: Int, metadata: String)
case class Spatio(unit: String, metadata: String)
case class Stt(spatial: Spatio, temporal: Tempo, thematic: Thema)
case class Location(latitude: Double, longitude: Double, name: String)
case class Datas1(location : Location, timestamp : String, windspeed : Double, direction: String, strenght : String)
case class Sensors1(sensor_name: String, start_date: String, end_date: String, data1: Datas1, stt: Stt)
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "0.0.0.178:9092",
"key.deserializer" -> classOf[StringDeserializer].getCanonicalName,
"value.deserializer" -> classOf[StringDeserializer].getCanonicalName,
"group.id" -> "test_luca",
"auto.offset.reset" -> "earliest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics1 = Array("topics1")
val s1 = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams)).map { record =>
  implicit val formats = DefaultFormats
  parse(record.value).extract[Sensors1]
}
s1.print()
s1.saveAsTextFiles("results/", "")
ssc.start()
ssc.awaitTermination()
Thank you.
Your problem is here:
s1.print()
s1.saveAsTextFiles("results/", "")
Spark creates a graph of flows, and you define two flows here:

Read from Kafka -> Print to console
Read from Kafka -> Save to text file

Spark will attempt to run both of these graphs concurrently, since they are independent of each other. Because the Kafka integration uses a cached-consumer approach, both stream executions effectively try to use the same consumer.
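For intuition, KafkaConsumer enforces single-threaded access with a lightweight ownership check rather than a lock, which is why concurrent use surfaces as a ConcurrentModificationException from acquire(). Below is a simplified, hypothetical Scala sketch of that pattern; SingleThreadedConsumer is an illustration invented here, not the real Kafka class (the real acquire also maintains a reentrancy refcount):

```scala
import java.util.ConcurrentModificationException
import java.util.concurrent.atomic.AtomicLong

// Hypothetical consumer that mimics Kafka's single-owner guard, for illustration only.
class SingleThreadedConsumer {
  private val NoOwner = -1L
  private val owner = new AtomicLong(NoOwner)

  // Record the calling thread as owner, or fail fast if another thread holds us.
  def acquire(): Unit = {
    val me = Thread.currentThread().getId
    if (owner.get() != me && !owner.compareAndSet(NoOwner, me))
      throw new ConcurrentModificationException(
        "KafkaConsumer is not safe for multi-threaded access")
  }

  def release(): Unit = owner.set(NoOwner)

  def poll(): String = {
    acquire()
    try "records" finally release()
  }
}
```

Two output operations sharing one cached consumer behave like two threads calling poll() on this class at the same time: whichever arrives second fails the ownership check and throws.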
What you can do is cache the DStream before running the two queries:
val dataFromKafka = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams)).map(/* stuff */)
val cachedStream = dataFromKafka.cache()
cachedStream.print()
cachedStream.saveAsTextFiles("results/", "")
Using cache worked for me. In my case, calling print, then a transformation, and then print again on a JavaPairDStream was giving me that error. I used cache just before the first print, and it worked for me.
s1.print()
s1.saveAsTextFiles("results/", "")
The code below will work; I used similar code.
s1.cache();
s1.print();
s1.saveAsTextFiles("results/", "");