
java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access

I have a Scala Spark Streaming application that receives data on the same topic from 3 different Kafka producers.

The Spark Streaming application runs on the machine with host 0.0.0.179, the Kafka server on the machine with host 0.0.0.178, and the Kafka producers on the machines 0.0.0.180, 0.0.0.181, and 0.0.0.182.

When I try to run the Spark Streaming application, I get the following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 19.0 failed 1 times, most recent failure: Lost task 0.0 in stage 19.0 (TID 19, localhost): java.util.ConcurrentModificationException: KafkaConsumer is not safe for multi-threaded access
    at org.apache.kafka.clients.consumer.KafkaConsumer.acquire(KafkaConsumer.java:1625)
    at org.apache.kafka.clients.consumer.KafkaConsumer.seek(KafkaConsumer.java:1198)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.seek(CachedKafkaConsumer.scala:95)
    at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:69)
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
    at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply$mcV$sp(PairRDDFunctions.scala:1204)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$7.apply(PairRDDFunctions.scala:1203)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1325)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211)
    at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)

I have read thousands of different posts, but no one seems to have found a solution to this issue.

How can I handle this in my application? Do I have to modify some parameters on Kafka (at the moment the num.partition parameter is set to 1)?

Following is the code of my application:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.json4s.DefaultFormats
import org.json4s.jackson.JsonMethods.parse // or org.json4s.native.JsonMethods.parse

// Create the context with a 3 second batch size
val sparkConf = new SparkConf()
  .setAppName("SparkScript")
  .set("spark.driver.allowMultipleContexts", "true")
  .set("spark.streaming.concurrentJobs", "3")
  .setMaster("local[4]")
val sc = new SparkContext(sparkConf)

val ssc = new StreamingContext(sc, Seconds(3))

case class Thema(name: String, metadata: String)
case class Tempo(unit: String, count: Int, metadata: String)
case class Spatio(unit: String, metadata: String)
case class Stt(spatial: Spatio, temporal: Tempo, thematic: Thema)
case class Location(latitude: Double, longitude: Double, name: String)

case class Datas1(location : Location, timestamp : String, windspeed : Double, direction: String, strenght : String)
case class Sensors1(sensor_name: String, start_date: String, end_date: String, data1: Datas1, stt: Stt)    


val kafkaParams = Map[String, Object](
    "bootstrap.servers" -> "0.0.0.178:9092",
    "key.deserializer" -> classOf[StringDeserializer].getCanonicalName,
    "value.deserializer" -> classOf[StringDeserializer].getCanonicalName,
    "group.id" -> "test_luca",
    "auto.offset.reset" -> "earliest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics1 = Array("topics1")

val s1 = KafkaUtils.createDirectStream[String, String](
    ssc,
    PreferConsistent,
    Subscribe[String, String](topics1, kafkaParams)
  ).map { record =>
    implicit val formats = DefaultFormats
    parse(record.value).extract[Sensors1]
  }

s1.print()
s1.saveAsTextFiles("results/", "")
ssc.start()
ssc.awaitTermination()

Thank you

Your problem is here:

s1.print()
s1.saveAsTextFiles("results/", "")

Spark creates a graph of flows, and you define two flows here:

Read from Kafka -> Print to console
Read from Kafka -> Save to text file

Spark will attempt to run both of these graphs concurrently, since they are independent of each other. And because Kafka uses a cached consumer approach, both stream executions effectively try to use the same consumer.

What you can do is cache the DStream before running the two queries:

val dataFromKafka = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics1, kafkaParams)).map(/* stuff */)

val cachedStream = dataFromKafka.cache()
cachedStream.print()
cachedStream.saveAsTextFiles("results/", "")
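
If restructuring the job is not possible, a commonly cited alternative is to disable the per-executor Kafka consumer cache, so that each concurrent job creates its own consumer. This is a sketch, assuming the spark-streaming-kafka-0-10 integration (the one in your stack trace), which exposes the spark.streaming.kafka.consumer.cache.enabled setting; it trades the error for the overhead of re-creating consumers on each batch:

// Sketch: turn off the cached-consumer mechanism named in the stack trace
// (CachedKafkaConsumer), so concurrent jobs no longer share one KafkaConsumer.
val sparkConf = new SparkConf()
  .setAppName("SparkScript")
  .setMaster("local[4]")
  .set("spark.streaming.kafka.consumer.cache.enabled", "false")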

Using cache worked for me. In my case, a print, a transformation, and then another print on a JavaPairDStream was giving me that error. I used cache just before the first print, and it worked for me.

s1.print()
s1.saveAsTextFiles("results/", "")

The code below will work; I used similar code:

s1.cache();
s1.print();
s1.saveAsTextFiles("results/", "");
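
A note on why this works: cache() on a DStream is shorthand for persist() with the DStream default storage level, and it marks each generated RDD to be kept once the first action has computed it, so the second action reads the persisted data instead of going back through the cached Kafka consumer. A minimal sketch, reusing the s1 stream from the question and assuming batches might be large enough to warrant spilling to disk:

import org.apache.spark.storage.StorageLevel

// persist() with an explicit level has the same effect as cache(),
// but lets large batches spill to disk instead of failing or being recomputed.
s1.persist(StorageLevel.MEMORY_AND_DISK)
s1.print()
s1.saveAsTextFiles("results/", "")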
