[英]Spark Parallel Stream - object not serializable
我正在使用Spark的多输入流阅读器从Kafka读取消息。 我得到下面提到的错误。 如果我不使用多个输入流阅读器,则不会出现任何错误。 为了达到性能,我需要使用并行概念,而测试目的我仅使用一个。
错误
java.io.NotSerializableException: org.apache.kafka.clients.consumer.ConsumerRecord
Serialization stack:
- object not serializable (class: org.apache.kafka.clients.consumer.ConsumerRecord, value: ConsumerRecord(topic = test, partition = 0, offset = 120, CreateTime = -1, checksum = 2372777361, serialized key size = -1, serialized value size = 48, key = null, value = 10051,2018-03-15 17:12:24+0000,Bentonville,Gnana))
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.SerializationStream.writeValue(Serializer.scala:134)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:239)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:151)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
18/03/15 17:12:24 ERROR TaskSetManager: Task 0.0 in stage 470.0 (TID 470) had a not serializable result: org.apache.kafka.clients.consumer.ConsumerRecord
码:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.Success
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
object ParallelStreamJob {
def main(args: Array[String]): Unit = {
val spark = SparkHelper.getOrCreateSparkSession()
val ssc = new StreamingContext(spark.sparkContext, Milliseconds(50))
val kafkaStream = {
val kafkaParams = Map[String, Object](
"bootstrap.servers" -> "localhost:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"group.id" -> "use_a_separate_group_id_for_each_stream",
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean)
)
val topics = Array("test")
val numPartitionsOfInputTopic = 1
val streams = (1 to numPartitionsOfInputTopic) map { _ =>
KafkaUtils.createDirectStream[String, String](
ssc,
PreferConsistent,
Subscribe[String, String](topics, kafkaParams)
)
}
val unifiedStream = ssc.union(streams)
val sparkProcessingParallelism = 1
unifiedStream.repartition(sparkProcessingParallelism)
}
kafkaStream.foreachRDD(rdd=> {
rdd.foreach(conRec=> {
println(conRec.value())
})
})
println(" Spark parallel reader is ready !!!")
ssc.start()
ssc.awaitTermination()
}
}
SBT
scalaVersion := "2.11.8"
val sparkVersion = "2.2.0"
val connectorVersion = "2.0.7"
val kafka_stream_version = "1.6.3"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion ,
"org.apache.spark" %% "spark-sql" % sparkVersion ,
"org.apache.spark" %% "spark-hive" % sparkVersion ,
"com.datastax.spark" %% "spark-cassandra-connector" % connectorVersion ,
"org.apache.kafka" %% "kafka" % "0.10.1.0",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion ,
)
如何解决这个问题?
问题很明显java.io.NotSerializableException:org.apache.kafka.clients.consumer.ConsumerRecord
。 ConsumerRecord类不扩展Serializable
尝试在foreachRdd
操作kafkaStream.map(_.value())
之前foreachRdd
ConsumerRecord
value
字段。
更新1:上面的修复不起作用,因为ssc.union(streams)
发生异常。 ssc.union(streams)
需要节点之间的数据传输,它必须序列化数据。 因此,您可以在union
操作之前按map
取出value
字段以解决问题。
KafkaUtils.createDirectStream[String, String]( ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParam) ).map(_.value())
首先,如果您有一个主题,则不应该使用创建多个Kafkastream的方式,而是使用直接方法,该方法将自动创建与主题的Kafka分区数量相同数量的线程.Spark将自动采用如果遵循DirectApproach,则需要并行处理任务。 尝试在RDD级别上使用repartition(),而不是对Dstream本身重新分区。
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.