Why is my Spark direct stream sending multiple offset commits to Kafka?
I am running the following code in Spark:
val sparkConf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("KafkaTest")
  .set("spark.streaming.kafka.maxRatePerPartition", "10")
  .set("spark.default.parallelism", "10")
  .set("spark.streaming.backpressure.enabled", "true")
  .set("spark.scheduler.mode", "FAIR")

lazy val sparkContext = new SparkContext(sparkConf)
val sparkJob = new SparkLocal

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "kafka-270894369.spark.google.com:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "stream_group1",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> "false",
  "heartbeat.interval.ms" -> "130000", // 3000
  "request.timeout.ms" -> "150000",    // 40000
  "session.timeout.ms" -> "140000",    // 30000
  "max.poll.interval.ms" -> "140000",  // isn't a known config
  "max.poll.records" -> "100"          // 2147483647
)

val streamingContext = new StreamingContext(sparkContext, Seconds(120))
val topics = Array("topicname")

val kafkaStream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

def messageTuple(tuple: ConsumerRecord[String, String]): (String) = {
  (null) // Removed the code
}

var offset: Array[OffsetRange] = null

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  offset = offsetRanges

  rdd.map(row => messageTuple(row))
    .foreachPartition { partition =>
      partition.map(row => null)
        .foreach { record =>
          print("")
          Thread.sleep(5)
        }
    }

  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}

streamingContext.start()
streamingContext.awaitTerminationOrTimeout(6000000)

sys.ShutdownHookThread {
  println("Gracefully shutting down App")
  streamingContext.stop(true, true)
  println("Application stopped")
}
With the above code I am observing multiple commits being sent to Kafka, and I am not sure why.
(I got the entries below from the __consumer_offsets topic.)
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864006531, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730000011, expireTimestamp=Some(1577816400011))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864006531, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730000012, expireTimestamp=Some(1577816400012))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864005827, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730000079, expireTimestamp=Some(1577816400079))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008524, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730120008, expireTimestamp=Some(1577816520008))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008524, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730120010, expireTimestamp=Some(1577816520010))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008524, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730120077, expireTimestamp=Some(1577816520077))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008959, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730240010, expireTimestamp=Some(1577816640010))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008959, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730240015, expireTimestamp=Some(1577816640015))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008959, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730240137, expireTimestamp=Some(1577816640137))
Ideally, based on my batch interval, we should see only one commit every 2 minutes, but in our case we are observing 3 commits per batch.
Also, during application restart we are losing data because of the above issue (commit mismatch).
Please help me with your inputs.
You're committing offsets within a foreachRDD. You should see one commit literally "for each RDD", and it appears you had at least three.
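One way to verify this is to attach a commit callback, so every commit that actually reaches Kafka is logged with its offsets and timestamp. `CanCommitOffsets.commitAsync` has an overload that takes an `OffsetCommitCallback`. A minimal sketch of the `foreachRDD` body, reusing the names from the question (the processing step is elided):

```scala
import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

kafkaStream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the RDD as before ...

  // Log each commit that actually reaches Kafka, so you can count
  // how many commits each 2-minute batch really produces.
  kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(
    offsetRanges,
    new OffsetCommitCallback {
      override def onComplete(
          offsets: java.util.Map[TopicPartition, OffsetAndMetadata],
          exception: Exception): Unit = {
        if (exception != null)
          println(s"Commit failed: ${exception.getMessage}")
        else
          println(s"Committed at ${System.currentTimeMillis()}: $offsets")
      }
    }
  )
}
```

Note also that `commitAsync` only queues the offset ranges on the driver; the actual commit request is issued at the start of a later batch, so commit timestamps can lag the batch that produced them.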