Why is my Spark Direct stream sending multiple offset commits to Kafka?

I am running the following code in Spark:

  import org.apache.kafka.clients.consumer.ConsumerRecord
  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
  import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, KafkaUtils, OffsetRange}

  val sparkConf = new SparkConf()
    .setMaster("local[*]")
    .setAppName("KafkaTest")
    .set("spark.streaming.kafka.maxRatePerPartition","10")
    .set("spark.default.parallelism","10")
    .set("spark.streaming.backpressure.enabled", "true")
    .set("spark.scheduler.mode", "FAIR")

  lazy val sparkContext = new SparkContext(sparkConf)
  val sparkJob = new SparkLocal

  val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "kafka-270894369.spark.google.com:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "stream_group1",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> "false",
      "heartbeat.interval.ms" -> "130000", //3000
      "request.timeout.ms" -> "150000", //40000
      "session.timeout.ms" -> "140000", //30000
      "max.poll.interval.ms" -> "140000", //isn't a known config
      "max.poll.records" -> "100" //2147483647
    )

    val streamingContext = new StreamingContext(sparkContext, Seconds(120))

    val topics = Array("topicname")

    val kafkaStream = KafkaUtils.createDirectStream[String, String](
      streamingContext,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )

    def messageTuple(tuple: ConsumerRecord[String, String]): String = {
      null // Removed the code
    }

    var offset : Array[OffsetRange] = null

    kafkaStream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      offset = offsetRanges

      rdd.map(row => messageTuple(row))
        .foreachPartition { partition =>
          partition.map(row => null)
            .foreach { record =>
              print("")
              Thread.sleep(5)
            }
        }
      kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    streamingContext.start()
    streamingContext.awaitTerminationOrTimeout(6000000)

    sys.ShutdownHookThread{
      println("Gracefully shutting down App")
      streamingContext.stop(true,true)
      println("Application stopped")
    }

With the above code I am observing multiple commits being sent to Kafka, and I am not sure why.

(Got the below from the __consumer_offsets topic)

[stream_group1,topicname,59]::OffsetAndMetadata(offset=864006531, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730000011, expireTimestamp=Some(1577816400011))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864006531, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730000012, expireTimestamp=Some(1577816400012))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864005827, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730000079, expireTimestamp=Some(1577816400079))

[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008524, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730120008, expireTimestamp=Some(1577816520008))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008524, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730120010, expireTimestamp=Some(1577816520010))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008524, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730120077, expireTimestamp=Some(1577816520077))

[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008959, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730240010, expireTimestamp=Some(1577816640010))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008959, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730240015, expireTimestamp=Some(1577816640015))
[stream_group1,topicname,59]::OffsetAndMetadata(offset=864008959, leaderEpoch=Optional.empty, metadata=, commitTimestamp=1577730240137, expireTimestamp=Some(1577816640137))

Ideally we should see only one commit every 2 minutes, matching my batch interval, but in our case we are observing 3 commits.

Also, during application restarts we are losing data because of the above issue (commit mismatch).
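
A quick way to check what the group has actually committed after a restart is to ask the broker directly. This is a diagnostic sketch with a plain KafkaConsumer, reusing the broker address and group id from the code above; partition 59 is just the one that appears in the log:

  import java.util.Properties

  import org.apache.kafka.clients.consumer.KafkaConsumer
  import org.apache.kafka.common.TopicPartition
  import org.apache.kafka.common.serialization.StringDeserializer

  object CheckCommittedOffset {
    def main(args: Array[String]): Unit = {
      val props = new Properties()
      props.put("bootstrap.servers", "kafka-270894369.spark.google.com:9092")
      props.put("group.id", "stream_group1")
      props.put("key.deserializer", classOf[StringDeserializer].getName)
      props.put("value.deserializer", classOf[StringDeserializer].getName)

      val consumer = new KafkaConsumer[String, String](props)
      try {
        // committed() fetches the group's last committed offset for this
        // partition from the broker, without joining the group or polling.
        val tp = new TopicPartition("topicname", 59)
        println(s"$tp -> ${consumer.committed(tp)}")
      } finally {
        consumer.close()
      }
    }
  }

Comparing that value against the last offset the job actually processed shows whether the commit ran ahead of the processing.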

Please help me with your inputs.

You're committing offsets within a foreachRDD. You should see one commit literally "for each RDD", and it appears you had at least three.
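
If it helps to see exactly when each commit reaches the broker, commitAsync also has an overload that takes an OffsetCommitCallback. A minimal sketch, reusing kafkaStream from the question (the log messages are illustrative):

  import java.util.{Map => JMap}

  import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
  import org.apache.kafka.common.TopicPartition
  import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

  kafkaStream.foreachRDD { rdd =>
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    // ... process the batch as in the question ...

    // One commitAsync call per RDD, i.e. per batch. The callback fires on
    // the driver for every commit that actually lands, so any extra commits
    // show up in the log together with the offsets they carried.
    kafkaStream.asInstanceOf[CanCommitOffsets].commitAsync(
      offsetRanges,
      new OffsetCommitCallback {
        override def onComplete(offsets: JMap[TopicPartition, OffsetAndMetadata],
                                exception: Exception): Unit = {
          if (exception != null) println(s"commit failed: $exception")
          else println(s"committed: $offsets")
        }
      }
    )
  }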
