
Spark batch reading from Kafka & using Kafka to keep track of offsets

I understand that using Kafka's own offset tracking instead of other methods (like checkpointing) is problematic for streaming jobs.

However I just want to run a Spark batch job every day, reading all messages from the last offset to the most recent and do some ETL with it.

In theory I want to read this data like so:

val dataframe = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:6001")
      .option("subscribe", "topic-in")
      .option("includeHeaders", "true")
      .option("kafka.group.id", s"consumer-group-for-this-job")
      .load()

And have Spark commit the offsets back to Kafka based on the group.id.

Unfortunately Spark never commits these back, so I got creative and added this code at the end of my ETL job to manually update the offsets for the consumer in Kafka:

val offsets: Map[TopicPartition, OffsetAndMetadata] = dataFrame
      .select('topic, 'partition, 'offset)
      .groupBy("topic", "partition")
      .agg(max('offset))
      .as[(String, Int, Long)]
      .collect()
      .map {
        case (topic, partition, maxOffset) => new TopicPartition(topic, partition) -> new OffsetAndMetadata(maxOffset)
      }
      .toMap

val props = new Properties()
props.put("group.id", "consumer-group-for-this-job")
props.put("bootstrap.servers", "localhost:6001")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("enable.auto.commit", "false")

val kafkaConsumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
kafkaConsumer.commitSync(offsets.asJava)

This technically works, but the next time I read with this group.id, Spark still starts from the beginning.

Do I have to bite the bullet and keep track of the offsets somewhere, or is there something I'm overlooking?

BTW I'm testing this with EmbeddedKafka.

"However I just want to run a Spark batch job every day, reading all messages from the last offset to the most recent and do some ETL with it." “但是我只想每天运行一个 Spark 批处理作业,读取从最后一个偏移量到最近一个偏移量的所有消息,并用它做一些 ETL。”

Trigger.Once is made for exactly this kind of requirement.

There is a nice blog post from Databricks that explains why "Streaming and RunOnce is Better than Batch".

Most importantly:

"When you're running a batch job that performs incremental updates, you generally have to deal with figuring out what data is new, what you should process, and what you should not. Structured Streaming already does all this for you." “当您运行执行增量更新的批处理作业时,您通常必须弄清楚哪些数据是新的,应该处理什么,不应该处理什么。结构化流已经为您完成了这一切。”

Although your approach works technically, I would really recommend letting Spark take care of the offset management.
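As a minimal sketch of that approach (reusing the topic-in topic and localhost:6001 broker from the question; the parquet sink, output path and checkpoint directory are placeholders to adapt), the daily job could look roughly like this:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("daily-kafka-etl").getOrCreate()

// Read as a stream; Structured Streaming tracks the Kafka offsets in the checkpoint,
// so no consumer group or manual commit is needed.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:6001")
  .option("subscribe", "topic-in")
  .option("includeHeaders", "true")
  .load()

// ... your ETL transformations on df go here ...

// Trigger.Once processes everything that is new since the last run and then stops,
// which makes this streaming query behave like a daily batch job.
df.writeStream
  .format("parquet")                                      // placeholder sink
  .option("path", "/data/etl/topic-in")                   // placeholder output path
  .option("checkpointLocation", "/checkpoints/topic-in")  // offsets are stored here between runs
  .trigger(Trigger.Once())
  .start()
  .awaitTermination()

The next run resumes from whatever the checkpoint recorded, so the "what is new" bookkeeping is entirely Spark's.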

It probably does not work with EmbeddedKafka, as it runs only in memory and does not remember that you committed offsets between runs of your test code. Therefore, it starts reading from the earliest offset again and again.

I managed to resolve it by leaving the spark.read as is (ignoring the group.id etc.), but surrounding it with my own KafkaConsumer logic:

import java.util.Properties

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.json4s.jackson.Serialization // or org.json4s.native.Serialization
import scala.util.{Failure, Success, Try}

// config, logger, name and consumerGroupId come from the surrounding class.
  protected val kafkaConsumer: String => KafkaConsumer[Array[Byte], Array[Byte]] =
    groupId => {
      val props = new Properties()
      props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
      props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, config.bootstrapServers)
      props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer")
      props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer")
      props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
      props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
      new KafkaConsumer[Array[Byte], Array[Byte]](props)
    }

  protected def getPartitions(kafkaConsumer: KafkaConsumer[_, _], topic: String): List[TopicPartition] = {
    import scala.collection.JavaConverters._

    kafkaConsumer
      .partitionsFor(topic)
      .asScala
      .map(p => new TopicPartition(topic, p.partition()))
      .toList
  }

  protected def getPartitionOffsets(kafkaConsumer: KafkaConsumer[_, _], topic: String, partitions: List[TopicPartition]): Map[String, Map[String, Long]] = {
    Map(
      topic -> partitions
        .map(p => p.partition().toString -> kafkaConsumer.position(p))
        .map {
          // A position of 0 means nothing was consumed yet for this partition;
          // Spark's "startingOffsets" JSON treats -2 as "earliest" (and -1 as "latest").
          case (partition, offset) if offset == 0L => partition -> -2L
          case mapping                             => mapping
        }
        .toMap
    )
  }

  def getStartingOffsetsString(kafkaConsumer: KafkaConsumer[_, _], topic: String)(implicit logger: Logger): String = {
    Try {
      import scala.collection.JavaConverters._

      val partitions: List[TopicPartition] = getPartitions(kafkaConsumer, topic)

      kafkaConsumer.assign(partitions.asJava)

      val startOffsets: Map[String, Map[String, Long]] = getPartitionOffsets(kafkaConsumer, topic, partitions)

      logger.debug(s"Starting offsets for $topic: ${startOffsets(topic).filterNot(_._2 == -2L)}")

      implicit val formats = org.json4s.DefaultFormats
      Serialization.write(startOffsets)
    } match {
      case Success(jsonOffsets) => jsonOffsets
      case Failure(e) =>
        logger.error(s"Failed to retrieve starting offsets for $topic: ${e.getMessage}")
        "earliest"
    }
  }
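For illustration (hypothetical values: a three-partition topic-in where partition 2 has no committed position yet), the JSON string returned above and passed to startingOffsets would look like this, with -2 meaning "earliest" and -1 meaning "latest" in Spark's offset JSON format:

{"topic-in":{"0":42,"1":17,"2":-2}}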

// MAIN CODE

    val groupId              = consumerGroupId(name)
    val currentKafkaConsumer = kafkaConsumer(groupId)
    val topic                = config.topic.getOrElse(name)

    val startingOffsets = getStartingOffsetsString(currentKafkaConsumer, topic)

    val dataFrame = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", config.bootstrapServers)
      .option("subscribe", topic)
      .option("includeHeaders", "true")
      .option("startingOffsets", startingOffsets)
      .option("enable.auto.commit", "false")
      .load()

    Try {
      import scala.collection.JavaConverters._

      val partitions: List[TopicPartition] = getPartitions(currentKafkaConsumer, topic)

      val numRecords = dataFrame.cache().count()        // actually read the data from Kafka
      currentKafkaConsumer.seekToEnd(partitions.asJava) // assume the read has consumed everything

      val endOffsets: Map[String, Map[String, Long]] = getPartitionOffsets(currentKafkaConsumer, topic, partitions)

      logger.debug(s"Loaded $numRecords records")
      logger.debug(s"Ending offsets for $topic: ${endOffsets(topic).filterNot(_._2 == -2L)}")

      currentKafkaConsumer.commitSync()
      currentKafkaConsumer.close()
    } match {
      case Success(_) => ()
      case Failure(e) =>
        logger.error(s"Failed to set offsets for $topic: ${e.getMessage}")
    }
