Spark batch reading from Kafka & using Kafka to keep track of offsets
I know that using Kafka's own offset tracking instead of other methods (such as checkpointing) is problematic for streaming jobs.
But I simply want to run one Spark batch job per day that reads all messages from the last offset up to the most recent one and does some ETL with them.
In theory, I want to read that data like this:
val dataframe = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:6001")
  .option("subscribe", "topic-in")
  .option("includeHeaders", "true")
  .option("kafka.group.id", s"consumer-group-for-this-job")
  .load()
and have Spark commit the offsets back to Kafka based on the group.id.
Unfortunately, Spark never commits them back, so I got creative and added this piece of code at the end of my ETL job to manually update the consumer offsets in Kafka:
val offsets: Map[TopicPartition, OffsetAndMetadata] = dataFrame
  .select('topic, 'partition, 'offset)
  .groupBy("topic", "partition")
  .agg(max('offset))
  .as[(String, Int, Long)]
  .collect()
  .map {
    case (topic, partition, maxOffset) => new TopicPartition(topic, partition) -> new OffsetAndMetadata(maxOffset)
  }
  .toMap
val props = new Properties()
props.put("group.id", "consumer-group-for-this-job")
props.put("bootstrap.servers", "localhost:6001")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("enable.auto.commit", "false")
val kafkaConsumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
kafkaConsumer.commitSync(offsets.asJava)
This technically works, but the next time Spark reads with this group.id it still starts from the beginning.
Do I have to bite the bullet and keep track of the offsets somewhere myself, or am I overlooking something?
By the way, I am using EmbeddedKafka for testing.
“But I simply want to run one Spark batch job per day that reads all messages from the last offset up to the most recent one and does some ETL with them.”
Trigger.Once was designed for exactly this kind of requirement.
Databricks has a nice blog post explaining why “Streaming and RunOnce is Better than Batch”.
Most importantly:
“When you run a batch job that performs incremental updates, you generally have to figure out what data is new, what you should process, and what you should not. Structured Streaming already does all of this for you.”
Even though your approach technically works, I would really recommend letting Spark take care of the offset management.
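A minimal sketch of what that could look like (the parquet sink, output path and checkpoint location below are placeholders, not taken from your job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder().appName("daily-kafka-etl").getOrCreate()

// Read as a stream; the processed offsets live in the checkpoint, so no group.id or commitSync is needed
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:6001")
  .option("subscribe", "topic-in")
  .option("includeHeaders", "true")
  .load()

// ... your ETL transformations on `stream` would go here ...

stream.writeStream
  .format("parquet")                                         // placeholder sink
  .option("path", "/data/topic-in-etl")                      // placeholder output path
  .option("checkpointLocation", "/checkpoints/topic-in-etl") // Spark remembers the offsets here
  .trigger(Trigger.Once())                                   // process everything new since the last run, then stop
  .start()
  .awaitTermination()

Each scheduled run then continues exactly where the previous one stopped, driven by the checkpoint instead of Kafka consumer group offsets.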
Your approach probably does not work with EmbeddedKafka because it runs in memory only and does not remember that you committed some offsets between runs of your test code. That is why it reads from the earliest offset over and over again.
I managed to solve it by keeping the spark.read as it is, ignoring the group.id etc., but wrapping it with my own KafkaConsumer logic.
protected val kafkaConsumer: String => KafkaConsumer[Array[Byte], Array[Byte]] =
  groupId => {
    val props = new Properties()
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, config.bootstrapServers)
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
    new KafkaConsumer[Array[Byte], Array[Byte]](props)
  }
protected def getPartitions(kafkaConsumer: KafkaConsumer[_, _], topic: String): List[TopicPartition] = {
  import scala.collection.JavaConverters._
  kafkaConsumer
    .partitionsFor(topic)
    .asScala
    .map(p => new TopicPartition(topic, p.partition()))
    .toList
}
protected def getPartitionOffsets(kafkaConsumer: KafkaConsumer[_, _], topic: String, partitions: List[TopicPartition]): Map[String, Map[String, Long]] = {
  Map(
    topic -> partitions
      .map(p => p.partition().toString -> kafkaConsumer.position(p))
      .map {
        case (partition, offset) if offset == 0L => partition -> -2L
        case mapping => mapping
      }
      .toMap
  )
}
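// For reference: in the startingOffsets JSON that Spark accepts, -2 means "earliest" and -1 means "latest"
// for a partition, which is why an uninitialised position of 0 is mapped to -2L above (read that partition
// from the beginning). The serialized string ends up looking roughly like {"topic-in":{"0":42,"1":-2}}
// (example values).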
def getStartingOffsetsString(kafkaConsumer: KafkaConsumer[_, _], topic: String)(implicit logger: Logger): String = {
  Try {
    import scala.collection.JavaConverters._
    val partitions: List[TopicPartition] = getPartitions(kafkaConsumer, topic)
    kafkaConsumer.assign(partitions.asJava)
    val startOffsets: Map[String, Map[String, Long]] = getPartitionOffsets(kafkaConsumer, topic, partitions)
    logger.debug(s"Starting offsets for $topic: ${startOffsets(topic).filterNot(_._2 == -2L)}")
    implicit val formats = org.json4s.DefaultFormats
    Serialization.write(startOffsets)
  } match {
    case Success(jsonOffsets) => jsonOffsets
    case Failure(e) =>
      logger.error(s"Failed to retrieve starting offsets for $topic: ${e.getMessage}")
      "earliest"
  }
}
// MAIN CODE
val groupId = consumerGroupId(name)
val currentKafkaConsumer = kafkaConsumer(groupId)
val topic = config.topic.getOrElse(name)
val startingOffsets = getStartingOffsetsString(currentKafkaConsumer, topic)
val dataFrame = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", config.bootstrapServers)
  .option("subscribe", topic)
  .option("includeHeaders", "true")
  .option("startingOffsets", startingOffsets)
  .option("enable.auto.commit", "false")
  .load()
Try {
  import scala.collection.JavaConverters._
  val partitions: List[TopicPartition] = getPartitions(currentKafkaConsumer, topic)
  val numRecords = dataFrame.cache().count() // actually read the data from Kafka
  currentKafkaConsumer.seekToEnd(partitions.asJava) // assume the read has consumed everything
  val endOffsets: Map[String, Map[String, Long]] = getPartitionOffsets(currentKafkaConsumer, topic, partitions)
  logger.debug(s"Loaded $numRecords records")
  logger.debug(s"Ending offsets for $topic: ${endOffsets(topic).filterNot(_._2 == -2L)}")
  currentKafkaConsumer.commitSync()
  currentKafkaConsumer.close()
} match {
  case Success(_) => ()
  case Failure(e) =>
    logger.error(s"Failed to set offsets for $topic: ${e.getMessage}")
}