Spark batch reading from Kafka & using Kafka to keep track of offsets
I know that using Kafka's own offset tracking instead of other methods (like checkpointing) is problematic for streaming jobs.

But I just want to run a Spark batch job once a day, reading all messages from the last offset up to the most recent one and doing some ETL with them.

In theory, I want to read this data like so:
```scala
val dataFrame = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:6001")
  .option("subscribe", "topic-in")
  .option("includeHeaders", "true")
  .option("kafka.group.id", "consumer-group-for-this-job")
  .load()
```
and have Spark commit the offsets back to Kafka based on the group.id.

Unfortunately, Spark never commits these back, so I got creative and added this code at the end of my ETL job to manually update the consumer's offsets in Kafka:
```scala
import java.util.Properties

import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.sql.functions.max

import spark.implicits._

// Highest offset read per topic/partition in this batch. Kafka expects the
// committed offset to be the *next* offset to read, hence maxOffset + 1.
val offsets: Map[TopicPartition, OffsetAndMetadata] = dataFrame
  .select('topic, 'partition, 'offset)
  .groupBy("topic", "partition")
  .agg(max('offset))
  .as[(String, Int, Long)]
  .collect()
  .map {
    case (topic, partition, maxOffset) =>
      new TopicPartition(topic, partition) -> new OffsetAndMetadata(maxOffset + 1)
  }
  .toMap

val props = new Properties()
props.put("group.id", "consumer-group-for-this-job")
props.put("bootstrap.servers", "localhost:6001")
props.put("key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer")
props.put("enable.auto.commit", "false")

val kafkaConsumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
kafkaConsumer.commitSync(offsets.asJava)
kafkaConsumer.close()
```
This technically works, but the next time Spark reads with this group.id it still starts from the beginning.

Do I have to bite the bullet and keep track of the offsets somewhere myself, or am I overlooking something?
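For what it's worth, the commit itself can be verified by reading the committed offsets back from the broker. Here is a small sketch that reuses the `props` from above; the `checker` name is just for illustration, and `KafkaConsumer.committed` over a set of partitions needs kafka-clients 2.4+:

```scala
import scala.collection.JavaConverters._

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Ask the broker which offsets are currently committed for this group,
// reusing the `props` defined above.
val checker = new KafkaConsumer[Array[Byte], Array[Byte]](props)
val partitions = checker
  .partitionsFor("topic-in").asScala
  .map(p => new TopicPartition("topic-in", p.partition()))
  .toSet
checker.committed(partitions.asJava).asScala.foreach {
  // the value is null for partitions without a committed offset
  case (tp, meta) =>
    println(s"$tp -> ${Option(meta).map(_.offset()).getOrElse("nothing committed")}")
}
checker.close()
```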
By the way, I am testing all of this with EmbeddedKafka.
"But I just want to run a Spark batch job once a day, reading all messages from the last offset up to the most recent one and doing some ETL with them."

Trigger.Once was designed for exactly this requirement.

There is a nice Databricks blog post explaining why "Streaming and RunOnce is Better than Batch".

Most importantly:

"When you're running a batch job that performs incremental updates, you generally have to deal with figuring out what data is new, what you should process, and what you should not. Structured Streaming already does all of this for you."

Even though your approach technically works, I would really recommend letting Spark take care of the offset management.

It probably does not work with EmbeddedKafka because it runs in memory only and does not remember, between runs of your test code, that you committed some offsets. Hence it starts reading from the earliest offset again and again.
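A minimal sketch of what the Trigger.Once variant could look like, assuming the broker and topic from the question; the checkpoint location and the parquet sink are illustrative placeholders. The point is that Spark keeps the consumed offsets in the checkpoint directory itself, so no Kafka consumer group offsets are involved:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

val query = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:6001")
  .option("subscribe", "topic-in")
  .option("includeHeaders", "true")
  .load()
  .writeStream
  .trigger(Trigger.Once()) // process everything that is new, then stop
  .option("checkpointLocation", "/path/to/checkpoint/topic-in") // hypothetical path; Spark stores its offsets here
  .foreachBatch { (batch: DataFrame, _: Long) =>
    // the ETL goes here; casting and appending to parquet is just an example
    batch
      .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
      .write.mode("append").parquet("/path/to/output")
  }
  .start()

query.awaitTermination() // returns once the single trigger has drained the topic
```

Scheduled once per day, each run picks up exactly the records that arrived since the previous checkpoint.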
I managed to solve it by keeping the spark.read as it is, ignoring the group.id etc., and wrapping it with my own KafkaConsumer logic:
```scala
import java.util.Properties

import scala.collection.JavaConverters._
import scala.util.{Failure, Success, Try}

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import org.apache.kafka.common.TopicPartition
import org.json4s.jackson.Serialization // any json4s Serialization backend works here
import org.slf4j.Logger // assumption: any logger with debug/error works

// Build a consumer for the given group; auto-commit stays off because the
// offsets are committed explicitly at the end of the job.
protected val kafkaConsumer: String => KafkaConsumer[Array[Byte], Array[Byte]] =
  groupId => {
    val props = new Properties()
    props.put(ConsumerConfig.GROUP_ID_CONFIG, groupId)
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, config.bootstrapServers)
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.ByteArrayDeserializer")
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
    props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false")
    new KafkaConsumer[Array[Byte], Array[Byte]](props)
  }

protected def getPartitions(kafkaConsumer: KafkaConsumer[_, _], topic: String): List[TopicPartition] = {
  kafkaConsumer
    .partitionsFor(topic)
    .asScala
    .map(p => new TopicPartition(topic, p.partition()))
    .toList
}

// Positions per partition, keyed the way Spark's startingOffsets JSON wants
// them; a position of 0 is mapped to -2, Spark's sentinel for "earliest".
protected def getPartitionOffsets(kafkaConsumer: KafkaConsumer[_, _], topic: String, partitions: List[TopicPartition]): Map[String, Map[String, Long]] = {
  Map(
    topic -> partitions
      .map(p => p.partition().toString -> kafkaConsumer.position(p))
      .map {
        case (partition, offset) if offset == 0L => partition -> -2L
        case mapping => mapping
      }
      .toMap
  )
}

def getStartingOffsetsString(kafkaConsumer: KafkaConsumer[_, _], topic: String)(implicit logger: Logger): String = {
  Try {
    val partitions: List[TopicPartition] = getPartitions(kafkaConsumer, topic)
    kafkaConsumer.assign(partitions.asJava)
    val startOffsets: Map[String, Map[String, Long]] = getPartitionOffsets(kafkaConsumer, topic, partitions)
    logger.debug(s"Starting offsets for $topic: ${startOffsets(topic).filterNot(_._2 == -2L)}")
    implicit val formats = org.json4s.DefaultFormats
    Serialization.write(startOffsets)
  } match {
    case Success(jsonOffsets) => jsonOffsets
    case Failure(e) =>
      logger.error(s"Failed to retrieve starting offsets for $topic: ${e.getMessage}")
      "earliest" // fall back to reading the whole topic
  }
}
```
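For reference, the JSON produced here follows the shape Spark's `startingOffsets` option expects, e.g. `{"topic-in":{"0":42,"1":-2}}`. Per the Spark documentation, `-2` as an offset means "earliest" and `-1` means "latest", which is why partitions without a position are mapped to `-2` above.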
```scala
// MAIN CODE
val groupId = consumerGroupId(name)
val currentKafkaConsumer = kafkaConsumer(groupId)
val topic = config.topic.getOrElse(name)
val startingOffsets = getStartingOffsetsString(currentKafkaConsumer, topic)

val dataFrame = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", config.bootstrapServers)
  .option("subscribe", topic)
  .option("includeHeaders", "true")
  .option("startingOffsets", startingOffsets)
  .option("enable.auto.commit", "false") // note: without the "kafka." prefix Spark ignores this option
  .load()

Try {
  import scala.collection.JavaConverters._
  val partitions: List[TopicPartition] = getPartitions(currentKafkaConsumer, topic)
  val numRecords = dataFrame.cache().count() // actually read the data from Kafka
  currentKafkaConsumer.seekToEnd(partitions.asJava) // assume the read has consumed everything
  val endOffsets: Map[String, Map[String, Long]] = getPartitionOffsets(currentKafkaConsumer, topic, partitions)
  logger.debug(s"Loaded $numRecords records")
  logger.debug(s"Ending offsets for $topic: ${endOffsets(topic).filterNot(_._2 == -2L)}")
  currentKafkaConsumer.commitSync() // commits the current positions of all assigned partitions
  currentKafkaConsumer.close()
} match {
  case Success(_) => ()
  case Failure(e) =>
    logger.error(s"Failed to set offsets for $topic: ${e.getMessage}")
}
```
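One caveat of this approach: `seekToEnd` commits wherever the log ends at that moment, so records arriving between the `spark.read` and the `seekToEnd` call would be skipped on the next run. Committing offsets derived from the maximum offsets actually present in the DataFrame (as in the question) avoids that gap.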