
How do I read from the same starting offset in each micro batch in a Spark Structured Streaming job?

I am using Spark Structured Streaming. Is it possible to reset the Kafka offsets after every batch execution, so that every batch reads from the same starting offset instead of only newly discovered events?

Quoting the description of startingOffsets from the Spark Kafka integration documentation here:

For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.

Right now I am doing it by creating a static DataFrame from Kafka inside a foreachBatch loop, driven by a dummy streaming dataset with format "rate". Wondering if there is a better way to do it.
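For context, a minimal sketch of that workaround, assuming a hypothetical broker at broker:9092 and a topic named events (both made up for illustration): the dummy "rate" stream only drives the micro-batch schedule, and each batch does a fresh static read of the whole topic.

import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().appName("ReplayKafkaEachBatch").getOrCreate()

// The dummy "rate" stream exists only to trigger micro-batches.
val trigger = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

// A static (batch) read rescans the topic from the earliest offset
// on every micro-batch, regardless of what earlier batches consumed.
def processBatch(batchDF: DataFrame, batchId: Long): Unit = {
  val full = spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092") // hypothetical broker
    .option("subscribe", "events")                    // hypothetical topic
    .option("startingOffsets", "earliest")
    .load()
  full.selectExpr("CAST(value AS STRING)").show(truncate = false)
}

val query = trigger.writeStream.foreachBatch(processBatch _).start()
query.awaitTermination()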

For Structured Streaming you can set startingOffsets to earliest so that you consume from the earliest available offset every time. The following will do the trick:

.option("startingOffsets", "earliest")

However, note that this is effective only for newly created queries:

startingOffsets

The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
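For reference, the per-TopicPartition JSON form mentioned above looks like this (the topic name, partitions, and offsets are made up; -2 means earliest, -1 means latest):

.option("startingOffsets", """{"topic1":{"0":23,"1":-2}}""")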


Alternatively, you might also choose to change the consumer group every time:

.option("kafka.group.id", "newGroupID")

