How do I read from the same starting offset in each micro batch in a Spark Structured Streaming job?
I am using Spark Structured Streaming. Is it possible to reset the Kafka offset after every batch execution, so that every batch reads from the same starting offset instead of only newly discovered events?
Quoting the description of startingOffsets from the Spark Kafka integration documentation here:

For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
Right now I am doing it by creating a static DataFrame from Kafka inside a foreachBatch loop, driven by a dummy streaming dataset with the "rate" format. Wondering if there is a better way to do it.
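For reference, the workaround described above can be sketched roughly like this; the topic name, broker address, and trigger interval are placeholders, not values from the question:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder.appName("replay-from-earliest").getOrCreate()

// Dummy "rate" stream, used only to drive the micro-batch schedule.
val ticks = spark.readStream.format("rate").option("rowsPerSecond", 1).load()

ticks.writeStream
  .trigger(Trigger.ProcessingTime("1 minute"))          // placeholder interval
  .foreachBatch { (_: DataFrame, batchId: Long) =>
    // Static (batch) read: unlike a streaming source, this starts
    // from "earliest" again on every invocation.
    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")   // placeholder
      .option("subscribe", "my-topic")                  // placeholder
      .option("startingOffsets", "earliest")
      .load()
    df.selectExpr("CAST(value AS STRING)").show(false)  // process the full topic
  }
  .start()
```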
For Structured Streaming you can set startingOffsets to earliest, so that every time you consume from the earliest available offset. The following will do the trick:
.option("startingOffsets", "earliest")
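In context, a minimal streaming read might look like the following sketch; the broker address and topic name are placeholders:

```scala
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092") // placeholder
  .option("subscribe", "my-topic")                // placeholder
  .option("startingOffsets", "earliest")          // honored only for a brand-new query
  .load()
```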
However, note that this is effective only for newly created queries:
The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.
Alternatively, you might also choose to change the consumer group every time:
.option("kafka.group.id", "newGroupID")