
How do I read from the same starting offset in each micro-batch in a Spark Structured Streaming job?

I am using Spark Structured Streaming. Is it possible to reset the Kafka offset after every batch execution so that every batch reads from the same starting offset, instead of only from newly discovered events?

Quoting the description of startingOffsets from the Spark Kafka integration documentation:

For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.

Right now I am doing it by creating a static DataFrame from Kafka inside a foreachBatch loop, driven by a dummy streaming dataset with format "rate". I am wondering if there is a better way to do it.
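For reference, a minimal sketch of that workaround: a rate source only triggers the micro-batches, and inside foreachBatch a batch (non-streaming) Kafka read is issued so every micro-batch starts from the earliest offset again. The broker address localhost:9092, the topic name my-topic, and the trigger interval are assumptions, not values from the original question.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.Trigger

val spark = SparkSession.builder()
  .appName("ReReadKafkaEachBatch")
  .getOrCreate()

// Dummy streaming source; its rows are ignored, it only drives the micro-batches.
val ticks = spark.readStream
  .format("rate")
  .option("rowsPerSecond", 1)
  .load()

val query = ticks.writeStream
  .trigger(Trigger.ProcessingTime("30 seconds")) // assumed re-read interval
  .foreachBatch { (batchDf: DataFrame, batchId: Long) =>
    // Batch read from Kafka: a batch query always honours startingOffsets,
    // so this re-reads the topic from the earliest offset on every trigger.
    val kafkaDf = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker
      .option("subscribe", "my-topic")                      // assumed topic
      .option("startingOffsets", "earliest")
      .load()

    kafkaDf.selectExpr("CAST(value AS STRING) AS value").show(10, truncate = false)
  }
  .start()

query.awaitTermination()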

For Structured Streaming you can set startingOffsets to earliest so that every time you consume from the earliest available offset. The following will do the trick:

.option("startingOffsets", "earliest")

However, note that this is effective only for newly created queries:

startingOffsets

The start point when a query is started, either "earliest" which is from the earliest offsets, "latest" which is just from the latest offsets, or a json string specifying a starting offset for each TopicPartition. In the json, -2 as an offset can be used to refer to earliest, -1 to latest. Note: For batch queries, latest (either implicitly or by using -1 in json) is not allowed. For streaming queries, this only applies when a new query is started, and that resuming will always pick up from where the query left off. Newly discovered partitions during a query will start at earliest.


Alternatively, you might also choose to change the consumer group every time:

.option("kafka.group.id", "newGroupID")
