
Rewind Offset in Spark Structured Streaming from Kafka

I am using Spark Structured Streaming (2.2.1) to consume a topic from Kafka (0.10).

 val df = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", fromKafkaServers)
      .option("subscribe", topicName)
      .option("startingOffset", "earliest")
      .load()

My checkpoint location is set to an external HDFS directory. In some cases, I would like to restart the streaming application and consume data from the beginning. However, even though I delete all the checkpointing data from the HDFS directory and resubmit the jar, Spark still finds my last consumed offset and resumes from there. Where else does the offset live? I suspect it is related to the Kafka consumer ID. However, I am unable to set group.id with Spark Structured Streaming per the Spark docs, and it seems like all applications subscribing to the same topic get assigned to one consumer group. What if I would like to have two independent streaming jobs running that subscribe to the same topic?
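
For context on the last part of the question: Structured Streaming tracks consumed offsets in each query's checkpoint directory rather than committing them to a Kafka consumer group, so two jobs can subscribe to the same topic independently as long as they use separate checkpoint locations. A minimal sketch follows; the broker address, topic name, sink paths, and checkpoint paths are placeholders, not from the original post:

    // Minimal sketch: two independent queries on the same topic, each with its
    // own checkpoint directory. Broker, topic, and paths are illustrative only.
    import org.apache.spark.sql.{DataFrame, SparkSession}

    val spark = SparkSession.builder().appName("two-independent-readers").getOrCreate()

    def readTopic(): DataFrame = spark
      .readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")   // placeholder broker
      .option("subscribe", "my-topic")                     // placeholder topic
      .option("startingOffsets", "earliest")               // note the plural form
      .load()

    // Each query keeps its progress in its own checkpointLocation, so deleting
    // one directory rewinds only that job.
    val jobA = readTopic().writeStream
      .format("parquet")
      .option("path", "hdfs:///data/sink-a")               // placeholder sink path
      .option("checkpointLocation", "hdfs:///checkpoints/job-a")
      .start()

    val jobB = readTopic().writeStream
      .format("parquet")
      .option("path", "hdfs:///data/sink-b")
      .option("checkpointLocation", "hdfs:///checkpoints/job-b")
      .start()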

You have a typo :) It is startingOffsets.
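
For reference, a minimal corrected sketch reusing the identifiers from the question. The default for startingOffsets is "latest", so a misspelled option name is likely ignored and the query will not read from the beginning even after the checkpoint directory is deleted:

     val df = spark
       .readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", fromKafkaServers)
       .option("subscribe", topicName)
       .option("startingOffsets", "earliest")  // corrected: was "startingOffset"
       .load()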
