
Spark Structured Streaming Kafka Integration Offset management

The documentation says:

enable.auto.commit: Kafka source doesn't commit any offset.

Hence my question is: in the event of a worker or partition crash/restart,

  1. if startingOffsets is set to latest, how do we not lose messages?
  2. if startingOffsets is set to earliest, how do we not reprocess all messages?

This seems to be quite important. Any indication of how to deal with it?
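For context, a minimal sketch of the kind of source where this question arises (the broker address and topic name are placeholders, not from the original question):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kafka-offsets-question")
      .getOrCreate()

    // Kafka source: startingOffsets controls where a *new* query begins reading.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder address
      .option("subscribe", "my-topic")                  // placeholder topic
      .option("startingOffsets", "latest")              // or "earliest"
      .load()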

I also ran into this issue.

You're right in your observations on the two options, i.e.

  • potential data loss if startingOffsets is set to latest
  • duplicate data if startingOffsets is set to earliest

However...

There is the option of checkpointing by adding the following option:

.writeStream
  .<something else>
  .option("checkpointLocation", "path/to/HDFS/dir")
  .<something else>

In the event of a failure, Spark will go through the contents of this checkpoint directory and recover the state, resuming from the last recorded offsets before accepting any new data. This is why startingOffsets only matters on the very first run of a query: on restart, the checkpointed offsets take precedence.
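Putting the pieces together, here is a minimal self-contained sketch (the broker address, topic, console sink, and checkpoint path are placeholders I chose for illustration, not from the original answer):

    import org.apache.spark.sql.SparkSession

    object CheckpointedKafkaStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("checkpointed-kafka-stream")
          .getOrCreate()

        // startingOffsets is only consulted on the very first run;
        // on restart, Spark resumes from the offsets recorded in the
        // checkpoint directory instead.
        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092") // placeholder
          .option("subscribe", "my-topic")                  // placeholder
          .option("startingOffsets", "earliest")
          .load()

        val query = df.selectExpr("CAST(value AS STRING)")
          .writeStream
          .format("console")                                            // placeholder sink
          .option("checkpointLocation", "hdfs:///checkpoints/my-query") // placeholder path
          .start()

        query.awaitTermination()
      }
    }

Note that the console sink here is for illustration only; end-to-end exactly-once delivery also requires a fault-tolerant (idempotent or transactional) sink.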

I found a useful reference on this topic as well.

Hope this helps!
