
Spark Structured Streaming Kafka Integration Offset management

The documentation says:

enable.auto.commit: Kafka source doesn't commit any offset.

Hence my question is: in the event of a worker or partition crash/restart,

  1. if startingOffsets is set to latest, how do we not lose messages?
  2. if startingOffsets is set to earliest, how do we not reprocess all messages?

This seems to be quite important. Any indication of how to deal with it?
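For context, a minimal sketch of the kind of source where this question arises (the broker address and topic name are placeholders, not from the original question):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("kafka-offsets-question")
      .getOrCreate()

    // Kafka source: startingOffsets controls where a *new* query begins reading.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder address
      .option("subscribe", "my-topic")                  // placeholder topic
      .option("startingOffsets", "latest")              // or "earliest"
      .load()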

I also ran into this issue.

You're right in your observations on the two options, i.e.

  • potential data loss if startingOffsets is set to latest
  • duplicate data if startingOffsets is set to earliest

However...

There is the option of checkpointing by adding the following option:

.writeStream
  .<something else>
  .option("checkpointLocation", "path/to/HDFS/dir")
  .<something else>

In the event of a failure, Spark will go through the contents of this checkpoint directory and recover the state, resuming from the last recorded offsets before accepting any new data. This is why startingOffsets only matters on the very first run of a query: on restart, the checkpointed offsets take precedence.
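Putting the pieces together, here is a minimal self-contained sketch (the broker address, topic, console sink, and checkpoint path are placeholders I chose for illustration, not from the original answer):

    import org.apache.spark.sql.SparkSession

    object CheckpointedKafkaStream {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("checkpointed-kafka-stream")
          .getOrCreate()

        // startingOffsets is only consulted on the very first run;
        // on restart, Spark resumes from the offsets recorded in the
        // checkpoint directory instead.
        val df = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092") // placeholder
          .option("subscribe", "my-topic")                  // placeholder
          .option("startingOffsets", "earliest")
          .load()

        val query = df.selectExpr("CAST(value AS STRING)")
          .writeStream
          .format("console")                                            // placeholder sink
          .option("checkpointLocation", "hdfs:///checkpoints/my-query") // placeholder path
          .start()

        query.awaitTermination()
      }
    }

Note that the console sink here is for illustration only; end-to-end exactly-once delivery also requires a fault-tolerant (idempotent or transactional) sink.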

I found a useful reference on this topic as well.

Hope this helps!
