
Spark Streaming: maintaining Kafka offsets periodically while processing

In Spark Streaming's direct approach (no receiver) for Kafka, there is a way to find the offset ranges covered by each batch. However, I would like to maintain offsets periodically so that, if needed, I can reprocess items from a given offset. Is there any way to retrieve the offset of a message in the RDD while I am processing each message? For example, with offsetRanges I have the start and end offsets for the RDD, but suppose the system encounters an error while processing a record and the job ends. If I then want to resume from the record that failed, how do I save the last successfully processed offset so that I can start from it next time?
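For context, one way to get at per-message offsets with the 0.8 direct API is the `KafkaUtils.createDirectStream` overload that takes explicit starting offsets plus a `messageHandler`; the `MessageAndMetadata` passed to that handler carries each record's offset. Below is a minimal Scala sketch along those lines. The broker address, topic, and starting offsets are assumptions, the persistence step is only a placeholder, and a real job would usually persist offsets once per partition or batch rather than after every record.

```scala
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object PerMessageOffsets {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("per-message-offsets"), Seconds(10))

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // assumed broker
    // Start from the last offsets you saved; hard-coded here for illustration.
    val fromOffsets = Map(TopicAndPartition("my-topic", 0) -> 0L)

    // The messageHandler sees MessageAndMetadata, which includes the offset.
    val messageHandler =
      (mmd: MessageAndMetadata[String, String]) => (mmd.offset, mmd.message)

    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder, (Long, String)](
      ssc, kafkaParams, fromOffsets, messageHandler)

    stream.foreachRDD { rdd =>
      rdd.foreachPartition { records =>
        records.foreach { case (offset, msg) =>
          // ... process msg here ...
          // After success, persist `offset` to a durable store
          // (ZooKeeper, a database, ...) so a restart can resume from it.
          println(s"processed message at offset $offset")
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```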

Spark 1.3 introduced a new direct approach (no receiver) that hides this low-level complexity behind the scenes. In case of a failure, and given sufficient Kafka retention, messages can be recovered from Kafka automatically after a restart.
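A minimal sketch of that approach (Scala, spark-streaming-kafka 0.8 API) might look like the following; the broker address, topic, and checkpoint directory are assumed placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils}

object DirectStreamExample {
  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("direct-kafka"), Seconds(10))
    ssc.checkpoint("hdfs:///tmp/kafka-checkpoint") // assumed path

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092") // assumed broker
    val stream = KafkaUtils.createDirectStream[
      String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("my-topic")) // assumed topic

    stream.foreachRDD { rdd =>
      // Each batch RDD carries the exact Kafka offset range it covers.
      val ranges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      ranges.foreach { r =>
        println(s"${r.topic}/${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
      }
      rdd.map(_._2).count() // stand-in for real processing of the message values
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // getOrCreate restores the context (and unfinished offset ranges) from the
    // checkpoint after a restart, which is what enables automatic recovery.
    val ssc = StreamingContext.getOrCreate(
      "hdfs:///tmp/kafka-checkpoint", createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that checkpoint-based recovery only works while the Kafka retention window still contains the unprocessed offsets, which is the caveat mentioned above.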
