
Handling data - Spark Structured Streaming

As far as I know, Spark Structured Streaming achieves fault tolerance by using checkpoints.

I want to read from Kafka.

So let's say that I use checkpoints, and then for some reason my code crashes or I stop it. I expect that when I rerun the code, it will recover the already-processed data.

My problem is with the read configuration: if I set the offset to earliest, then after rerunning the code I will read the same data again, and if I set it to latest, I won't read the data produced between the crash and the rerun.

Is there a way to read only unread messages from Kafka with Spark 2.3 Structured Streaming (PySpark), and to recover processed data from checkpoints?
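For reference, a minimal sketch of the read configuration in question; the topic name `events` and the broker address `localhost:9092` are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Requires the spark-sql-kafka package on the classpath at submit time.
spark = SparkSession.builder.appName("kafka-offsets-demo").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
      .option("subscribe", "events")                        # hypothetical topic
      # "earliest" rereads everything on a fresh start;
      # "latest" skips anything produced while the query was down.
      .option("startingOffsets", "latest")
      .load())
```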

It depends on where your code crashes. You don't need to set the offset to earliest; you can set it to latest. You can always recover from the checkpoint and reprocess the data; the checkpointing semantics guarantee this.
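To make this concrete, a minimal sketch, assuming the stream `df` from the question above; the sink format, output path, and checkpoint directory are hypothetical. The key point is that `startingOffsets` only applies to the very first run of a query; once a checkpoint exists, Spark resumes from the offsets recorded there:

```python
# With a checkpointLocation set, Spark records the Kafka offsets of each
# committed micro-batch. On restart with the same checkpoint directory,
# it ignores startingOffsets and resumes from the committed offsets, so
# data produced between a crash and a restart is not skipped, and
# already-committed batches are not reread.
query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("parquet")                                   # hypothetical sink
         .option("path", "/tmp/kafka-sink")                   # hypothetical path
         .option("checkpointLocation", "/tmp/kafka-checkpoint")
         .start())

query.awaitTermination()
```

In other words, reusing the same checkpoint location across restarts is exactly the "read only unread messages" behavior the question asks for.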
