
Handling data - Spark Structured Streaming

As far as I know, Spark Structured Streaming achieves fault tolerance by using checkpoints.

I want to read from Kafka.

So let's say that I use checkpoints, and then for some reason my code crashes or I stop it. I expect that when I rerun the code, it will recover the already-processed data.

My problem is with the read configuration: if I set the offset to earliest, then after rerunning the code I will read the same data again, and if I set it to latest, I won't read the data produced between the crash and the rerun.

Is there a way to read only unread messages from Kafka with Spark 2.3 Structured Streaming (PySpark), and to recover processed data from checkpoints?
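For reference, a minimal sketch of the read configuration in question; the topic name `events` and the broker address `localhost:9092` are hypothetical placeholders:

```python
from pyspark.sql import SparkSession

# Requires the spark-sql-kafka package on the classpath at submit time.
spark = SparkSession.builder.appName("kafka-offsets-demo").getOrCreate()

df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # hypothetical broker
      .option("subscribe", "events")                        # hypothetical topic
      # "earliest" rereads everything on a fresh start;
      # "latest" skips anything produced while the query was down.
      .option("startingOffsets", "latest")
      .load())
```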

It depends on where your code crashes. You don't need to set the offset to earliest; you can set it to latest. You can always recover from the checkpoint and reprocess the data; the checkpointing semantics guarantee this.
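To make this concrete, a minimal sketch, assuming the stream `df` from the question above; the sink format, output path, and checkpoint directory are hypothetical. The key point is that `startingOffsets` only applies to the very first run of a query; once a checkpoint exists, Spark resumes from the offsets recorded there:

```python
# With a checkpointLocation set, Spark records the Kafka offsets of each
# committed micro-batch. On restart with the same checkpoint directory,
# it ignores startingOffsets and resumes from the committed offsets, so
# data produced between a crash and a restart is not skipped, and
# already-committed batches are not reread.
query = (df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
         .writeStream
         .format("parquet")                                   # hypothetical sink
         .option("path", "/tmp/kafka-sink")                   # hypothetical path
         .option("checkpointLocation", "/tmp/kafka-checkpoint")
         .start())

query.awaitTermination()
```

In other words, reusing the same checkpoint location across restarts is exactly the "read only unread messages" behavior the question asks for.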
