
Handling data - Spark structured streaming

As far as I know, Spark Structured Streaming achieves fault tolerance by using checkpoints.

I want to read from Kafka.

So let's say that I use a checkpoint, and then for some reason my code crashes or I stop it; I expect that when I rerun the code it will recover the already-processed data.

My problem is the reading configuration: if I set the starting offset to earliest, then after rerunning the code I will read the same data again, and if I set it to latest, I will miss the data produced between the crash and the rerun.

Is there a way to read only unread messages from Kafka with Spark 2.3 Structured Streaming (PySpark), and to recover processed data from checkpoints?
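For reference, this is roughly the kind of reading configuration the question is about (a minimal sketch; the broker address and topic name are placeholders, and the spark-sql-kafka-0-10 package must be available at submit time):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Kafka source; broker address and topic name are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
      .option("subscribe", "my_topic")                       # placeholder topic
      .option("startingOffsets", "latest")                   # or "earliest"
      .load())

# Kafka delivers key/value as binary, so cast the payload to a string.
values = df.selectExpr("CAST(value AS STRING) AS value")
```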

It depends on where your code crashes. You don't need to set it to earliest; you can set it to latest. You can always recover from checkpointing and reprocess the data; see the semantics of checkpointing in the Spark Structured Streaming documentation.
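In other words, startingOffsets is only consulted the first time a query is started; once a checkpoint exists, a restarted query resumes from the offsets recorded there, so messages produced while the job was down are read without re-reading old data. A minimal PySpark sketch (paths, topic, and broker address are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-checkpoint-demo").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "my_topic")                       # placeholder topic
          # Consulted only on the very first run of the query;
          # once a checkpoint exists, the stored offsets take precedence.
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

query = (stream.writeStream
         .format("parquet")                                   # placeholder sink
         .option("path", "/tmp/output")                       # placeholder output path
         .option("checkpointLocation", "/tmp/checkpoints")    # offsets and state are stored here
         .outputMode("append")
         .start())

query.awaitTermination()
```

As long as the checkpointLocation stays the same between runs, rerunning this job continues from where the previous run left off.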
