
Handling data - Spark structured streaming

As far as I know, Spark Structured Streaming achieves fault tolerance by using checkpoints.

I want to read from Kafka.

So let's say that I use a checkpoint, and then for some reason my code crashes or I stop it; I expect that when I rerun the code it will recover the already-processed data.

My problem is the reading configuration: if I set the starting offset to earliest, then after rerunning the code I will read the same data again, and if I set it to latest, I will miss the data produced between the crash and the rerun.

Is there a way to read only unread messages from Kafka with Spark 2.3 Structured Streaming (PySpark), and to recover processed data from checkpoints?
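For reference, this is roughly the kind of reading configuration the question is about (a minimal sketch; the broker address and topic name are placeholders, and the spark-sql-kafka-0-10 package must be available at submit time):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-structured-streaming").getOrCreate()

# Kafka source; broker address and topic name are placeholders.
df = (spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
      .option("subscribe", "my_topic")                       # placeholder topic
      .option("startingOffsets", "latest")                   # or "earliest"
      .load())

# Kafka delivers key/value as binary, so cast the payload to a string.
values = df.selectExpr("CAST(value AS STRING) AS value")
```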

It depends on where your code crashes. You don't need to set it to earliest; you can set it to latest. You can always recover from checkpointing and reprocess the data; see the semantics of checkpointing in the Spark Structured Streaming documentation.
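In other words, startingOffsets is only consulted the first time a query is started; once a checkpoint exists, a restarted query resumes from the offsets recorded there, so messages produced while the job was down are read without re-reading old data. A minimal PySpark sketch (paths, topic, and broker address are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-checkpoint-demo").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")  # placeholder broker
          .option("subscribe", "my_topic")                       # placeholder topic
          # Consulted only on the very first run of the query;
          # once a checkpoint exists, the stored offsets take precedence.
          .option("startingOffsets", "latest")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

query = (stream.writeStream
         .format("parquet")                                   # placeholder sink
         .option("path", "/tmp/output")                       # placeholder output path
         .option("checkpointLocation", "/tmp/checkpoints")    # offsets and state are stored here
         .outputMode("append")
         .start())

query.awaitTermination()
```

As long as the checkpointLocation stays the same between runs, rerunning this job continues from where the previous run left off.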
