
Spark Structured Streaming with State (Pyspark)

I want to match data in Spark Structured Streaming based on a certain condition, and I want to write the matched data to Kafka. The unmatched data should be kept in a state, and this state should hold a maximum of 2 days of data in HDFS. Each new incoming record will try to match against the unmatched data in this state. How can I use such a state? (I'm using PySpark.)

PySpark doesn't support a stateful implementation by default.

Only the Scala/Java API has this option, via the mapGroupsWithState function on KeyValueGroupedDataset.

But you can store the 2 days of data somewhere else (a file system or a NoSQL database), and then for each new incoming record you can go to that store, fetch the corresponding data, and do the remaining matching there, as sketched below.
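
Here is a minimal sketch of that workaround in PySpark, using foreachBatch with Parquet files on HDFS as the external store. The paths, broker address, topic names, record schema, and the "same id seen at least twice" matching rule are all hypothetical placeholders, not part of the original answer; adapt them to the real matching condition.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("match-with-external-state").getOrCreate()

# Hypothetical HDFS location for the externally kept "state".
STATE_PATH = "hdfs:///tmp/unmatched_state"

# Hypothetical record layout; adapt to the real data.
schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
    StructField("event_time", TimestampType()),
])

def match_batch(batch_df, batch_id):
    # Load the previously unmatched rows (the externally kept state).
    try:
        state_df = spark.read.schema(schema).parquet(STATE_PATH)
    except Exception:  # first batch: no state written yet
        state_df = spark.createDataFrame([], schema)

    candidates = state_df.unionByName(batch_df)

    # Hypothetical matching rule: an "id" seen at least twice counts as a
    # match; replace this with the real business condition.
    counts = candidates.groupBy("id").agg(F.count(F.lit(1)).alias("n"))
    matched = candidates.join(counts.filter("n >= 2"), "id").drop("n")
    unmatched = candidates.join(counts.filter("n < 2"), "id").drop("n")

    # Write the matched rows to Kafka (broker and topic are placeholders).
    (matched
        .select(F.to_json(F.struct(*matched.columns)).alias("value"))
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("topic", "matched")
        .save())

    # Keep at most 2 days of unmatched rows as the new state.
    fresh = unmatched.filter(
        F.col("event_time") >= F.expr("current_timestamp() - INTERVAL 2 DAYS"))
    # localCheckpoint() materializes the rows and breaks lineage, so the
    # state path can be overwritten even though it was read above.
    fresh.localCheckpoint().write.mode("overwrite").parquet(STATE_PATH)

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "input")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .select("r.*"))

(stream.writeStream
    .option("checkpointLocation", "hdfs:///tmp/match_checkpoint")
    .foreachBatch(match_batch)
    .start()
    .awaitTermination())
```

Rewriting the full Parquet state on every micro-batch is the simplest approach but gets expensive as the state grows; a NoSQL store with point lookups by key would let each batch fetch and update only the relevant unmatched rows.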
