Spark Structured Streaming with State (Pyspark)
I want to match data in Spark Structured Streaming based on a certain condition, and I want to write the matched data to Kafka. Unmatched records should be kept in state, and this state should retain at most 2 days of data in HDFS. Each new incoming record should try to match against the unmatched data held in this state. How can I use this state? (I'm using PySpark.)
PySpark doesn't support a stateful implementation by default. Only the Scala/Java API has this option, via the mapGroupsWithState function on KeyValueGroupedDataset.
But you can store the 2 days of unmatched data somewhere else (a file system or a NoSQL database), and then for each new incoming record you can go to the NoSQL database, fetch the corresponding data, and do the remaining processing.
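The lookup-and-expire logic behind that workaround can be sketched in plain Python. This is a minimal simulation only, under assumed names: `UnmatchedStore` is a hypothetical in-memory stand-in for the external NoSQL store, matching is by an exact key, and "writing to Kafka" is replaced by simply returning the matched pair. In a real job you would run this per micro-batch (e.g. inside foreachBatch) against the actual store.

```python
from datetime import datetime, timedelta

RETENTION = timedelta(days=2)  # keep unmatched records for at most 2 days


class UnmatchedStore:
    """Stand-in for an external key-value store holding unmatched records."""

    def __init__(self):
        # key -> (record, arrival_time)
        self._store = {}

    def process(self, key, record, now):
        """Try to match an incoming record against stored unmatched data.

        Returns the matched pair (old_record, new_record) if a counterpart
        was waiting; otherwise stores the record as unmatched and returns None.
        """
        self.expire(now)
        if key in self._store:
            counterpart, _ = self._store.pop(key)
            return (counterpart, record)  # matched: would be written to Kafka
        self._store[key] = (record, now)  # no match yet: keep as state
        return None

    def expire(self, now):
        """Drop unmatched records older than the 2-day retention window."""
        self._store = {
            k: (rec, t)
            for k, (rec, t) in self._store.items()
            if now - t <= RETENTION
        }


# Usage sketch: the second record with the same key completes the match;
# a record whose counterpart arrived more than 2 days earlier finds nothing.
store = UnmatchedStore()
t0 = datetime(2024, 1, 1)
store.process("order-1", {"side": "buy"}, t0)                      # stored, None
pair = store.process("order-1", {"side": "sell"}, t0 + timedelta(hours=1))
# pair is ({"side": "buy"}, {"side": "sell"})
```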