
Spark Structured Streaming with State (Pyspark)

I want to match data in Spark Structured Streaming based on a certain condition, and I want to write the matched data to Kafka. The unmatched data should be kept in a state, and this state should hold a maximum of 2 days of data in HDFS. Each new incoming record will try to match against the unmatched data in this state. How can I use such a state? (I'm using PySpark.)

PySpark doesn't support a stateful implementation by default.

Only the Scala/Java API has this option, via the mapGroupsWithState function on KeyValueGroupedDataset.

But you can store the 2 days of data somewhere else (a file system or a NoSQL database), and then for each new incoming record you can go to that store, fetch the corresponding data, and do the remaining matching there, as sketched below.
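
Here is a minimal sketch of that workaround in PySpark, using foreachBatch with Parquet files on HDFS as the external store. The paths, broker address, topic names, record schema, and the "same id seen at least twice" matching rule are all hypothetical placeholders, not part of the original answer; adapt them to the real matching condition.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("match-with-external-state").getOrCreate()

# Hypothetical HDFS location for the externally kept "state".
STATE_PATH = "hdfs:///tmp/unmatched_state"

# Hypothetical record layout; adapt to the real data.
schema = StructType([
    StructField("id", StringType()),
    StructField("payload", StringType()),
    StructField("event_time", TimestampType()),
])

def match_batch(batch_df, batch_id):
    # Load the previously unmatched rows (the externally kept state).
    try:
        state_df = spark.read.schema(schema).parquet(STATE_PATH)
    except Exception:  # first batch: no state written yet
        state_df = spark.createDataFrame([], schema)

    candidates = state_df.unionByName(batch_df)

    # Hypothetical matching rule: an "id" seen at least twice counts as a
    # match; replace this with the real business condition.
    counts = candidates.groupBy("id").agg(F.count(F.lit(1)).alias("n"))
    matched = candidates.join(counts.filter("n >= 2"), "id").drop("n")
    unmatched = candidates.join(counts.filter("n < 2"), "id").drop("n")

    # Write the matched rows to Kafka (broker and topic are placeholders).
    (matched
        .select(F.to_json(F.struct(*matched.columns)).alias("value"))
        .write.format("kafka")
        .option("kafka.bootstrap.servers", "broker:9092")
        .option("topic", "matched")
        .save())

    # Keep at most 2 days of unmatched rows as the new state.
    fresh = unmatched.filter(
        F.col("event_time") >= F.expr("current_timestamp() - INTERVAL 2 DAYS"))
    # localCheckpoint() materializes the rows and breaks lineage, so the
    # state path can be overwritten even though it was read above.
    fresh.localCheckpoint().write.mode("overwrite").parquet(STATE_PATH)

stream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "input")
    .load()
    .select(F.from_json(F.col("value").cast("string"), schema).alias("r"))
    .select("r.*"))

(stream.writeStream
    .option("checkpointLocation", "hdfs:///tmp/match_checkpoint")
    .foreachBatch(match_batch)
    .start()
    .awaitTermination())
```

Rewriting the full Parquet state on every micro-batch is the simplest approach but gets expensive as the state grows; a NoSQL store with point lookups by key would let each batch fetch and update only the relevant unmatched rows.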
