PySpark Structured Streaming: applying a udf to a window

Question

I am trying to apply a pandas udf to a window of a PySpark structured stream. The problem is that as soon as the stream has caught up with the current state, every new window contains only a single value.

As you can see in the screenshot, all windows after 2019-10-22T15:34:08.730+0000 contain only one value. The code used to generate this is:

import pandas as pd
from pyspark.sql.functions import col, pandas_udf, PandasUDFType, window

@pandas_udf("Count long, Resampled long, Start timestamp, End timestamp", PandasUDFType.GROUPED_MAP)
def myudf(df):
  df = df.dropna()
  df = df.set_index("Timestamp")
  df.sort_index(inplace=True)

  # resample the dataframe
  resampled = pd.DataFrame()
  oidx = df.index
  nidx = pd.date_range(oidx.min(), oidx.max(), freq="30S")
  resampled["Value"] = df.Value.reindex(oidx.union(nidx)).interpolate('index').reindex(nidx)
  return pd.DataFrame([[len(df.index), len(resampled.index), df.index.min(), df.index.max()]], columns=["Count", "Resampled", "Start", "End"])

predictionStream = sensorStream.withWatermark("Timestamp", "90 minutes").groupBy(col("Name"), window(col("Timestamp"), "70 minutes", "5 minutes"))

predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .start()
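The resampling logic of myudf can be checked outside Spark on a plain pandas DataFrame. The following is a minimal sketch under stated assumptions: `summarize_window` and the sample readings are hypothetical names and data introduced for illustration, and the frequency is written with the lowercase alias `30s`, which recent pandas versions prefer over `30S`.

```python
import pandas as pd


def summarize_window(df: pd.DataFrame) -> pd.DataFrame:
    """Same steps as myudf, without the Spark decorator (illustrative helper)."""
    df = df.dropna().set_index("Timestamp").sort_index()

    # Resample onto a regular 30-second grid between the first and last reading.
    oidx = df.index
    nidx = pd.date_range(oidx.min(), oidx.max(), freq="30s")
    resampled = (
        df["Value"]
        .reindex(oidx.union(nidx))    # original points plus grid points
        .interpolate(method="index")  # interpolate using the timestamps
        .reindex(nidx)                # keep only the grid points
    )

    return pd.DataFrame(
        [[len(df.index), len(resampled.index), oidx.min(), oidx.max()]],
        columns=["Count", "Resampled", "Start", "End"],
    )


# Hypothetical sample data: three readings spread over two minutes.
sample = pd.DataFrame({
    "Timestamp": pd.to_datetime([
        "2019-10-22 15:00:00",
        "2019-10-22 15:00:45",
        "2019-10-22 15:02:00",
    ]),
    "Value": [1.0, 2.0, 3.0],
})

summary = summarize_window(sample)          # full group
single = summarize_window(sample.iloc[[0]])  # group with a single reading
print(summary)
print(single)
```

A group that holds only one reading produces Count and Resampled values of 1, which matches the symptom described for the later windows.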

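The stream does receive new values every 5 minutes. Yet the window somehow only picks up the values from the last batch, even though the watermark should not have expired.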
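To make the expected window contents concrete: with a 70-minute window that slides every 5 minutes, each event belongs to 14 overlapping windows. The plain-Python sketch below lists them for one timestamp. It assumes half-open windows whose starts are aligned to multiples of the slide counted from the Unix epoch, which is the documented behaviour of `window` when no start offset is given; `sliding_windows` is a hypothetical helper, not part of Spark.

```python
from datetime import datetime, timedelta, timezone


def sliding_windows(ts, window=timedelta(minutes=70), slide=timedelta(minutes=5)):
    """Return (start, end) for every half-open window [start, end) containing ts."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)

    # Latest window start that is <= ts, aligned to the slide interval.
    start = ts - ((ts - epoch) % slide)

    windows = []
    # Walk backwards one slide at a time while the window still covers ts.
    while start + window > ts:
        windows.append((start, start + window))
        start -= slide
    return sorted(windows)


event = datetime(2019, 10, 22, 15, 34, 8, tzinfo=timezone.utc)
windows = sliding_windows(event)
for start, end in windows:
    print(start.time(), "to", end.time())
```

Every one of these windows should therefore eventually see all readings that fall inside its 70-minute range, not only the most recent one.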

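Am I doing something wrong? I have already tried changing the watermark; it had no effect on the result. I need all of the window's values inside the udf.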

I am running this in Databricks on a cluster set to 5.5 LTS ML (includes Apache Spark 2.4.3, Scala 2.11).

Answer 1

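It looks like you can specify the output mode you want on your writeStream.

See the documentation on Output Modes.

By default it uses Append mode:

This is the default mode, where only the new rows added to the Result Table since the last trigger will be outputted to the sink.

Try using: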

# In PySpark, outputMode takes the mode name as a string.
predictionStream.apply(myudf).writeStream \
    .queryName("aggregates") \
    .format("memory") \
    .outputMode("complete") \
    .start()
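The difference between the Append and Complete modes quoted above can be pictured with a toy model in plain Python. This is only a sketch of the documented descriptions, not Spark code: `emit`, the window keys `w1` to `w3`, and the counts are all hypothetical.

```python
def emit(result_table, previous_keys, mode):
    """Toy model of what a sink receives after one trigger.

    result_table: dict mapping a window key to its aggregated value.
    previous_keys: keys that were already in the table at the last trigger.
    """
    if mode == "append":
        # Only rows that are new since the last trigger.
        return {k: v for k, v in result_table.items() if k not in previous_keys}
    if mode == "complete":
        # The whole result table, every time.
        return dict(result_table)
    raise ValueError(f"unsupported mode: {mode}")


table = {}
seen = set()
append_out, complete_out = [], []

# Three hypothetical triggers, each adding one new window to the result table.
for batch in [{"w1": 3}, {"w2": 1}, {"w3": 1}]:
    table.update(batch)
    append_out.append(emit(table, seen, "append"))
    complete_out.append(emit(table, seen, "complete"))
    seen = set(table)

print(append_out)
print(complete_out[-1])
```

In this model the Append sink sees one new row per trigger, while the Complete sink sees the full table again after every trigger.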
