How to return the latest row per group in PySpark Structured Streaming

Question

I have a stream that I read in PySpark using spark.readStream.format('delta'). The data consists of multiple columns, including type, date, and value columns.

Example DataFrame:

type	date	value
1	2020-01-21	6
1	2020-01-16	5
2	2020-01-20	8
2	2020-01-15	4

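I would like to create a DataFrame that keeps track of the latest state per type. One of the easiest methods when working on static (batch) data is to use windows, but using windows on non-timestamp columns is not supported. Another option would look like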

stream.groupby('type').agg(last('date'), last('value')).writeStream

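But I think Spark cannot guarantee the ordering here, and using orderBy before the aggregation is also not supported in Structured Streaming.

Do you have any suggestions on how to approach this challenge?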

Answer 1

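Simply apply the to_timestamp() function, which can be imported from pyspark.sql.functions, to the date column so that you can use the window function. For example:

from pyspark.sql.functions import col, to_timestamp

df = spark.createDataFrame(
    data=[("1", "2020-01-21")],
    schema=["id", "input_timestamp"])

# Parse the date string into a TimestampType column
df = df.withColumn("timestamp", to_timestamp(col("input_timestamp")))
df.show(truncate=False)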

+---+---------------+-------------------+
|id |input_timestamp|timestamp          |
+---+---------------+-------------------+
|1  |2020-01-21     |2020-01-21 00:00:00|
+---+---------------+-------------------+

Answer 2

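"but using windows on non-timestamp columns is not supported": are you saying this from a streaming point of view? I ask because I am able to do it.

Here is a solution to your problem.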

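from pyspark.sql import Window
from pyspark.sql.functions import rank
# Rank rows within each type by ascending date, so the latest row gets the highest rank
windowSpec = Window.partitionBy("type").orderBy("date")
df1 = df.withColumn("rank", rank().over(windowSpec))
df1.show()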

+----+----------+-----+----+
|type|      date|value|rank|
+----+----------+-----+----+
|   1|2020-01-16|    5|   1|
|   1|2020-01-21|    6|   2|
|   2|2020-01-15|    4|   1|
|   2|2020-01-20|    8|   2|
+----+----------+-----+----+

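from pyspark.sql import Window, functions as F
# Keep only the rows whose rank equals the maximum rank within their type
w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()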

+----+----------+-----+----+
|type|      date|value|rank|
+----+----------+-----+----+
|   1|2020-01-21|    6|   2|
|   2|2020-01-20|    8|   2|
+----+----------+-----+----+
