
How to return the latest rows per group in pyspark structured streaming

I have a stream which I read in pyspark using spark.readStream.format('delta'). The data consists of multiple columns, including a type, a date, and a value column.

Example DataFrame:

type  date        value
1     2020-01-21  6
1     2020-01-16  5
2     2020-01-20  8
2     2020-01-15  4

I would like to create a DataFrame that keeps track of the latest state per type. One of the easiest methods when working on static (batch) data is to use windows, but using windows on non-timestamp columns is not supported. Another option would look like

stream.groupby('type').agg(last('date'), last('value')).writeStream

but I think Spark cannot guarantee the ordering here, and using orderBy before the aggregation is also not supported in structured streaming.

Do you have any suggestions on how to approach this challenge?
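One possible streaming-safe direction, sketched here as an untested assumption rather than a definitive answer, is to fold the ordering into the aggregate itself: Spark compares structs field by field, so taking the max of a struct whose first field is date keeps the row with the latest date per group.

from pyspark.sql import functions as F

# Sketch: max over a struct compares fields left to right, so this keeps
# the (date, value) pair with the newest date for each type.
latest = (stream
          .groupBy('type')
          .agg(F.max(F.struct('date', 'value')).alias('latest'))
          .select('type', 'latest.date', 'latest.value'))

# Unbounded aggregation, so 'complete' output mode is assumed here.
query = latest.writeStream.outputMode('complete').format('console').start()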

Simply use the to_timestamp() function, which can be imported via from pyspark.sql.functions import *, on the date column so that you can use the window function. E.g.

from pyspark.sql.functions import *

df = spark.createDataFrame(
        data=[("1", "2020-01-21")],
        schema=["id", "input_timestamp"])
# Convert the string to a proper timestamp column
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
df.show(truncate=False)

+---+---------------+-------------------+
|id |input_timestamp|timestamp          |
+---+---------------+-------------------+
|1  |2020-01-21     |2020-01-21 00:00:00|
+---+---------------+-------------------+
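With the string converted to a real timestamp, the column can feed Spark's time-based window function; a minimal sketch on the same df, where the 1-day window duration is an illustrative choice:

from pyspark.sql.functions import window

# Tumbling 1-day event-time windows over the converted timestamp column
df.groupBy('id', window('timestamp', '1 day')).count().show(truncate=False)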

"but using windows on non-timestamp columns is not supported" are you saying this from stream point of view, because same i am able to do. “但不支持在非时间戳列上使用 windows”你是从 stream 的角度这么说的,因为我也能做到。

Here is the solution to your problem.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("type").orderBy("date")
df1 = df.withColumn("rank", F.rank().over(windowSpec))
df1.show()

+----+----------+-----+----+
|type|      date|value|rank|
+----+----------+-----+----+
|   1|2020-01-16|    5|   1|
|   1|2020-01-21|    6|   2|
|   2|2020-01-15|    4|   1|
|   2|2020-01-20|    8|   2|
+----+----------+-----+----+

# Keep only the highest rank (latest date) per type
w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()

+----+----------+-----+----+
|type|      date|value|rank|
+----+----------+-----+----+
|   1|2020-01-21|    6|   2|
|   2|2020-01-20|    8|   2|
+----+----------+-----+----+
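A variant of the same idea collapses the two window passes into one by ranking in descending date order and keeping only the first row per group; note that, as the question points out, such windows are only supported on static/batch data:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank newest-first within each type, then keep the top row per group
w_desc = Window.partitionBy('type').orderBy(F.col('date').desc())
df.withColumn('rn', F.row_number().over(w_desc)) \
  .where(F.col('rn') == 1) \
  .drop('rn') \
  .show()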
