How to return the latest rows per group in PySpark Structured Streaming
I have a stream which I read in pyspark using `spark.readStream.format('delta')`. The data consists of multiple columns, including a `type`, `date` and `value` column.
Example DataFrame:

| type | date       | value |
|------|------------|-------|
| 1    | 2020-01-21 | 6     |
| 1    | 2020-01-16 | 5     |
| 2    | 2020-01-20 | 8     |
| 2    | 2020-01-15 | 4     |
I would like to create a DataFrame that keeps track of the latest state per `type`. One of the easiest methods when working on static (batch) data is to use window functions, but using windows on non-timestamp columns is not supported. Another option would look like

stream.groupby('type').agg(last('date'), last('value')).writeStream

but I think Spark cannot guarantee the ordering here, and using `orderBy` before the aggregation is also not supported in structured streaming.

Do you have any suggestions on how to approach this challenge?
Simply use the `to_timestamp()` function (which can be imported via `from pyspark.sql.functions import *`) on the date column, so that you can use the window function. For example:

from pyspark.sql.functions import *

df = spark.createDataFrame(
    data=[("1", "2020-01-21")],
    schema=["id", "input_timestamp"])
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
df.show(truncate=False)
+---+---------------+-------------------+
|id |input_timestamp|timestamp          |
+---+---------------+-------------------+
|1  |2020-01-21     |2020-01-21 00:00:00|
+---+---------------+-------------------+
"but using windows on non-timestamp columns is not supported": are you saying this from a streaming point of view? Because I am able to do the same.
Here is the solution to your problem:

from pyspark.sql import functions as F
from pyspark.sql.functions import rank
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("type").orderBy("date")
df1 = df.withColumn("rank", rank().over(windowSpec))
df1.show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-16| 5| 1|
| 1|2020-01-21| 6| 2|
| 2|2020-01-15| 4| 1|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+
w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()
+----+----------+-----+----+
|type| date|value|rank|
+----+----------+-----+----+
| 1|2020-01-21| 6| 2|
| 2|2020-01-20| 8| 2|
+----+----------+-----+----+