
How to return the latest rows per group in pyspark structured streaming

I have a stream which I read in pyspark using spark.readStream.format('delta'). The data consists of multiple columns, including a type, a date, and a value column.

Example DataFrame:

type  date        value
1     2020-01-21  6
1     2020-01-16  5
2     2020-01-20  8
2     2020-01-15  4

I would like to create a DataFrame that keeps track of the latest state per type. One of the easiest methods when working on static (batch) data is to use windows, but using windows on non-timestamp columns is not supported. Another option would look like

stream.groupby('type').agg(last('date'), last('value')).writeStream

but I think Spark cannot guarantee the ordering here, and using orderBy before the aggregation is also not supported in structured streaming.

Do you have any suggestions on how to approach this challenge?
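One possible streaming-safe direction, sketched here as an untested assumption rather than a definitive answer, is to fold the ordering into the aggregate itself: Spark compares structs field by field, so taking the max of a struct whose first field is date keeps the row with the latest date per group.

from pyspark.sql import functions as F

# Sketch: max over a struct compares fields left to right, so this keeps
# the (date, value) pair with the newest date for each type.
latest = (stream
          .groupBy('type')
          .agg(F.max(F.struct('date', 'value')).alias('latest'))
          .select('type', 'latest.date', 'latest.value'))

# Unbounded aggregation, so 'complete' output mode is assumed here.
query = latest.writeStream.outputMode('complete').format('console').start()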

Simply use the to_timestamp() function, which can be imported via from pyspark.sql.functions import *, on the date column so that you can use the window function. E.g.

from pyspark.sql.functions import *

df = spark.createDataFrame(
        data=[("1", "2020-01-21")],
        schema=["id", "input_timestamp"])
# Convert the string to a proper timestamp column
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
df.show(truncate=False)

+---+---------------+-------------------+
|id |input_timestamp|timestamp          |
+---+---------------+-------------------+
|1  |2020-01-21     |2020-01-21 00:00:00|
+---+---------------+-------------------+
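With the string converted to a real timestamp, the column can feed Spark's time-based window function; a minimal sketch on the same df, where the 1-day window duration is an illustrative choice:

from pyspark.sql.functions import window

# Tumbling 1-day event-time windows over the converted timestamp column
df.groupBy('id', window('timestamp', '1 day')).count().show(truncate=False)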

"but using windows on non-timestamp columns is not supported" are you saying this from stream point of view, because same i am able to do. “但不支持在非时间戳列上使用 windows”你是从 stream 的角度这么说的,因为我也能做到。

Here is the solution to your problem.

from pyspark.sql import functions as F
from pyspark.sql.window import Window

windowSpec = Window.partitionBy("type").orderBy("date")
df1 = df.withColumn("rank", F.rank().over(windowSpec))
df1.show()

+----+----------+-----+----+
|type|      date|value|rank|
+----+----------+-----+----+
|   1|2020-01-16|    5|   1|
|   1|2020-01-21|    6|   2|
|   2|2020-01-15|    4|   1|
|   2|2020-01-20|    8|   2|
+----+----------+-----+----+

# Keep only the highest rank (latest date) per type
w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()

+----+----------+-----+----+
|type|      date|value|rank|
+----+----------+-----+----+
|   1|2020-01-21|    6|   2|
|   2|2020-01-20|    8|   2|
+----+----------+-----+----+
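A variant of the same idea collapses the two window passes into one by ranking in descending date order and keeping only the first row per group; note that, as the question points out, such windows are only supported on static/batch data:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank newest-first within each type, then keep the top row per group
w_desc = Window.partitionBy('type').orderBy(F.col('date').desc())
df.withColumn('rn', F.row_number().over(w_desc)) \
  .where(F.col('rn') == 1) \
  .drop('rn') \
  .show()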
