
How to return the latest rows per group in pyspark structured streaming

I have a stream which I read in pyspark using spark.readStream.format('delta'). The data consists of multiple columns, including a type, date and value column.

Example DataFrame:

type  date        value
1     2020-01-21  6
1     2020-01-16  5
2     2020-01-20  8
2     2020-01-15  4

I would like to create a DataFrame that keeps track of the latest state per type. One of the easiest approaches when working on static (batch) data is to use windows, but using windows on non-timestamp columns is not supported. Another option would look like

stream.groupby('type').agg(last('date'), last('value')).writeStream

but I think Spark cannot guarantee the ordering here, and using orderBy before the aggregations is also not supported in structured streaming.

Do you have any suggestions on how to approach this challenge?

Simply apply the to_timestamp() function, which can be imported with from pyspark.sql.functions import *, to the date column so that you can use the window function. For example:

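from pyspark.sql.functions import *

df = spark.createDataFrame(
    data=[("1", "2020-01-21")],
    schema=["id", "input_timestamp"])
# Parse the date string into a timestamp column
df = df.withColumn("timestamp", to_timestamp("input_timestamp"))
df.printSchema()
df.show(truncate=False)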

+---+---------------+-------------------+
|id |input_timestamp|timestamp          |
+---+---------------+-------------------+
|1  |2020-01-21     |2020-01-21 00:00:00|
+---+---------------+-------------------+

"but using windows on non-timestamp columns is not supported": are you saying this from a streaming point of view? I ask because I am able to do the same thing.

Here is the solution to your problem.

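from pyspark.sql import Window
from pyspark.sql.functions import rank
# Rank the rows within each type by ascending date
windowSpec = Window.partitionBy("type").orderBy("date")
df1 = df.withColumn("rank", rank().over(windowSpec))
df1.show()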

+----+----------+-----+----+
|type|      date|value|rank|
+----+----------+-----+----+
|   1|2020-01-16|    5|   1|
|   1|2020-01-21|    6|   2|
|   2|2020-01-15|    4|   1|
|   2|2020-01-20|    8|   2|
+----+----------+-----+----+

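from pyspark.sql import Window, functions as F
# Keep only the row with the highest rank (the latest date) in each type
w = Window.partitionBy('type')
df1.withColumn('maxB', F.max('rank').over(w)).where(F.col('rank') == F.col('maxB')).drop('maxB').show()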

+----+----------+-----+----+
|type|      date|value|rank|
+----+----------+-----+----+
|   1|2020-01-21|    6|   2|
|   2|2020-01-20|    8|   2|
+----+----------+-----+----+
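
Note that the rank-based example above was run on a static DataFrame. For the ordering concern raised in the question, one order-independent idea is to keep, per type, the greatest (date, value) pair instead of the last row seen. In PySpark this would roughly correspond to grouping by type and taking the max of a struct of date and value, though that mapping is an assumption and not part of the original answer. The pure-Python sketch below only models the logic. The name latest_per_type is hypothetical, and the dates are assumed to be ISO 'YYYY-MM-DD' strings so that they sort chronologically.

```python
from typing import Dict, Iterable, Tuple

Row = Tuple[int, str, int]  # (type, date as ISO 'YYYY-MM-DD' string, value)


def latest_per_type(rows: Iterable[Row]) -> Dict[int, Tuple[str, int]]:
    """Keep the (date, value) pair with the greatest date for each type.

    ISO-formatted date strings sort chronologically, so plain tuple
    comparison picks the latest date regardless of arrival order.
    """
    latest: Dict[int, Tuple[str, int]] = {}
    for type_, date, value in rows:
        candidate = (date, value)
        # Tuples compare by date first, then by value as a tie-breaker
        if type_ not in latest or candidate > latest[type_]:
            latest[type_] = candidate
    return latest


# Rows from the example DataFrame, deliberately in a shuffled order
rows = [
    (2, "2020-01-15", 4),
    (1, "2020-01-21", 6),
    (2, "2020-01-20", 8),
    (1, "2020-01-16", 5),
]
result = latest_per_type(rows)
print(result)
```

Because the comparison only looks at the values themselves, feeding the same rows in any other order gives the same result, which is the property that last() does not offer.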
