I have a requirement to replace a column value with the previous record's value. I have implemented this using a window function, but I want to improve performance. Could you please advise if there is an alternative approach?
from pyspark.sql import SparkSession, Window, DataFrame
from pyspark.sql.types import *
from pyspark.sql import functions as F
source = [(1,2,3),(2,3,4),(1,3,4)]
target = [(1,3,1),(3,4,1)]
schema = ['key','col1','col2']
source_df = spark.createDataFrame(source, schema=schema)
target_df = spark.createDataFrame(target, schema=schema)
df = source_df.unionAll(target_df)
window = Window.partitionBy(F.col('key')).orderBy(F.col('col2').asc())
df = df.withColumn('col1_prev', F.lag(F.col('col1')).over(window)) \
    .withColumn('col1', F.col('col1_prev'))
df.show()
+---+----+----+
|key|col1|col2|
+---+----+----+
|  1|   3|   1|
|  1|   2|   1|
|  1|   3|   3|
|  2|   3|   4|
|  3|   4|   1|
+---+----+----+
You could use the last function over a specified interval, say the last 2 rows of the window. Here I will set the interval to the maximum size as an example:
import sys
window = Window.partitionBy('key')\
.orderBy('col2')\
.rowsBetween(-sys.maxsize, -1)
df = df.withColumn('col1_prev', F.last(df['col1_prev'], ignorenulls=True).over(window))
I hope this resolves your problem.