New column with previous rows' values
I'm working with PySpark and I have a DataFrame like this:
+---+-----+
| id|value|
+---+-----+
|  1|   65|
|  2|   66|
|  3|   65|
|  4|   68|
|  5|   71|
+---+-----+
I want to generate a DataFrame like this with PySpark, where prev_value lists the values of all earlier rows, most recent first:
+---+-----+-------------+
| id|value|   prev_value|
+---+-----+-------------+
|  1|   65|         null|
|  2|   66|           65|
|  3|   65|        66,65|
|  4|   68|     65,66,65|
|  5|   71|  68,65,66,65|
+---+-----+-------------+
Here is one way:
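import pyspark.sql.functions as f
from pyspark.sql.window import Window
from pyspark.sql.types import StringType
# define a window over all rows ordered by id, then collect a running list of the lagged value
# (collect_list skips nulls, so the first row ends up with an empty list)
win = Window.partitionBy().orderBy(f.col('id'))
df = df.withColumn('prev_value', f.collect_list(f.lag('value').over(win)).over(win))
# now define a udf that reverses each list (most recent value first) and joins it with commas;
# an empty list becomes the string 'null'
concat = f.udf(lambda x: 'null' if len(x)==0 else ','.join([str(elt) for elt in x[::-1]]), StringType())
df = df.withColumn('prev_value', concat('prev_value'))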