Window function with dynamic lag
I am looking at the window slide function for a Spark DataFrame in Spark SQL. I have a dataframe with the columns id, month and volume.
id month volume new_col
1 201601 100 0
1 201602 120 100
1 201603 450 220
1 201604 200 670
1 201605 121 870
Now I want to add a new column named new_col, where the value of new_col is the sum of volume and new_col from the previous row, as shown above. The value of new_col in the first row will be zero.
I tried the option below, using the window function lag in PySpark. But I found that the new_col column would be referenced recursively, so using the lag function alone cannot do this:
window = Window.partitionBy(F.col('id')).orderBy(F.col('month').asc())
# Fails: new_col does not exist yet when this expression is evaluated,
# and Spark cannot reference a column recursively.
df.withColumn('new_col', F.lag(F.col('volume'), 1).over(window) + F.lag(F.col('new_col'), 1).over(window))
Is there a way to dynamically lag the new_col by using window functions? Or are there any other good solutions?
You can use lag and sum over a window to achieve this. sum automatically computes the cumulative sum when used over an ordered window. This works because the recursive definition of new_col unrolls to the running total of all previous volume values. The code below first lags the volume column and then takes its cumulative sum, but doing the operations in the opposite order is also possible.
window = Window.partitionBy(F.col('id')).orderBy(F.col('month').asc())
df.withColumn('new_col', F.sum(F.lag(F.col('volume'), 1, 0).over(window)).over(window))
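For reference, a sketch of the opposite order mentioned above (cumulative sum first, then lag), assuming the same df and imports; the names cum_window and running_total are illustrative. The running total includes the current row, so lagging it by one row with a default of 0 should yield the same result:

cum_window = Window.partitionBy(F.col('id')).orderBy(F.col('month').asc())
running_total = F.sum(F.col('volume')).over(cum_window)  # cumulative sum including the current row
df.withColumn('new_col', F.lag(running_total, 1, 0).over(cum_window))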
You can use nested window functions:
>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>>
>>> data = sc.parallelize([
... (1,'201601',100),
... (1,'201602',120),
... (1,'201603',450),
... (1,'201604',200),
... (1,'201605',121)])
>>> col = ['id','month', 'volume']
>>>
>>> df = spark.createDataFrame(data, col)
>>> df.show()
+---+------+------+
| id| month|volume|
+---+------+------+
| 1|201601| 100|
| 1|201602| 120|
| 1|201603| 450|
| 1|201604| 200|
| 1|201605| 121|
+---+------+------+
>>> window1 = Window.partitionBy('id').orderBy('month')
>>> window2 = Window.partitionBy('id').orderBy('month').rangeBetween(Window.unboundedPreceding, 0)
>>> df = df.withColumn('new_col', F.sum(F.lag('volume').over(window1)).over(window2)).na.fill({'new_col': 0})
>>> df.show()
+---+------+------+-------+
| id| month|volume|new_col|
+---+------+------+-------+
| 1|201601| 100| 0|
| 1|201602| 120| 100|
| 1|201603| 450| 220|
| 1|201604| 200| 670|
| 1|201605| 121| 870|
+---+------+------+-------+
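For comparison, a sketch of a variant without lag, assuming the same df and imports as above (window3 is an illustrative name): shift the window frame itself so the sum only covers the rows strictly before the current one.

>>> # Frame covers all rows strictly before the current row; it is empty for the first row.
>>> window3 = Window.partitionBy('id').orderBy('month').rowsBetween(Window.unboundedPreceding, -1)
>>> # sum over an empty frame is null, so coalesce it to 0 for the first row.
>>> df = df.withColumn('new_col', F.coalesce(F.sum('volume').over(window3), F.lit(0)))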