
Window function with dynamic lag

I am looking at window functions for a Spark DataFrame in Spark SQL.

I have a DataFrame with columns id, month and volume.

id       month   volume new_col
1        201601  100     0
1        201602  120   100
1        201603  450   220
1        201604  200   670
1        201605  121   870

Now I want to add a new column named new_col. The value of new_col in each row is the sum of volume and new_col from the row before the current one, as shown above. The value of new_col in the first row will be zero.
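In other words, new_col is a cumulative sum of volume shifted down by one row. A minimal plain-Python sketch of that recurrence, using only the sample values from the table above (purely illustrative, not part of the Spark solution):

volume = [100, 120, 450, 200, 121]
# new_col[i] = volume[0] + ... + volume[i-1], with new_col[0] = 0
new_col = [sum(volume[:i]) for i in range(len(volume))]
print(new_col)  # [0, 100, 220, 670, 870]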

I tried the option below, using the window function lag in PySpark. But I found that the new_col column would have to be referenced recursively, so this cannot be done with the lag function alone:

window = Window.partitionBy(F.col('id')).orderBy(F.col('month').asc())
# This fails: new_col does not exist yet, so it cannot itself be lagged
df.withColumn('new_col', F.lag(F.col('volume'), 1).over(window) + F.lag(F.col('new_col'), 1).over(window))

Is there a way to dynamically lag new_col by using window functions? Or are there any other good solutions?

You can use lag and sum over a window to achieve this. sum automatically computes the cumulative sum when used over a window. The code below first lags the volume column and then takes its cumulative sum, but doing the operations in the opposite order is also possible.

window = Window.partitionBy(F.col('id')).orderBy(F.col('month').asc())
# lag(volume, 1, 0) shifts volume down one row (0 for the first row); sum over the same window gives the running total
df.withColumn('new_col', F.sum(F.lag(F.col('volume'), 1, 0).over(window)).over(window))
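Note that passing 0 as the third argument of lag gives the first row of each partition a lagged value of 0 instead of null, so the running sum starts at 0 and no extra null handling is needed.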

You can use nested window functions:

>>> from pyspark.sql.window import Window
>>> import pyspark.sql.functions as F
>>> 
>>> data = sc.parallelize([
...     (1,'201601',100),
...     (1,'201602',120),
...     (1,'201603',450),
...     (1,'201604',200),
...     (1,'201605',121)])
>>> col = ['id','month', 'volume']
>>> 
>>> df = spark.createDataFrame(data, col)
>>> df.show()
+---+------+------+
| id| month|volume|
+---+------+------+
|  1|201601|   100|
|  1|201602|   120|
|  1|201603|   450|
|  1|201604|   200|
|  1|201605|   121|
+---+------+------+

>>> window1 = Window.partitionBy('id').orderBy('month')
>>> window2 = Window.partitionBy('id').orderBy('month').rangeBetween(Window.unboundedPreceding, 0)
>>> df = df.withColumn('new_col', F.sum(F.lag('volume').over(window1)).over(window2)).na.fill({'new_col': 0})
>>> df.show()
+---+------+------+-------+                                                     
| id| month|volume|new_col|
+---+------+------+-------+
|  1|201601|   100|      0|
|  1|201602|   120|    100|
|  1|201603|   450|    220|
|  1|201604|   200|    670|
|  1|201605|   121|    870|
+---+------+------+-------+
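Here window1 (with no explicit frame) is used only for lag, while window2 runs from the start of the partition to the current row so that sum yields a running total; na.fill then replaces the null that lag leaves in the first row of each partition with 0.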
