Perform Lag over multiple columns using PySpark
I'm fairly new to PySpark, but I am trying to use best practices in my code. I have a PySpark dataframe and I would like to lag multiple columns, replacing the original values with the lagged values. Example:
ID date value1 value2 value3
1 2021-12-23 1.1 4.0 2.2
2 2021-12-21 2.4 1.6 11.9
1 2021-12-24 5.4 3.2 7.8
2 2021-12-22 4.2 1.4 9.0
1 2021-12-26 2.3 5.2 7.6
.
.
.
I'd like to take all values according to ID, order them by date, then lag the values by some amount. The code I have so far:
from pyspark.sql import functions as F, Window
window = Window.partitionBy(F.col("ID")).orderBy(F.col("date"))
valueColumns = ['value1', 'value2', 'value3']
df = F.lag(valueColumns, offset=shiftAmount).over(window)
My desired output would be:
ID date value1 value2 value3
1 2021-12-23 Null Null Null
2 2021-12-21 Null Null Null
1 2021-12-24 1.1 4.0 2.2
2 2021-12-22 2.4 1.6 11.9
1 2021-12-26 5.4 3.2 7.8
.
.
.
The problem I'm having is that, from what I can find, F.lag only accepts a single column. I'm looking for suggestions on how to best accomplish this. I suppose I could use a for loop to just append shifted columns or something, but this seems pretty inelegant. Thanks!
A simple list comprehension on column names should do the job:
df = df.select(
    "ID", "date",
    *[F.lag(c, offset=shiftAmount).over(window).alias(c) for c in valueColumns]
)
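To sanity-check what the lagged columns should contain, here is a rough plain-Python sketch of the same semantics (partition by ID, order by date, read each value from `offset` rows earlier). It does not use Spark at all: `lag_by_group` is a hypothetical helper written only for illustration, the shift of 1 is an assumption (the question leaves shiftAmount unspecified), and the sketch only handles non-negative offsets.

```python
from collections import defaultdict


def lag_by_group(rows, key, order_by, columns, offset=1):
    """Plain-Python stand-in for lag over a partitioned, ordered window.

    Hypothetical helper for illustration only; assumes offset >= 0.
    """
    # Collect the row indices that belong to each partition key.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        groups[row[key]].append(i)

    # Copy the rows so the input list is left untouched.
    result = [dict(row) for row in rows]
    for indices in groups.values():
        # Order the partition; ISO-formatted date strings sort chronologically.
        ordered = sorted(indices, key=lambda i: rows[i][order_by])
        for pos, i in enumerate(ordered):
            # Rows without a predecessor `offset` steps back get None.
            source = ordered[pos - offset] if pos >= offset else None
            for c in columns:
                result[i][c] = rows[source][c] if source is not None else None
    return result


# Sample rows taken from the question.
rows = [
    {"ID": 1, "date": "2021-12-23", "value1": 1.1, "value2": 4.0, "value3": 2.2},
    {"ID": 2, "date": "2021-12-21", "value1": 2.4, "value2": 1.6, "value3": 11.9},
    {"ID": 1, "date": "2021-12-24", "value1": 5.4, "value2": 3.2, "value3": 7.8},
    {"ID": 2, "date": "2021-12-22", "value1": 4.2, "value2": 1.4, "value3": 9.0},
    {"ID": 1, "date": "2021-12-26", "value1": 2.3, "value2": 5.2, "value3": 7.6},
]

lagged = lag_by_group(rows, "ID", "date", ["value1", "value2", "value3"], offset=1)
for row in lagged:
    print(row)
```

The sketch keeps the rows in their input order so they line up with the tables in the question; a Spark query makes no such ordering guarantee unless an explicit orderBy is applied to the result.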