Spark: add a new column to a dataframe using the value from the previous row

Question

I'm wondering how I can achieve the following in Spark (PySpark).

Initial Dataframe:

+--+---+
|id|num|
+--+---+
|4 |9.0|
+--+---+
|3 |7.0|
+--+---+
|2 |3.0|
+--+---+
|1 |5.0|
+--+---+

Resulting Dataframe:

+--+---+-------+
|id|num|new_Col|
+--+---+-------+
|4 |9.0|  7.0  |
+--+---+-------+
|3 |7.0|  3.0  |
+--+---+-------+
|2 |3.0|  5.0  |
+--+---+-------+

I can generally "append" new columns to a dataframe by using something like: df.withColumn("new_Col", df.num * 10)

However, I have no idea how to achieve this "shift of rows" for the new column, so that the new column has the value of a field from the previous row (as shown in the example). I also couldn't find anything in the API documentation on how to access a certain row in a DF by index.

Any help would be appreciated.

Answer 1

You can use the lag window function as follows:

from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])
w = Window().partitionBy().orderBy(col("id"))
df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()

## +---+---+-------+
## | id|num|new_col|
## +---+---+-------+
## |  2|3.0|    5.0|
## |  3|7.0|    3.0|
## |  4|9.0|    7.0|
## +---+---+-------+

But there are some important issues:

1. If you need a global operation (not partitioned by some other column or columns), it is extremely inefficient, because an unpartitioned window moves all rows into a single partition.
2. You need a natural way to order your data.

While the second issue is almost never a problem, the first one can be a deal-breaker. If this is the case, you should simply convert your DataFrame to an RDD and compute lag manually. See for example:

How to transform data with sliding window over time series data in Pyspark
Apache Spark Moving Average (written in Scala, but can be adjusted for PySpark. Be sure to read the comments first).

Answer 2

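import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val df = sc.parallelize(Seq((4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0))).toDF("id", "num")
df.show
// +---+---+
// | id|num|
// +---+---+
// |  4|9.0|
// |  3|7.0|
// |  2|3.0|
// |  1|5.0|
// +---+---+

// The original answer did not define w; ordering by id is assumed from the output below.
val w = Window.orderBy("id")

// lag("num", 1, 0) uses 0 as the default when there is no previous row.
df.withColumn("new_column", lag("num", 1, 0).over(w)).show
// +---+---+----------+
// | id|num|new_column|
// +---+---+----------+
// |  1|5.0|       0.0|
// |  2|3.0|       5.0|
// |  3|7.0|       3.0|
// |  4|9.0|       7.0|
// +---+---+----------+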

Answer 1
36 Accepted 2015-12-15 17:48:55

Answer 2
-1 2018-10-15 11:02:09