Spark: add a new column to a DataFrame with the value from the previous row
I'm wondering how I can achieve the following in Spark (PySpark).
Initial DataFrame:
+--+---+
|id|num|
+--+---+
|4 |9.0|
+--+---+
|3 |7.0|
+--+---+
|2 |3.0|
+--+---+
|1 |5.0|
+--+---+
Resulting DataFrame:
+--+---+-------+
|id|num|new_Col|
+--+---+-------+
|4 |9.0| 7.0 |
+--+---+-------+
|3 |7.0| 3.0 |
+--+---+-------+
|2 |3.0| 5.0 |
+--+---+-------+
I can generally "append" new columns to a DataFrame by using something like: df.withColumn("new_Col", df.num * 10)
However, I have no idea how to achieve this "shift of rows" for the new column, so that the new column holds the value of a field from the previous row (as shown in the example). I also couldn't find anything in the API documentation on how to access a particular row in a DataFrame by index.
Any help would be appreciated.
You can use the lag window function as follows:
from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window
df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])
w = Window().partitionBy().orderBy(col("id"))
df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()
## +---+---+-------+
## | id|num|new_col|
## +---+---+-------+
## |  2|3.0|    5.0|
## |  3|7.0|    3.0|
## |  4|9.0|    7.0|
## +---+---+-------+
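To make the semantics concrete, here is a minimal plain-Python sketch (no Spark involved) of what lag over a window ordered by id computes for the sample data. The helper name lag_by_id and its parameters are hypothetical, chosen only for illustration; it assumes a positive offset and (id, num) tuples.

```python
from typing import List, Optional, Tuple


def lag_by_id(
    rows: List[Tuple[int, float]],
    offset: int = 1,
    default: Optional[float] = None,
) -> List[Tuple[int, float, Optional[float]]]:
    """Sort rows by id and attach the num found `offset` rows earlier.

    Hypothetical helper: mimics lag("num", offset, default) over a window
    ordered by id. Assumes offset >= 1.
    """
    ordered = sorted(rows, key=lambda r: r[0])
    result = []
    for i, (row_id, num) in enumerate(ordered):
        # Rows with no predecessor at distance `offset` get the default.
        prev = ordered[i - offset][1] if i - offset >= 0 else default
        result.append((row_id, num, prev))
    return result


rows = [(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]
with_lag = lag_by_id(rows)

# Dropping rows whose lagged value is missing plays the role of .na.drop()
# here, since only the new column can contain nulls in this example.
without_nulls = [r for r in with_lag if r[2] is not None]
print(without_nulls)
```

The row with id 1 has no predecessor, so it receives the default (None) and is then filtered out, which is why the Spark output above starts at id 2.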
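However, there are some important issues:
1. If you need a global operation (not partitioned by some other column or columns), it is extremely inefficient.
2. You need a natural way to order your data.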
While the second issue is almost never a problem, the first one can be a deal-breaker. If this is the case, you should simply convert your DataFrame to an RDD and compute lag manually.
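One way the manual computation could work is sketched below in plain Python, with a list of lists standing in for the partitions of an RDD that has already been sorted by id. This is an illustration under stated assumptions, not the method the answer had in mind: the function names boundary_values and lag_partition are hypothetical. In Spark, the first step would roughly correspond to collecting the last element of each partition to the driver, and the second to a per-partition pass such as mapPartitionsWithIndex.

```python
from typing import List, Optional, Tuple

Row = Tuple[int, float]
LaggedRow = Tuple[int, float, Optional[float]]


def boundary_values(partitions: List[List[Row]]) -> List[Optional[float]]:
    """For each partition, find the num that immediately precedes it.

    Only the last row of each partition is inspected, so in a real job
    this step would move very little data. Empty partitions are skipped.
    """
    carry_in = []
    last = None
    for part in partitions:
        carry_in.append(last)
        if part:
            last = part[-1][1]
    return carry_in


def lag_partition(part: List[Row], carry: Optional[float]) -> List[LaggedRow]:
    """Compute lag within one partition, seeded with the carried-in value."""
    out = []
    prev = carry
    for row_id, num in part:
        out.append((row_id, num, prev))
        prev = num
    return out


# Assumption: data is globally sorted by id across partitions.
partitions = [[(1, 5.0), (2, 3.0)], [], [(3, 7.0), (4, 9.0)]]
carries = boundary_values(partitions)

# Each partition can now be processed independently of the others.
lagged = [lag_partition(p, c) for p, c in zip(partitions, carries)]
print(lagged)
```

The point of splitting the work this way is that no single partition has to hold all rows, which is what makes the unpartitioned window expensive.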
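The same approach in Scala, using 0 as the default value for the first row:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val df = sc.parallelize(Seq((4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0))).toDF("id", "num")
df.show
// +---+---+
// | id|num|
// +---+---+
// |  4|9.0|
// |  3|7.0|
// |  2|3.0|
// |  1|5.0|
// +---+---+

// Window ordered by id; with no partitioning, all rows end up in a single partition
val w = Window.orderBy("id")
df.withColumn("new_column", lag("num", 1, 0).over(w)).show
// +---+---+----------+
// | id|num|new_column|
// +---+---+----------+
// |  1|5.0|       0.0|
// |  2|3.0|       5.0|
// |  3|7.0|       3.0|
// |  4|9.0|       7.0|
// +---+---+----------+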