
Spark add new column to dataframe with value from previous row

I'm wondering how I can achieve the following in Spark (PySpark).

Initial DataFrame:

+--+---+
|id|num|
+--+---+
|4 |9.0|
|3 |7.0|
|2 |3.0|
|1 |5.0|
+--+---+

Resulting DataFrame:

+--+---+-------+
|id|num|new_Col|
+--+---+-------+
|4 |9.0|  7.0  |
|3 |7.0|  3.0  |
|2 |3.0|  5.0  |
+--+---+-------+

I generally manage to "append" new columns to a dataframe by using something like df.withColumn("new_Col", df.num * 10).

However, I have no idea how I can achieve this "shift of rows" for the new column, so that the new column has the value of a field from the previous row (as shown in the example). I also couldn't find anything in the API documentation on how to access a certain row in a DataFrame by index.

Any help would be appreciated.

You can use the lag window function as follows:

from pyspark.sql.functions import lag, col
from pyspark.sql.window import Window

df = sc.parallelize([(4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0)]).toDF(["id", "num"])
# An empty partitionBy() gives a single global window, ordered by id
w = Window.partitionBy().orderBy(col("id"))
# lag("num") pulls num from the previous row; the first row has no
# predecessor and gets null, which na.drop() removes
df.select("*", lag("num").over(w).alias("new_col")).na.drop().show()

## +---+---+-------+
## | id|num|new_col|
## +---+---+-------+
## |  2|3.0|    5.0|
## |  3|7.0|    3.0|
## |  4|9.0|    7.0|
## +---+---+-------+
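
If you would rather keep the first row than drop it, PySpark's lag also accepts a default value; a small variation on the snippet above:

# A default of 0.0 replaces the null in the first row, making na.drop() unnecessary
df.select("*", lag("num", 1, 0.0).over(w).alias("new_col")).show()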

That said, there are some important issues with the window approach:

  1. if you need a global operation (not partitioned by some other column or columns), it is extremely inefficient: without a partitionBy, Spark moves all the data to a single partition to evaluate the window (a partitioned sketch follows this list).
  2. you need a natural way to order your data.
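
When the data does carry a grouping column, partitioning the window keeps the computation distributed. A minimal sketch, using a hypothetical group column that is not part of the question's data:

# "group" is illustrative only; each window now spans a single group,
# so Spark can compute lag for every group in parallel
df2 = sc.parallelize([("a", 1, 5.0), ("a", 2, 3.0),
                      ("b", 1, 7.0), ("b", 2, 9.0)]).toDF(["group", "id", "num"])
w2 = Window.partitionBy("group").orderBy(col("id"))
df2.select("*", lag("num").over(w2).alias("new_col")).show()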

While the second issue is almost never a problem, the first one can be a deal-breaker. If that is the case, you should simply convert your DataFrame to an RDD and compute lag manually, for example as in the sketch below.
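
A minimal sketch of computing lag by hand through the RDD API, assuming the same toy df as above and an active SparkSession; the indexed/current/shifted names are illustrative:

from pyspark.sql import Row

# Sort first so zipWithIndex hands out indices in id order
indexed = df.orderBy("id").rdd.zipWithIndex()             # (Row, idx)
current = indexed.map(lambda p: (p[1], p[0]))             # key each row by its index
shifted = indexed.map(lambda p: (p[1] + 1, p[0]["num"]))  # previous num, keyed one ahead
# The inner join drops the first row (no predecessor), mirroring na.drop()
result = (current.join(shifted)
          .map(lambda p: Row(id=p[1][0]["id"], num=p[1][0]["num"], new_col=p[1][1]))
          .toDF())
result.show()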

The same approach in Scala, where lag takes a default value so the first row is kept instead of dropped:

import org.apache.spark.sql.functions.lag
import org.apache.spark.sql.expressions.Window

val w = Window.orderBy("id")
val df = sc.parallelize(Seq((4, 9.0), (3, 7.0), (2, 3.0), (1, 5.0))).toDF("id", "num")
df.show
+---+---+
| id|num|
+---+---+
|  4|9.0|
|  3|7.0|
|  2|3.0|
|  1|5.0|
+---+---+
// The default 0 stands in for the missing previous value on the first row
df.withColumn("new_column", lag("num", 1, 0).over(w)).show
+---+---+----------+
| id|num|new_column|
+---+---+----------+
|  1|5.0|       0.0|
|  2|3.0|       5.0|
|  3|7.0|       3.0|
|  4|9.0|       7.0|
+---+---+----------+
