How to get value from previous calculated column in Pyspark / Python data set

I am trying to create a new column (B) in a Pyspark / Python table. The new column (B) is the sum of: the current value of column (A) + the previous value of column (B).

Desired output example:

Id   a     b
1    977   977
2    3665  4642
3    1746  6388
4    2843  9231
5    200   9431

Current Col B = current Col A + previous Col B; for example, Row 4: 9231 (Col B) = 2843 (Col A) + 6388 (previous Col B value).

(For the 1st row, since there is no previous value for B, it is taken as 0.)

Please help me with the Python / PySpark query code.

Without more context I may be wrong, but it seems you're trying to do a cumulative sum of column A:

from pyspark.sql.window import Window
import pyspark.sql.functions as sf

# Running (cumulative) sum of column A; ordering by 'Id' makes "previous" well defined
df = df.withColumn('B', sf.sum(df.A).over(
    Window.partitionBy().orderBy('Id').rowsBetween(Window.unboundedPreceding, 0)))
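For reference, a minimal self-contained sketch that should reproduce the desired output on the sample data from the question; the SparkSession setup, the column names Id / a / b taken from the example table, and ordering by Id to define "previous" are assumptions made for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.window import Window
import pyspark.sql.functions as sf

spark = SparkSession.builder.getOrCreate()

# Sample data taken from the question
df = spark.createDataFrame(
    [(1, 977), (2, 3665), (3, 1746), (4, 2843), (5, 200)],
    ['Id', 'a'])

# Cumulative sum of 'a' ordered by 'Id' (no partitioning, so Spark will warn
# that all data is moved to a single partition; fine for a small example)
w = Window.orderBy('Id').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn('b', sf.sum('a').over(w)).orderBy('Id')
df.show()
# Expected 'b' column: 977, 4642, 6388, 9231, 9431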

EDIT:

If you need to iteratively add new rows based on the last value of B, and assuming the value of B already in the dataframe doesn't change in the meantime, I think you'd be better off keeping track of B in a standard Python variable and building the next row from that.

previous_B = 0
# ... your code to get new_A ...
previous_B += new_A
# Note: for the union below, the new row must have the same number and order of columns as df
new_row = spark.createDataFrame([(new_A, previous_B)])
df = df.union(new_row)
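As a follow-up on that idea, a small sketch continuing from the snippet above, assuming df has the three columns Id, a and b from the example; the values for the appended row (next_Id, new_A) are hypothetical, and the new row is built with the same column names and order as df because union matches columns by position:

previous_B = 9431            # last value of b from the example data
next_Id = 6                  # hypothetical next Id
new_A = 150                  # hypothetical new value of a

previous_B += new_A
# Build the new row with df's column names/order so the union lines up
new_row = spark.createDataFrame([(next_Id, new_A, previous_B)], ['Id', 'a', 'b'])
df = df.union(new_row)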
