
How to get value from previous calculated column in Pyspark / Python data set

I am trying to create a new column (B) in a PySpark / Python table. The new column (B) is the sum of the current value of column (A) and the previous value of column (B).

Desired output example:

`Id   a     b
1    977   977
2    3665  4642
3    1746  6388
4    2843  9231
5    200   9431`

Current col B = current col A + previous col B. For example, in row 4: 9231 (col B) = 2843 (col A) + 6388 (previous col B value).

(For the first row there is no previous value of B, so it is taken as 0.)

Please help me with the Python / PySpark query code

Without more context I may be wrong, but it seems you're trying to compute a cumulative sum of column A:

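from pyspark.sql.window import Window
import pyspark.sql.functions as sf

# A running total needs a defined row order; ordering by Id is assumed here.
# Without partitionBy, Spark moves all rows into a single partition.
w = Window.orderBy('Id').rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn('B', sf.sum(df.A).over(w))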
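
As a quick sanity check that a cumulative sum really matches the rule in the question (B = current A + previous B, starting from 0), here is a minimal plain-Python sketch using the sample values from the question. It does not use Spark; the variable names are only illustrative.

```python
from itertools import accumulate

# Column A from the question's sample table.
a = [977, 3665, 1746, 2843, 200]

# The rule from the question: B = current A + previous B, with previous B = 0 for the first row.
b_recurrence = []
previous_b = 0
for value in a:
    previous_b = value + previous_b
    b_recurrence.append(previous_b)

# The same column computed as a cumulative sum of A.
b_cumsum = list(accumulate(a))

print(b_recurrence)
print(b_cumsum)
```

Both lists should be identical, which is why the window sum over all preceding rows gives the desired column.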

EDIT:

If you need to append new rows iteratively based on the last value of B, and the values of B already in the dataframe do not change in the meantime, I think it is better to keep the last B in a regular Python variable and build each new row from it.

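# Start from the last value of B already in df (0 if df is empty).
previous_B = 0
# your code to get new_A (and new_id, a hypothetical Id for the new row)
previous_B += new_A
# union matches columns by position, so the new row needs the same columns as df;
# this assumes df has the columns (Id, A, B).
new_row = spark.createDataFrame([(new_id, new_A, previous_B)], df.schema)
df = df.union(new_row)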
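
If several new values arrive at once, one option is to build all the new rows in plain Python first and create the dataframe in a single call instead of one union per row. The sketch below shows only the plain-Python part; `extend_with_running_total` is a hypothetical helper, and the (Id, a, b) tuple layout and the consecutive Ids are assumptions based on the sample table.

```python
def extend_with_running_total(rows, new_values):
    """Return rows plus one (Id, a, b) tuple per new value, continuing the running total in b."""
    # Pick up the last Id and last B from the existing rows, or start from 0.
    last_id, _, previous_b = rows[-1] if rows else (0, 0, 0)
    extended = list(rows)
    for a in new_values:
        last_id += 1          # assumes Ids are consecutive integers
        previous_b += a       # B = current A + previous B
        extended.append((last_id, a, previous_b))
    return extended


existing = [(1, 977, 977), (2, 3665, 4642), (3, 1746, 6388)]
extended = extend_with_running_total(existing, [2843, 200])
print(extended[-2:])
```

The resulting list of tuples could then be passed to spark.createDataFrame once, together with the column names.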


 