Using two columns from previous row to determine column value in a pandas data frame
Derived column in pySpark using two columns and previous row's value
I want to create a new column on my Spark data frame that operates on two existing columns. The column, "Areas", should be computed with the following formula:
( (Pct_Buenos_Acum[i]-Pct_Buenos_Acum[i-1]) * (Pct_Malos_Acum[i]+Pct_Malos_Acum[i-1]) ) / 2
I have already tried this:
w = Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)
df = df.withColumn(
    'Areas',
    (((col('Pct_Acum_buenos') - col('Pct_Acum_buenos'))
      * (col('Pct_Acum_malos') + col('Pct_Acum_malos'))) / 2).over(w)
)
Here is a way to access the previous row's value in pySpark. It goes like this:
from pyspark.sql import functions as F
from pyspark.sql.functions import col
from pyspark.sql.window import Window

# add an index column to use in the window's ORDER BY
df = df.withColumn('index', F.monotonically_increasing_id())
w = Window.partitionBy().orderBy('index')
# F.lag() returns the previous row's value (null on the first row)
df = df.withColumn('Areas', ((col('Pct_Acum_buenos') - F.lag('Pct_Acum_buenos').over(w))
                             * (col('Pct_Acum_malos') + F.lag('Pct_Acum_malos').over(w))) / 2)
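As a cross-check (and matching the pandas title above), the same trapezoid-style calculation can be sketched in pandas, where `shift(1)` plays the role of Spark's `F.lag()`. The sample cumulative-percentage values below are hypothetical, just for illustration:

```python
import pandas as pd

# hypothetical cumulative percentages of "buenos" and "malos"
df = pd.DataFrame({
    'Pct_Acum_buenos': [0.0, 0.4, 0.7, 1.0],
    'Pct_Acum_malos':  [0.0, 0.2, 0.6, 1.0],
})

# shift(1) exposes the previous row's value, like F.lag() over an ordered window
df['Areas'] = ((df['Pct_Acum_buenos'] - df['Pct_Acum_buenos'].shift(1))
               * (df['Pct_Acum_malos'] + df['Pct_Acum_malos'].shift(1))) / 2
```

As with `F.lag()`, the first row has no previous value, so its `Areas` entry is NaN; fill or drop it as needed.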