
Derived column in pySpark using two columns and previous row's value

I would like to create a column on my Spark dataframe that is computed from two other columns.

I want to create the column Areas, which is calculated with the formula:

( (Pct_Buenos_Acum[i]-Pct_Buenos_Acum[i-1]) * (Pct_Malos_Acum[i]+Pct_Malos_Acum[i-1]) ) / 2
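To make the formula concrete, here is a minimal plain-Python sketch of the same calculation on ordinary lists. The function name trapezoid_areas and the sample values are hypothetical, chosen only for illustration. The first row has no previous row, so the sketch returns None for it.

```python
def trapezoid_areas(buenos, malos):
    """Apply the Areas formula row by row; the first row has no predecessor, so it gets None."""
    return [
        None if i == 0
        else (buenos[i] - buenos[i - 1]) * (malos[i] + malos[i - 1]) / 2
        for i in range(len(buenos))
    ]

# Hypothetical cumulative percentages (not taken from the question's data)
pct_buenos_acum = [0.2, 0.5, 1.0]
pct_malos_acum = [0.1, 0.4, 1.0]

areas = trapezoid_areas(pct_buenos_acum, pct_malos_acum)
print(areas)
```

Each value is the area of the trapezoid between two consecutive points, so the rows must already be in the intended order before the formula is applied.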

I have tried this:

w = Window.rowsBetween(Window.unboundedPreceding, Window.currentRow)

df= df.withColumn('Areas', (( ( col('Pct_Acum_buenos')-col('Pct_Acum_buenos' ) )*(col('Pct_Acum_malos')+col('Pct_Acum_malos')))/2).over(w))

Find attached a screenshot of what I have so far.

The lag window function is a way to access previous-row values in pySpark. Going by that:

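from pyspark.sql import Window
from pyspark.sql import functions as F

# add an index column to order by (the generated IDs are increasing, but not consecutive)
df = df.withColumn('index', F.monotonically_increasing_id())

# no partition columns, so the window spans the whole dataframe
w = Window.partitionBy().orderBy('index')

# lag() returns the previous row's value; the first row has none, so its Areas is null
df = df.withColumn('Areas', ((F.col('Pct_Acum_buenos') - F.lag('Pct_Acum_buenos').over(w)) * (F.col('Pct_Acum_malos') + F.lag('Pct_Acum_malos').over(w))) / 2)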

