
PySpark - For Each Unique ID and Column Condition Set Value of 1

Figure 1

Hello - I am trying to assign a value of 1 within a dataframe for the first instance of each ID where the PurchasePrice is > 0. For every instance after or before that one, the value of the column should be 0. For example, in the screenshot, for ID 123 the 'Wanted Column' is set to 1 in MonYer = 201909, since that is the first instance where the PurchasePrice is > 0. For the next observation, in 201911, the value is 0. I thought about using .groupBy or rank(), dense_rank(), but can't think of a way to do it.

Any sort of guidance or help is appreciated!

You can do so using sum in combination with a window. In the window you aggregate only the prices of the preceding rows. Using the resulting column, you can check whether a record is the first non-zero entry: the sum of the preceding rows should be zero, and the price of the record itself non-zero.

I created a sample dataset which differs slightly from yours, but it should show you the method.

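from pyspark import sql
from pyspark.sql.window import Window
from pyspark.sql import functions as f

spark = sql.SparkSession.builder.master('local').getOrCreate()

df = spark.createDataFrame(
    [[123, 201902, 0], [123, 201903, 0], [123, 201904, 100], [123, 201905, 100], [123, 201906, 0]],
    ['ID', 'MonYer', 'Price'],
)

# Frame: all rows of the same ID with a strictly smaller MonYer
# (rangeBetween with a numeric offset needs a numeric ordering column)
w = Window.partitionBy('ID').orderBy('MonYer').rangeBetween(Window.unboundedPreceding, -1)

df = (df
    # Sum of the earlier prices; the first row of an ID has an empty frame, so map null to 0
    .withColumn('sum', f.coalesce(f.sum('Price').over(w), f.lit(0)))
    # Flag the row whose own price is positive while everything before it sums to zero
    .withColumn('wanted', f.when((f.col('Price') > 0) & (f.col('sum') == 0), 1).otherwise(0))
    .drop('sum')
)

df.show()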

+---+------+-----+------+
| ID|MonYer|Price|wanted|
+---+------+-----+------+
|123|201902|    0|     0|
|123|201903|    0|     0|
|123|201904|  100|     1|
|123|201905|  100|     0|
|123|201906|    0|     0|
+---+------+-----+------+
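
To see why only the 201904 row is flagged, the window logic can be traced in plain Python for a single ID. This is only an illustrative sketch: the function name flag_first_purchase is made up here, and an empty set of earlier rows is treated as a sum of 0.

```python
# Plain-Python trace of the running-sum check for one ID (illustrative only).
rows = [(201902, 0), (201903, 0), (201904, 100), (201905, 100), (201906, 0)]


def flag_first_purchase(rows):
    """Return (MonYer, Price, wanted) tuples, processed in MonYer order."""
    flagged = []
    preceding_sum = 0  # sum of Price over all earlier rows; 0 when there are none
    for mon_yer, price in sorted(rows):
        # Same condition as in the window version: own price positive,
        # everything before it sums to zero.
        wanted = 1 if price > 0 and preceding_sum == 0 else 0
        flagged.append((mon_yer, price, wanted))
        preceding_sum += price
    return flagged


for row in flag_first_purchase(rows):
    print(row)
```

The third element of each tuple reproduces the wanted column shown in the table above: once a positive price has been added to the running sum, no later row can satisfy the condition.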

Note: this solution assumes all Price values are >= 0; a negative price could cancel out an earlier positive one in the running sum.
