
PySpark - For Each Unique ID and Column Condition Set Value of 1

[Figure 1: screenshot of the sample data with ID, MonYer, PurchasePrice, and the desired 'Wanted Column']

Hello - I am trying to assign a value of 1 within a dataframe for the first instance of each ID where PurchasePrice is > 0. For every instance before or after that one, the column's value should be 0. For example, in the screenshot below, for ID 123 the 'Wanted Column' is set to 1 at MonYer = 201909, since that is the first instance where PurchasePrice is > 0. For the next observation in 201911, the value is 0. I thought about using .groupBy or rank()/dense_rank(), but can't think of a way to make it work.

Any sort of guidance or help is appreciated!

You can do so using sum in combination with a window. In the window you aggregate only the prices of the preceding rows. Using the resulting column you can check whether the record is the first non-zero entry: the sum of the preceding rows should be zero, while the price of the record itself is non-zero.

I created a sample dataset which differs slightly from yours, but it should show you the method.

from pyspark import sql
from pyspark.sql.window import Window
from pyspark.sql import functions as f

spark = sql.SparkSession.builder.master('local').getOrCreate()

df = spark.createDataFrame([[123,201902,0],[123,201903,0],[123,201904,100],[123,201905,100],[123,201906,0]], ['ID', 'MonYer', 'Price'])

# Sum the Price over all strictly preceding rows within each ID.
w = Window.partitionBy('ID').orderBy('MonYer').rowsBetween(Window.unboundedPreceding, -1)

df = (df
    # coalesce handles the first row per ID, where the empty frame yields null
    .withColumn('prev_sum', f.coalesce(f.sum('Price').over(w), f.lit(0)))
    .withColumn('wanted', f.when((f.col('Price') > 0) & (f.col('prev_sum') == 0), 1).otherwise(0))
    .drop('prev_sum')
)

df.show()

+---+------+-----+------+                                                       
| ID|MonYer|Price|wanted|
+---+------+-----+------+
|123|201902|    0|     0|
|123|201903|    0|     0|
|123|201904|  100|     1|
|123|201905|  100|     0|
|123|201906|    0|     0|
+---+------+-----+------+

Note: this solution assumes all Price values are >= 0
