
How to create new column based on multiple when conditions over window in pyspark?

I have a Spark DataFrame (built here from a pandas one):

import pandas as pd

foo = spark.createDataFrame(pd.DataFrame({
    'id': [1, 1, 1, 2, 2, 2],
    'min_col': [2, 2, 3, 4, 5, 6],
    'raw': [1, 5, 2, 3, 4, 3],
    'max_col': [3, 4, 5, 6, 7, 8]
}))

I want to create a new column new_col which will be 1 if min(raw) < min(min_col) or max(raw) > min(max_col), otherwise 0, per id.

I tried

from pyspark.sql.window import Window
w = Window.partitionBy('id')
from pyspark.sql import functions as f
foo.withColumn('new_col', 
        f.when((f.min(f.col('raw')) < f.min(f.col('min_col'))) |
               (f.max(f.col('raw')) > f.min(f.col('max_col'))),f.lit(1)).otherwise(f.lit(0)).over(w))

But I get an error: "id is not an aggregate function". Any ideas?

You need to specify the window for the min and max functions by calling .over(w) on each aggregate:

from pyspark.sql import functions as F, Window

w = Window.partitionBy('id')

df2 = foo.withColumn(
    'new_col',
    F.when(
        (F.min('raw').over(w) < F.min('min_col').over(w)) |
        (F.max('raw').over(w) > F.min('max_col').over(w)), 1
    ).otherwise(0)
)

df2.show()
+---+-------+---+-------+-------+
| id|min_col|raw|max_col|new_col|
+---+-------+---+-------+-------+
|  1|      2|  1|      3|      1|
|  1|      2|  5|      4|      1|
|  1|      3|  2|      5|      1|
|  2|      4|  3|      6|      1|
|  2|      5|  4|      7|      1|
|  2|      6|  3|      8|      1|
+---+-------+---+-------+-------+
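Since the sample data in the question is built with pandas, the same per-id logic can be sanity-checked locally without a Spark session. This is just a quick cross-check, not part of the original answer: groupby(...).transform plays the role of the Spark window, broadcasting each group's aggregate back to every row of that group.

```python
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'min_col': [2, 2, 3, 4, 5, 6],
                    'raw': [1, 5, 2, 3, 4, 3],
                    'max_col': [3, 4, 5, 6, 7, 8]})

# transform('min')/transform('max') return a per-row Series aligned with
# foo, holding each row's group-level aggregate -- the pandas analogue of
# F.min(...).over(Window.partitionBy('id')).
g = foo.groupby('id')
cond = ((g['raw'].transform('min') < g['min_col'].transform('min'))
        | (g['raw'].transform('max') > g['max_col'].transform('min')))
foo['new_col'] = cond.astype(int)
print(foo['new_col'].tolist())  # -> [1, 1, 1, 1, 1, 1]
```

Both groups satisfy the first condition (min(raw) is 1 < 2 for id 1 and 3 < 4 for id 2), which is why every row in the Spark output above gets new_col = 1.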
