How to create new column based on multiple when conditions over window in pyspark?

I have a Spark dataframe:

import pandas as pd
# `spark` is assumed to be an existing SparkSession
foo = spark.createDataFrame(pd.DataFrame({'id': [1,1,1,2,2,2], 'min_col': [2,2,3,4,5,6], 'raw': [1,5,2,3,4,3],
'max_col': [3,4,5,6,7,8]}))

I want to create a new column new_col, computed by id, which will be 1 if min(raw) < min(min_col) or if max(raw) > min(max_col), and 0 otherwise.

I tried:

from pyspark.sql.window import Window
w = Window.partitionBy('id')
from pyspark.sql import functions as f
foo.withColumn('new_col', 
        f.when((f.min(f.col('raw')) < f.min(f.col('min_col'))) |
               (f.max(f.col('raw')) > f.min(f.col('max_col'))),f.lit(1)).otherwise(f.lit(0)).over(w))

But I get an error: id is not an aggregate function. Any ideas?

You need to specify the window for each of the functions min and max individually, rather than applying over(w) to the whole when expression:

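from pyspark.sql import functions as F, Window

# Same window as in the question: one partition per id
w = Window.partitionBy('id')

# df is the Spark dataframe from the question (called foo there)
df2 = df.withColumn(
    'new_col',
    F.when(
        # apply .over(w) to each aggregate, not to the whole when() expression
        (F.min('raw').over(w) < F.min('min_col').over(w)) |
        (F.max('raw').over(w) > F.min('max_col').over(w)), 1
    ).otherwise(0)
)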

df2.show()
+---+-------+---+-------+-------+
| id|min_col|raw|max_col|new_col|
+---+-------+---+-------+-------+
|  1|      2|  1|      3|      1|
|  1|      2|  5|      4|      1|
|  1|      3|  2|      5|      1|
|  2|      4|  3|      6|      1|
|  2|      5|  4|      7|      1|
|  2|      6|  3|      8|      1|
+---+-------+---+-------+-------+
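
As a cross-check on the output above, the same per-id rule can be worked through in plain Python, without Spark. This is only a sketch for verifying the expected values, not part of the answer's method; the helper name flag_by_id and the tuple layout of the rows are assumptions made for illustration.

```python
from collections import defaultdict

# Same sample rows as in the question: (id, min_col, raw, max_col)
rows = [
    (1, 2, 1, 3),
    (1, 2, 5, 4),
    (1, 3, 2, 5),
    (2, 4, 3, 6),
    (2, 5, 4, 7),
    (2, 6, 3, 8),
]

def flag_by_id(rows):
    """Return {id: 0 or 1} using the rule from the question (hypothetical helper)."""
    groups = defaultdict(list)
    for id_, min_col, raw, max_col in rows:
        groups[id_].append((min_col, raw, max_col))

    flags = {}
    for id_, vals in groups.items():
        min_cols = [v[0] for v in vals]
        raws = [v[1] for v in vals]
        max_cols = [v[2] for v in vals]
        # 1 if min(raw) < min(min_col) or max(raw) > min(max_col), else 0
        flags[id_] = int(min(raws) < min(min_cols) or max(raws) > min(max_cols))
    return flags

flags = flag_by_id(rows)
# Broadcast the per-id flag back to every row, as a window aggregate does
new_col = [flags[r[0]] for r in rows]
print(new_col)
```

Both groups are flagged because min(raw) is below min(min_col) in each of them (1 < 2 for id 1, 3 < 4 for id 2), which is why every row in the Spark output has new_col equal to 1.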
