How to create new column based on multiple when conditions over window in pyspark?
I have a Spark dataframe (sample data built with pandas below):
foo = pd.DataFrame({'id': [1,1,1,2,2,2], 'min_col': [2,2,3,4,5,6], 'raw': [1,5,2,3,4,3],
                    'max_col': [3,4,5,6,7,8]})
I want to create a new column new_col, computed per id, which will be 1 if min(raw) < min(min_col) or if max(raw) > min(max_col), and 0 otherwise.
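To make the expected result concrete, the rule can be checked with a quick pure-Python sketch (the row tuples simply restate the sample frame above; this is only an illustration of the per-id logic, not the Spark solution):

```python
# Sample rows as (id, min_col, raw, max_col), restating the frame above.
rows = [
    (1, 2, 1, 3), (1, 2, 5, 4), (1, 3, 2, 5),
    (2, 4, 3, 6), (2, 5, 4, 7), (2, 6, 3, 8),
]

def new_col_by_id(rows):
    """Return {id: 1 or 0} applying the rule to each id group."""
    out = {}
    for i in {r[0] for r in rows}:
        grp = [r for r in rows if r[0] == i]
        flag = (min(r[2] for r in grp) < min(r[1] for r in grp)  # min(raw) < min(min_col)
                or max(r[2] for r in grp) > min(r[3] for r in grp))  # max(raw) > min(max_col)
        out[i] = int(flag)
    return out

print(sorted(new_col_by_id(rows).items()))  # [(1, 1), (2, 1)]
```

For this sample every row ends up with 1, because min(raw) < min(min_col) holds within both groups.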
I tried:
from pyspark.sql.window import Window
from pyspark.sql import functions as f

w = Window.partitionBy('id')
foo.withColumn('new_col',
               f.when((f.min(f.col('raw')) < f.min(f.col('min_col'))) |
                      (f.max(f.col('raw')) > f.min(f.col('max_col'))), f.lit(1))
                .otherwise(f.lit(0)).over(w))
But I get the error "id is not an aggregate function". Any ideas?
You need to specify the window for the functions min and max (attach .over(w) to each aggregate, not to the when expression):
from pyspark.sql import functions as F, Window

w = Window.partitionBy('id')  # per-id window for the aggregates

df2 = df.withColumn(
    'new_col',
    F.when(
        (F.min('raw').over(w) < F.min('min_col').over(w)) |
        (F.max('raw').over(w) > F.min('max_col').over(w)), 1
    ).otherwise(0)
)
df2.show()
+---+-------+---+-------+-------+
| id|min_col|raw|max_col|new_col|
+---+-------+---+-------+-------+
| 1| 2| 1| 3| 1|
| 1| 2| 5| 4| 1|
| 1| 3| 2| 5| 1|
| 2| 4| 3| 6| 1|
| 2| 5| 4| 7| 1|
| 2| 6| 3| 8| 1|
+---+-------+---+-------+-------+
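For comparison, since the sample `foo` in the question is built with pandas, the same per-id rule can be expressed there with groupby().transform, which broadcasts each group aggregate back to every row much like a Spark window (a sketch, using the question's column names):

```python
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'min_col': [2, 2, 3, 4, 5, 6],
                    'raw': [1, 5, 2, 3, 4, 3],
                    'max_col': [3, 4, 5, 6, 7, 8]})

g = foo.groupby('id')
# transform('min') / transform('max') return a per-row Series aligned to foo,
# holding the group aggregate -- the pandas analogue of F.min(...).over(w).
cond = ((g['raw'].transform('min') < g['min_col'].transform('min')) |
        (g['raw'].transform('max') > g['max_col'].transform('min')))
foo['new_col'] = cond.astype(int)
print(foo['new_col'].tolist())  # [1, 1, 1, 1, 1, 1]
```

This reproduces the Spark output above: every row gets 1 for this sample.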