I have a Spark dataframe; here is the sample data as a pandas DataFrame:

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2], 'min_col': [2, 2, 3, 4, 5, 6],
                    'raw': [1, 5, 2, 3, 4, 3], 'max_col': [3, 4, 5, 6, 7, 8]})

I want to create a new column new_col that is 1 if min(raw) < min(min_col) or max(raw) > min(max_col), computed per id, and 0 otherwise. I tried:
from pyspark.sql.window import Window
w = Window.partitionBy('id')
from pyspark.sql import functions as f
foo.withColumn(
    'new_col',
    f.when(
        (f.min(f.col('raw')) < f.min(f.col('min_col'))) |
        (f.max(f.col('raw')) > f.min(f.col('max_col'))), f.lit(1)
    ).otherwise(f.lit(0)).over(w)
)
But I get the error "id is not an aggregate function". Any ideas?
You need to attach the window to each min and max call individually with .over(w), rather than to the result of when:
from pyspark.sql import functions as F, Window

w = Window.partitionBy('id')

df2 = df.withColumn(
    'new_col',
    F.when(
        (F.min('raw').over(w) < F.min('min_col').over(w)) |
        (F.max('raw').over(w) > F.min('max_col').over(w)), 1
    ).otherwise(0)
)
df2.show()
+---+-------+---+-------+-------+
| id|min_col|raw|max_col|new_col|
+---+-------+---+-------+-------+
| 1| 2| 1| 3| 1|
| 1| 2| 5| 4| 1|
| 1| 3| 2| 5| 1|
| 2| 4| 3| 6| 1|
| 2| 5| 4| 7| 1|
| 2| 6| 3| 8| 1|
+---+-------+---+-------+-------+
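For reference, the same per-id logic can be checked in plain pandas (the format the sample data was given in) with groupby().transform, which broadcasts each group's aggregate back to every row, much like a window function. This is a sketch of the equivalent computation, not the Spark answer itself:

```python
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 1, 2, 2, 2],
                    'min_col': [2, 2, 3, 4, 5, 6],
                    'raw': [1, 5, 2, 3, 4, 3],
                    'max_col': [3, 4, 5, 6, 7, 8]})

g = foo.groupby('id')
# Per-id aggregates, repeated on every row of that id
min_raw = g['raw'].transform('min')
max_raw = g['raw'].transform('max')
min_min = g['min_col'].transform('min')
min_max = g['max_col'].transform('min')

foo['new_col'] = ((min_raw < min_min) | (max_raw > min_max)).astype(int)
```

On this sample every id satisfies the condition (for id 1, min(raw)=1 < min(min_col)=2; for id 2, min(raw)=3 < min(min_col)=4), so new_col is 1 on all rows, matching the Spark output above.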