Create new column with max value based on filtered rows with groupby in pyspark
I have a Spark dataframe:
import pandas as pd

foo = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'col': ['a', 'b', 'a', 'a', 'b'],
                    'value': [1, 5, 2, 3, 4],
                    'col_b': ['a', 'c', 'a', 'a', 'c']})
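The sample data above is built with pandas; a minimal sketch for getting it into Spark, assuming a local SparkSession (reusing the name foo for the Spark version is my own choice):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
foo = spark.createDataFrame(foo)  # schema is inferred from the pandas frame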
I want to create a new column with the max of the value column, grouped by id. But I want the max value only for the rows where col == col_b.
My resulting Spark dataframe should look like this:
foo = pd.DataFrame({'id': [1, 1, 2, 2, 2],
                    'col': ['a', 'b', 'a', 'a', 'b'],
                    'value': [1, 5, 2, 3, 4],
                    'max_value': [1, 1, 3, 3, 3],
                    'col_b': ['a', 'c', 'a', 'a', 'c']})
I have tried:
from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy('id')
foo = foo.withColumn('max_value', f.max('value').over(w))\
         .where(f.col('col') == f.col('col_b'))
But I end up losing some rows.
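The .where here runs after the window computation and drops every row where col != col_b, which is why rows go missing. One pattern that keeps all rows is to move the condition inside the aggregate: f.when without an otherwise yields null for non-matching rows, and f.max ignores nulls. A minimal sketch of that direction (my assumption, not tested against the real data):

from pyspark.sql import functions as f
from pyspark.sql.window import Window

w = Window.partitionBy('id')
# only rows where col == col_b contribute to the max; the rest become null
foo = foo.withColumn(
    'max_value',
    f.max(f.when(f.col('col') == f.col('col_b'), f.col('value'))).over(w)
)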
Any ideas?