In PySpark I need to group by ID
and create four new columns (min, max, std, ave).
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = (Window.orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0))
df = df.groupby("ID") \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w))
I have also tried:
df.groupby("ID").select('rpm', F.avg('rpm').over(w).alias('hr1_avg'))
However I am getting this error for both commands:
AttributeError: 'GroupedData' object has no attribute 'withColumn'
Is there a way to create these columns for each ID, or is my syntax incorrect?
Thanks.
You need to move the "grouping" column ID into the window definition, as a parameter to partitionBy. The groupBy is then no longer necessary.
The code
w = Window.partitionBy("ID").orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0)
df \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w)) \
.show()
should print your expected results.
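To see what the window frame rowsBetween(-4, 0) actually computes, here is a plain-Python sketch (no Spark required): for each row, ordered by Date within its ID partition, the frame covers the current row plus up to four preceding rows, and the four statistics are taken over that frame. The function name rolling_stats and the tuple layout are illustrative assumptions, not part of the Spark API.

```python
# Plain-Python sketch of the window semantics: partitionBy("ID"),
# orderBy("Date"), rowsBetween(-4, 0). Illustrative only.
from statistics import mean, stdev

def rolling_stats(rows):
    """rows: iterable of (ID, Date, rpm) tuples.
    Returns one dict per row with the four hr1_* columns added."""
    out = []
    frames = {}  # per-ID list of rpm values seen so far (the partition)
    for ID, date, rpm in sorted(rows, key=lambda r: (r[0], r[1])):
        frame = frames.setdefault(ID, [])
        frame.append(rpm)
        window = frame[-5:]  # current row + up to 4 preceding rows
        out.append({
            "ID": ID, "Date": date, "rpm": rpm,
            "hr1_ave": mean(window),
            # Spark's stddev is the sample stddev and is null for a
            # single-row frame; mirror that with None here.
            "hr1_std": stdev(window) if len(window) > 1 else None,
            "hr1_min": min(window),
            "hr1_max": max(window),
        })
    return out
```

Because the frame list is keyed by ID, rows of one ID never leak into another ID's window, which is exactly what moving ID into partitionBy achieves in the Spark version.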