In PySpark I need to group by ID
and create four new columns (min, max, std, ave).
from pyspark.sql import functions as F
from pyspark.sql.window import Window
w = (Window.orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0))
df = df.groupby("ID") \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w))
I have also tried:
df.groupby("ID").select('rpm', F.avg('rpm').over(w).alias('hr1_avg'))
However I am getting this error for both commands:
AttributeError: 'GroupedData' object has no attribute 'withColumn'
Is there a way to create these columns for each ID, or is my syntax incorrect?
Thanks.
You need to move the "grouping" column ID into the window definition, as a parameter to partitionBy. The groupBy is then no longer necessary.
The code
w = Window.partitionBy("ID").orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0)
df \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w)) \
.show()
should print your expected results.
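To see what the window frame rowsBetween(-4, 0) actually computes, here is a plain-Python sketch (no Spark required): for each row, ordered by Date within its ID partition, the frame covers the current row plus up to four preceding rows, and the four statistics are taken over that frame. The function name rolling_stats and the tuple layout are illustrative assumptions, not part of the Spark API.

```python
# Plain-Python sketch of the window semantics: partitionBy("ID"),
# orderBy("Date"), rowsBetween(-4, 0). Illustrative only.
from statistics import mean, stdev

def rolling_stats(rows):
    """rows: iterable of (ID, Date, rpm) tuples.
    Returns one dict per row with the four hr1_* columns added."""
    out = []
    frames = {}  # per-ID list of rpm values seen so far (the partition)
    for ID, date, rpm in sorted(rows, key=lambda r: (r[0], r[1])):
        frame = frames.setdefault(ID, [])
        frame.append(rpm)
        window = frame[-5:]  # current row + up to 4 preceding rows
        out.append({
            "ID": ID, "Date": date, "rpm": rpm,
            "hr1_ave": mean(window),
            # Spark's stddev is the sample stddev and is null for a
            # single-row frame; mirror that with None here.
            "hr1_std": stdev(window) if len(window) > 1 else None,
            "hr1_min": min(window),
            "hr1_max": max(window),
        })
    return out
```

Because the frame list is keyed by ID, rows of one ID never leak into another ID's window, which is exactly what moving ID into partitionBy achieves in the Spark version.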