PySpark groupBy: create new columns

In PySpark, I need to group by ID and create four new columns (min, max, std, avg).

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window.orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0))

df = df.groupby("ID") \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w))

I have also tried:

df.groupby("ID").select('rpm', f.avg('rpm').over(w).alias('hr1_avg'))

However, I am getting this error for both commands:

AttributeError: 'GroupedData' object has no attribute 'withColumn'

Is there a way to create these new columns for each ID, or is my syntax incorrect?

Thanks.

You need to move the "grouping" column ID into the window definition as a parameter of partitionBy. Then the groupBy is no longer necessary:

The code

w = Window.partitionBy("ID").orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0)

df \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w)) \
.show()

should print your expected results.
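For reference, below is a minimal, self-contained sketch of the same pattern. The sample rows are made up; only the column names (ID, Date, rpm) and the window settings follow the question:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data using the columns from the question.
df = spark.createDataFrame(
    [
        ("a", "2020-01-01 00:00:00", 100.0),
        ("a", "2020-01-01 01:00:00", 110.0),
        ("a", "2020-01-01 02:00:00", 120.0),
        ("b", "2020-01-01 00:00:00", 200.0),
        ("b", "2020-01-01 01:00:00", 210.0),
    ],
    ["ID", "Date", "rpm"],
).withColumn("Date", F.col("Date").cast("timestamp"))

# Window per ID, ordered by time, covering the current row and up to 4 preceding rows.
w = Window.partitionBy("ID").orderBy(F.col("Date").cast("long")).rowsBetween(-4, 0)

df.withColumn("hr1_ave", F.avg("rpm").over(w)) \
  .withColumn("hr1_std", F.stddev("rpm").over(w)) \
  .withColumn("hr1_min", F.min("rpm").over(w)) \
  .withColumn("hr1_max", F.max("rpm").over(w)) \
  .show()

Because the window is partitioned by ID, each row's statistics are computed only from that row and at most the 4 preceding rows with the same ID, so values from different IDs never mix.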
