
Pyspark Groupby Create Column

In PySpark I need to group by ID and create four new columns (min, max, std, ave).

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window.orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0))

df = df.groupby("ID") \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w))

I have also tried:

df.groupby("ID").select('rpm', F.avg('rpm').over(w).alias('hr1_avg'))

However I am getting this error for both commands:

AttributeError: 'GroupedData' object has no attribute 'withColumn'

Is there a way to compute these columns separately for each ID, or is my syntax incorrect?


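groupBy returns a GroupedData object, which only offers aggregation methods such as agg, not withColumn or select. To compute the statistics per ID while keeping every row, move the "grouping" column ID into the window definition as the argument to partitionBy. The groupBy is then unnecessary: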

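The code

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ID").orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0)

# Each statistic covers the current row and up to 4 preceding rows of the same ID
df = df \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w))

df.show()

should print your expected results.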
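To make the window semantics concrete, here is a rough pure-Python sketch of what partitionBy("ID") combined with orderBy("Date") and rowsBetween(-4, 0) computes. It is not Spark code: the function name rolling_stats, the sample rows, and the use of None for the standard deviation of a single-row frame are assumptions made for illustration (a sample standard deviation is undefined for one value, and how Spark reports that case may depend on the version).

```python
from collections import defaultdict
from statistics import mean, stdev


def rolling_stats(rows, size=5):
    """Illustrative stand-in for a window partitioned by ID, ordered by Date,
    with a frame of the current row plus the (size - 1) preceding rows.

    `rows` is a list of dicts with the keys "ID", "Date" and "rpm".
    """
    # partitionBy("ID"): split the rows into one group per ID.
    partitions = defaultdict(list)
    for row in rows:
        partitions[row["ID"]].append(row)

    result = []
    for part in partitions.values():
        # orderBy("Date"): each partition is sorted independently.
        part.sort(key=lambda r: r["Date"])
        for i, row in enumerate(part):
            # rowsBetween(-4, 0): current row plus up to 4 rows before it.
            frame = [r["rpm"] for r in part[max(0, i - size + 1): i + 1]]
            result.append({
                **row,
                "hr1_ave": mean(frame),
                # Sample standard deviation is undefined for a single value.
                "hr1_std": stdev(frame) if len(frame) > 1 else None,
                "hr1_min": min(frame),
                "hr1_max": max(frame),
            })
    return result


# Hypothetical sample data: two IDs, rows deliberately out of order.
rows = [
    {"ID": "a", "Date": 1, "rpm": 10.0},
    {"ID": "a", "Date": 2, "rpm": 20.0},
    {"ID": "b", "Date": 1, "rpm": 100.0},
    {"ID": "a", "Date": 3, "rpm": 30.0},
]

for r in rolling_stats(rows):
    print(r)
```

The point of the sketch is that every input row survives and only gains four columns, and that rows with a different ID never enter each other's frame. A groupBy followed by an aggregation would instead collapse each ID to a single row.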

