
Pyspark Groupby Create Column

In Pyspark I need to group by ID and create four new columns (min, max, std, ave).

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window.orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0))

df = df.groupby("ID") \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w))

I have also tried:

df.groupby("ID").select('rpm', f.avg('rpm').over(w).alias('hr1_avg'))

However, I am getting this error for both commands:

AttributeError: 'GroupedData' object has no attribute 'withColumn'

Is there a way to create these columns for each ID, or is my syntax incorrect?

Thanks.

groupby returns a GroupedData object, which only exposes aggregation methods such as agg and count, not withColumn or select; that is why you get the AttributeError. You need to move the "grouping" column ID into the window definition as the parameter for partitionBy. The groupBy is then not necessary:

The code

w = Window.partitionBy("ID").orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0)

df \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w)) \
.show()

should print your expected results.
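For completeness, here is a minimal self-contained sketch that can be run end to end. The sample data is made up purely for illustration; only the column names ID, Date and rpm are taken from the question.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, for illustration only
data = [
    ("A", "2021-01-01 00:00:00", 100.0),
    ("A", "2021-01-01 01:00:00", 110.0),
    ("A", "2021-01-01 02:00:00", 105.0),
    ("B", "2021-01-01 00:00:00", 200.0),
    ("B", "2021-01-01 01:00:00", 190.0),
]
df = spark.createDataFrame(data, ["ID", "Date", "rpm"])
df = df.withColumn("Date", F.to_timestamp("Date"))

# partitionBy("ID") restarts the rolling statistics for every ID;
# rowsBetween(-4, 0) covers the current row plus the 4 preceding rows
w = Window.partitionBy("ID").orderBy(F.col("Date").cast("long")).rowsBetween(-4, 0)

df.withColumn("hr1_ave", F.avg("rpm").over(w)) \
  .withColumn("hr1_std", F.stddev("rpm").over(w)) \
  .withColumn("hr1_min", F.min("rpm").over(w)) \
  .withColumn("hr1_max", F.max("rpm").over(w)) \
  .show()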

