
Pyspark Groupby Create Column

In Pyspark I need to group by ID and create four new columns (min, max, std, ave).

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = (Window.orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0))

df = df.groupby("ID") \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w))

I have also tried:

df.groupby("ID").select('rpm', f.avg('rpm').over(w).alias('hr1_avg'))

However, I am getting this error for both commands:

AttributeError: 'GroupedData' object has no attribute 'withColumn'

Is there a way to create these columns for each ID, or is my syntax incorrect?

Thanks.

groupby returns a GroupedData object, which only exposes aggregation methods such as agg and count, not withColumn or select; that is why you get the AttributeError. You need to move the "grouping" column ID into the window definition as the parameter for partitionBy. The groupBy is then not necessary:

The code

w = Window.partitionBy("ID").orderBy(F.col("Date").cast('long')).rowsBetween(-4, 0)

df \
.withColumn('hr1_ave', F.avg("rpm").over(w))\
.withColumn('hr1_std', F.stddev("rpm").over(w))\
.withColumn('hr1_min', F.min("rpm").over(w))\
.withColumn('hr1_max', F.max("rpm").over(w)) \
.show()

should print your expected results.
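For completeness, here is a minimal self-contained sketch that can be run end to end. The sample data is made up purely for illustration; only the column names ID, Date and rpm are taken from the question.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Hypothetical sample data, for illustration only
data = [
    ("A", "2021-01-01 00:00:00", 100.0),
    ("A", "2021-01-01 01:00:00", 110.0),
    ("A", "2021-01-01 02:00:00", 105.0),
    ("B", "2021-01-01 00:00:00", 200.0),
    ("B", "2021-01-01 01:00:00", 190.0),
]
df = spark.createDataFrame(data, ["ID", "Date", "rpm"])
df = df.withColumn("Date", F.to_timestamp("Date"))

# partitionBy("ID") restarts the rolling statistics for every ID;
# rowsBetween(-4, 0) covers the current row plus the 4 preceding rows
w = Window.partitionBy("ID").orderBy(F.col("Date").cast("long")).rowsBetween(-4, 0)

df.withColumn("hr1_ave", F.avg("rpm").over(w)) \
  .withColumn("hr1_std", F.stddev("rpm").over(w)) \
  .withColumn("hr1_min", F.min("rpm").over(w)) \
  .withColumn("hr1_max", F.max("rpm").over(w)) \
  .show()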

