PySpark 1.5 Groupby Sum for new column in Dataframe

I am trying to create a new column ("newaggCol") in a Spark DataFrame using groupBy and sum (with PySpark 1.5). My numeric columns have been cast to either Long or Double. The columns used to form the groupBy are String and Timestamp. My code is as follows:

df= df.withColumn("newaggCol",(df.groupBy([df.strCol,df.tsCol]).sum(df.longCol)))

The traceback for the error points to that line, stating:

ValueError: Cannot convert column into bool: please use '&' for 'and',     '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

I feel that I must be calling the functions incorrectly?

It is not possible using SQL aggregations, but you can easily get the desired result using window functions.
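groupBy(...).sum(...) produces a new, aggregated DataFrame with one row per group rather than a Column, so it cannot be nested inside withColumn (and GroupedData.sum expects column names as strings, not Column objects). A minimal sketch of the stand-alone aggregation, with the column names from the question:

sums = df.groupBy(df.strCol, df.tsCol).sum("longCol")  # new DataFrame, one row per group

A window function, in contrast, keeps every original row and attaches the per-group sum to each of them: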

import sys
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as sum_

# A window that spans every row within each (strCol, tsCol) group.
# PySpark 1.5 does not provide Window.unboundedPreceding / Window.unboundedFollowing
# (added in later releases), so an unbounded frame is written with -sys.maxsize / sys.maxsize.
w = (Window()
    .partitionBy(df.strCol, df.tsCol)
    .rowsBetween(-sys.maxsize, sys.maxsize))

df = df.withColumn("newaggCol", sum_(df.longCol).over(w))
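If a window function is not an option, the same column can also be produced by aggregating separately and joining the result back onto the original rows; a minimal sketch, assuming the column names from the question and a PySpark version whose join accepts a list of column names:

from pyspark.sql.functions import sum as sum_

# Aggregate once per (strCol, tsCol) group, then join the per-group sums
# back onto the original rows to obtain the same "newaggCol" column.
agg = (df.groupBy(df.strCol, df.tsCol)
         .agg(sum_(df.longCol).alias("newaggCol")))

df = df.join(agg, ["strCol", "tsCol"])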
