PySpark 1.5 Groupby Sum for new column in Dataframe

I am trying to create a new column ("newaggCol") in a Spark DataFrame using groupBy and sum (with PySpark 1.5). My numeric columns have been cast to either Long or Double. The columns used to form the groupBy are String and Timestamp. My code is as follows:

df= df.withColumn("newaggCol",(df.groupBy([df.strCol,df.tsCol]).sum(df.longCol)))

The traceback for the error points to that line, stating:

ValueError: Cannot convert column into bool: please use '&' for 'and',     '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

I feel that I must be calling the functions incorrectly?

It is not possible using SQL aggregations, but you can easily get the desired result using window functions.
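groupBy(...).sum(...) produces a new, aggregated DataFrame with one row per group rather than a Column, so it cannot be nested inside withColumn (and GroupedData.sum expects column names as strings, not Column objects). A minimal sketch of the stand-alone aggregation, with the column names from the question:

sums = df.groupBy(df.strCol, df.tsCol).sum("longCol")  # new DataFrame, one row per group

A window function, in contrast, keeps every original row and attaches the per-group sum to each of them: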

import sys
from pyspark.sql.window import Window
from pyspark.sql.functions import sum as sum_

# A window that spans every row within each (strCol, tsCol) group.
# PySpark 1.5 does not provide Window.unboundedPreceding / Window.unboundedFollowing
# (added in later releases), so an unbounded frame is written with -sys.maxsize / sys.maxsize.
w = (Window()
    .partitionBy(df.strCol, df.tsCol)
    .rowsBetween(-sys.maxsize, sys.maxsize))

df = df.withColumn("newaggCol", sum_(df.longCol).over(w))
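If a window function is not an option, the same column can also be produced by aggregating separately and joining the result back onto the original rows; a minimal sketch, assuming the column names from the question and a PySpark version whose join accepts a list of column names:

from pyspark.sql.functions import sum as sum_

# Aggregate once per (strCol, tsCol) group, then join the per-group sums
# back onto the original rows to obtain the same "newaggCol" column.
agg = (df.groupBy(df.strCol, df.tsCol)
         .agg(sum_(df.longCol).alias("newaggCol")))

df = df.join(agg, ["strCol", "tsCol"])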
