

Spark groupby sum for all columns except 1

I have a dataset with a header like this:

|State|2020-01-22|2020-01-23|2020-01-24|2020-01-25|2020-01-26|2020-01-27|2020-01-28|

and I am trying to groupBy on the State column and get the sum of the row values for each of the other columns (the number of columns stays the same). But when I do it using:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
df = df.groupBy('State').agg(F.sum())

But I get the error: sum() missing 1 required positional argument: 'col'. How do I get the sum of the row values for each column? I also tried this:

df = df.groupBy('State').agg(F.sum('2020-01-22','2020-01-23'))

and I get the error: sum() takes 1 positional argument but 2 were given.

Thank you for helping me.

Use a list comprehension to iterate over all columns except the grouping column:

 import pyspark.sql.functions as F
 df.groupBy('State').agg(*[F.sum(c).alias(f"sum_{c}") for c in df.drop('State').columns]).show()
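
For context, here is a minimal runnable sketch of that approach on a made-up toy frame (the State values and the two date columns below are invented for illustration, not taken from the question's data):

 from pyspark.sql import SparkSession
 import pyspark.sql.functions as F

 spark = SparkSession.builder.getOrCreate()
 # hypothetical toy data: one grouping column plus two date columns
 df = spark.createDataFrame(
     [("NY", 1, 2), ("NY", 3, 4), ("CA", 5, 6)],
     ["State", "2020-01-22", "2020-01-23"],
 )

 # build one sum(...) expression per non-grouping column, then unpack the list into agg()
 agg_exprs = [F.sum(c).alias(f"sum_{c}") for c in df.drop("State").columns]
 df.groupBy("State").agg(*agg_exprs).show()
 # prints one row per State with columns sum_2020-01-22 and sum_2020-01-23

The alias(f"sum_{c}") part is only there to keep the original column names readable; without it Spark names the result columns sum(2020-01-22), sum(2020-01-23), and so on.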

Simply note that the GroupedData object returned by df.groupBy() has a sum method that sums all numeric columns when called with no arguments:

>>> df.show()
+-----+---+---+
|state|  a|  b|
+-----+---+---+
|    a|  5|  5|
|    a|  6|  6|
|    b| 10| 10|
+-----+---+---+

>>> df.groupBy("state").sum().show()
+-----+------+------+
|state|sum(a)|sum(b)|
+-----+------+------+
|    b|    10|    10|
|    a|    11|    11|
+-----+------+------+
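
If only some of the columns need to be summed, GroupedData.sum also accepts specific column names; a small sketch on the same toy frame:

>>> df.groupBy("state").sum("a").show()
+-----+------+
|state|sum(a)|
+-----+------+
|    b|    10|
|    a|    11|
+-----+------+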
