Spark groupby sum for all columns except 1
I have a dataset with a header like this:
|State|2020-01-22|2020-01-23|2020-01-24|2020-01-25|2020-01-26|2020-01-27|2020-01-28|
and I am trying to groupBy on the State column and take the sum of the row values for every other column (the number of columns stays the same). But when I do it using:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
df = df.groupBy('State').agg(F.sum())
But I get the error:
sum() missing 1 required positional argument: 'col'
How do I get the sum of the row values for each column? I also tried this:
df = df.groupBy('State').agg(F.sum('2020-01-22','2020-01-23'))
and I get the error:
sum() takes 1 positional argument but 2 were given
Thank you for helping me.
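For reference, both errors come from the same signature: pyspark.sql.functions.sum takes exactly one column, so a direct fix is to pass one F.sum expression per column to .agg. A minimal sketch, using two of the date columns from the question (the alias names are just a choice to keep the original headers):
import pyspark.sql.functions as F

# one F.sum expression per column; alias keeps the original header names
df = df.groupBy('State').agg(
    F.sum('2020-01-22').alias('2020-01-22'),
    F.sum('2020-01-23').alias('2020-01-23'),
)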
Use a list comprehension to iterate over all columns except the grouping column:
# use F.sum (not Python's built-in sum); df.drop('State').columns lists every column except the grouper
df.groupBy('State').agg(*[F.sum(i).alias(f"sum_{i}") for i in df.drop('State').columns]).show()
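As a quick check, here is a self-contained sketch, assuming a local SparkSession and a hypothetical toy frame with two of the question's date columns (row order in the output may vary):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("NY", 1, 2), ("NY", 3, 4), ("CA", 5, 6)],
    ["State", "2020-01-22", "2020-01-23"],
)
df.groupBy("State").agg(
    *[F.sum(c).alias(f"sum_{c}") for c in df.drop("State").columns]
).show()
# +-----+--------------+--------------+
# |State|sum_2020-01-22|sum_2020-01-23|
# +-----+--------------+--------------+
# |   NY|             4|             6|
# |   CA|             5|             6|
# +-----+--------------+--------------+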
Simply note that the GroupedData object returned by df.groupBy() has a sum method that sums every numeric column when called with no arguments:
>>> df.show()
+-----+---+---+
|state| a| b|
+-----+---+---+
| a| 5| 5|
| a| 6| 6|
| b| 10| 10|
+-----+---+---+
>>> df.groupBy("state").sum().show()
+-----+------+------+
|state|sum(a)|sum(b)|
+-----+------+------+
| b| 10| 10|
| a| 11| 11|
+-----+------+------+
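If you only need some of the columns, GroupedData.sum also accepts column names as strings, and the generated sum(...) headers can be renamed afterwards. A sketch, assuming the same toy frame as above:
# sum only selected columns by passing their names
df.groupBy("state").sum("a").show()

# rename the generated sum(...) headers back to the plain column names
summed = df.groupBy("state").sum()
for c in df.drop("state").columns:
    summed = summed.withColumnRenamed(f"sum({c})", c)
summed.show()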