
Spark groupby sum for all columns except 1

I have a dataset with a header like this:

|State|2020-01-22|2020-01-23|2020-01-24|2020-01-25|2020-01-26|2020-01-27|2020-01-28|

and I am trying to groupBy on the State column and get the sum of the row values for each of the other columns (the number of columns stays the same). I tried:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F
df = df.groupBy('State').agg(F.sum())

But I get the error: sum() missing 1 required positional argument: 'col'. How do I get the sum of the row values for each column? I also tried this:

df = df.groupBy('State').agg(F.sum('2020-01-22','2020-01-23'))

and I get the error: sum() takes 1 positional argument but 2 were given.

Thank you for helping me.

Use a list comprehension to iterate over all columns except the grouping column:

 df.groupBy('State').agg(*[F.sum(i).alias(f"sum_{i}") for i in df.drop('State').columns]).show()
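
For reference, here is a minimal self-contained sketch of this approach. The sample DataFrame and its values are made up for illustration and are not from the question's data:

    from pyspark.sql import SparkSession
    import pyspark.sql.functions as F

    spark = SparkSession.builder.getOrCreate()

    # Toy data with the same shape as the question (values are invented)
    df = spark.createDataFrame(
        [("NY", 1, 2), ("NY", 3, 4), ("CA", 5, 6)],
        ["State", "2020-01-22", "2020-01-23"],
    )

    # Sum every column except the grouping column 'State'
    sums = df.groupBy("State").agg(
        *[F.sum(c).alias(f"sum_{c}") for c in df.drop("State").columns]
    )
    sums.show()
    # One aggregated row per State, with columns sum_2020-01-22 and sum_2020-01-23
    # (NY -> 4 and 6, CA -> 5 and 6 for this toy data)

The key piece is df.drop('State').columns, which yields every column name except the grouper, so the aggregation automatically covers however many date columns the dataset has.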

Alternatively, note that the GroupedData object returned by df.groupBy() has a sum method that sums all numeric columns when called with no arguments:

>>> df.show()
+-----+---+---+
|state|  a|  b|
+-----+---+---+
|    a|  5|  5|
|    a|  6|  6|
|    b| 10| 10|
+-----+---+---+

>>> df.groupBy("state").sum().show()
+-----+------+------+
|state|sum(a)|sum(b)|
+-----+------+------+
|    b|    10|    10|
|    a|    11|    11|
+-----+------+------+
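
If the sum(a)-style column names produced by this shortcut are inconvenient, one way to strip them back to the original names is the renaming sketch below. This step is my addition, not part of the original answer, and assumes the default sum(col) naming shown above:

    # Rename sum(a) -> a, sum(b) -> b, etc.; the grouping column is left untouched
    summed = df.groupBy("state").sum()
    for c in summed.columns:
        if c.startswith("sum(") and c.endswith(")"):
            summed = summed.withColumnRenamed(c, c[4:-1])
    summed.show()

The first answer avoids this extra step by setting the output names directly with alias() inside agg().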
