group by agg multiple columns with pyspark

I'm looking to do a groupBy with agg on the Spark DataFrame below and get the mean, max, and min of each of the columns col1, col2, and col3:

sp = spark.createDataFrame([['a',2,4,5], ['a',4,7,7], ['b',6,0,9], ['b', 2, 4, 4], ['c', 4, 4, 9]], ['id', 'col1', 'col2','col3'])

+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
|  a|   2|   4|   5|
|  a|   4|   7|   7|
|  b|   6|   0|   9|
|  b|   2|   4|   4|
|  c|   4|   4|   9|
+---+----+----+----+

I've tried sp.groupBy('id').agg({'*':'max'}) to even just get the max of all the columns, but I'm running into an error.

I've also tried sp.groupBy('id').agg({'col1':['max', 'min', 'mean']}), which is more of a traditional Pandas way to do it, but it doesn't work either.
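For reference, PySpark's dict form of agg maps each column name to a single aggregate function name given as a string, not a list of functions, which is why the Pandas-style nested spec above is rejected. A minimal sketch of what the dict form does accept:

# Dict-style agg: one aggregate function name (a string) per column.
sp.groupBy('id').agg({'col1': 'max', 'col2': 'min', 'col3': 'mean'}).show()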

The expected output would look something like this:

id  max(col1)  max(col2)  max(col3)  min(col1)  min(col2)  min(col3)  mean(col1) ..
a   4          7          7          2          4          5          3
b   6          4          9          2          0          4          4
c   4          4          9          4          4          9          4

Try this:

%%pyspark
from pyspark.sql.functions import sum, mean, min, max

# Group by id (aliased to 'identity') and compute sum, mean, min, and max
# for each of col1, col2, and col3.
SP_agg = sp.groupBy(
    sp.id.alias('identity')
    ).agg(
        sum("col1").alias("Annual_col1"),
        sum("col2").alias("Annual_col2"),
        sum("col3").alias("Annual_col3"),
        mean("col1").alias("mean_col1"),
        mean("col2").alias("mean_col2"),
        mean("col3").alias("mean_col3"),
        min("col1").alias("min_col1"),
        min("col2").alias("min_col2"),
        min("col3").alias("min_col3"),
        max("col1").alias("max_col1"),
        max("col2").alias("max_col2"),
        max("col3").alias("max_col3")
        )
SP_agg.show(10)
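If you'd rather not spell out every aggregate by hand, a more compact sketch builds the expression list with a comprehension (the names cols, aggs, and exprs are just illustrative):

from pyspark.sql import functions as F

cols = ['col1', 'col2', 'col3']
aggs = [('mean', F.mean), ('max', F.max), ('min', F.min)]

# Build one aliased expression per (function, column) pair, then unpack into agg().
exprs = [fn(c).alias(f'{name}_{c}') for name, fn in aggs for c in cols]
sp.groupBy('id').agg(*exprs).show()

This produces columns named mean_col1, max_col1, min_col1, and so on, and scales without extra typing if more columns or statistics are added.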
