
group by agg multiple columns with pyspark

I'm looking to do a groupBy and agg on the Spark DataFrame below and get the mean, max, and min of each of the col1, col2, and col3 columns.

sp = spark.createDataFrame([['a',2,4,5], ['a',4,7,7], ['b',6,0,9], ['b', 2, 4, 4], ['c', 4, 4, 9]], ['id', 'col1', 'col2','col3'])

+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
|  a|   2|   4|   5|
|  a|   4|   7|   7|
|  b|   6|   0|   9|
|  b|   2|   4|   4|
|  c|   4|   4|   9|
+---+----+----+----+

I've tried sp.groupBy('id').agg({'*':'max'}) to even just get the max of everything, but I'm running into an error.
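
As a rough sketch of the dict form that does work (reusing the sp DataFrame above): agg's dict argument maps each column name to a single aggregate function name, so it only supports one function per column, and '*' typically only makes sense with 'count'.

# Dict-style agg: one aggregate function name per column.
sp.groupBy('id').agg({'col1': 'max', 'col2': 'max', 'col3': 'max'}).show()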

I've also tried sp.groupBy('id').agg({'col1': ['max', 'min', 'mean']}), but that is more of a traditional Pandas idiom and it doesn't work here. What I'm after is the output below (a working alternative is sketched right after it).

id  max(col1)  max(col2)  max(col3)  min(col1) min(col2) min(col3) mean(col1) ..
a   4          7          7          2         4         5         3   
b   6          4          9          2         0         4         4  
c   4          4          9          4         4         9         4  
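
A minimal sketch of one way to get exactly that shape (the F alias and the exprs list are my own naming, not from the original post): build one Column expression per (statistic, column) pair and unpack the list into agg.

from pyspark.sql import functions as F

stats = ['mean', 'max', 'min']
cols = ['col1', 'col2', 'col3']

# One expression per (statistic, column) pair, aliased like "max(col1)".
exprs = [getattr(F, s)(c).alias(f'{s}({c})') for s in stats for c in cols]

sp.groupBy('id').agg(*exprs).show()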

Try this:

%%pyspark
from pyspark.sql.functions import sum, mean, min, max

# Group by id (exposed as "identity") and compute the sum, mean, min,
# and max of each column, giving every aggregate an explicit alias.
SP_agg = sp.groupBy(
    sp.id.alias('identity')
    ).agg(
        sum("col1").alias("Annual_col1"),
        sum("col2").alias("Annual_col2"),
        sum("col3").alias("Annual_col3"),
        mean("col1").alias("mean_col1"),
        mean("col2").alias("mean_col2"),
        mean("col3").alias("mean_col3"),
        min("col1").alias("min_col1"),
        min("col2").alias("min_col2"),
        min("col3").alias("min_col3"),
        max("col1").alias("max_col1"),
        max("col2").alias("max_col2"),
        max("col3").alias("max_col3")
        )
SP_agg.show(10)
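
A note on the design choice above: giving every aggregate an explicit alias keeps the result columns readable (mean_col1 rather than the default avg(col1)). The sum/"Annual_" columns are extra relative to the question and can be dropped if only the mean, max, and min are needed.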
