
group by agg multiple columns with pyspark

I'm looking to do a groupBy and agg on the Spark DataFrame below and get the mean, max, and min of each of the col1, col2, and col3 columns.

sp = spark.createDataFrame([['a',2,4,5], ['a',4,7,7], ['b',6,0,9], ['b', 2, 4, 4], ['c', 4, 4, 9]], ['id', 'col1', 'col2','col3'])

+---+----+----+----+
| id|col1|col2|col3|
+---+----+----+----+
|  a|   2|   4|   5|
|  a|   4|   7|   7|
|  b|   6|   0|   9|
|  b|   2|   4|   4|
|  c|   4|   4|   9|
+---+----+----+----+

I've tried sp.groupBy('id').agg({'*':'max'}) to even just get the max of everything, but I'm running into an error.
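
As a rough sketch of the dict form that does work (reusing the sp DataFrame above): agg's dict argument maps each column name to a single aggregate function name, so it only supports one function per column, and '*' typically only makes sense with 'count'.

# Dict-style agg: one aggregate function name per column.
sp.groupBy('id').agg({'col1': 'max', 'col2': 'max', 'col3': 'max'}).show()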

I've also tried sp.groupBy('id').agg({'col1': ['max', 'min', 'mean']}), but that is more of a traditional Pandas idiom and it doesn't work here. What I'm after is the output below (a working alternative is sketched right after it).

id  max(col1)  max(col2)  max(col3)  min(col1) min(col2) min(col3) mean(col1) ..
a   4          7          7          2         4         5         3   
b   6          4          9          2         0         4         4  
c   4          4          9          4         4         9         4  
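
A minimal sketch of one way to get exactly that shape (the F alias and the exprs list are my own naming, not from the original post): build one Column expression per (statistic, column) pair and unpack the list into agg.

from pyspark.sql import functions as F

stats = ['mean', 'max', 'min']
cols = ['col1', 'col2', 'col3']

# One expression per (statistic, column) pair, aliased like "max(col1)".
exprs = [getattr(F, s)(c).alias(f'{s}({c})') for s in stats for c in cols]

sp.groupBy('id').agg(*exprs).show()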

Try this:

%%pyspark
from pyspark.sql.functions import sum, mean, min, max

# Group by id (exposed as "identity") and compute the sum, mean, min,
# and max of each column, giving every aggregate an explicit alias.
SP_agg = sp.groupBy(
    sp.id.alias('identity')
    ).agg(
        sum("col1").alias("Annual_col1"),
        sum("col2").alias("Annual_col2"),
        sum("col3").alias("Annual_col3"),
        mean("col1").alias("mean_col1"),
        mean("col2").alias("mean_col2"),
        mean("col3").alias("mean_col3"),
        min("col1").alias("min_col1"),
        min("col2").alias("min_col2"),
        min("col3").alias("min_col3"),
        max("col1").alias("max_col1"),
        max("col2").alias("max_col2"),
        max("col3").alias("max_col3")
        )
SP_agg.show(10)
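
A note on the design choice above: giving every aggregate an explicit alias keeps the result columns readable (mean_col1 rather than the default avg(col1)). The sum/"Annual_" columns are extra relative to the question and can be dropped if only the mean, max, and min are needed.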
