
Perform multiple aggregations on different columns in same dataframe with alias Spark Scala

This question builds on Sumit's answer to the following question:

Spark SQL: apply aggregate functions to a list of columns

Here are the details:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val Claim1 = StructType(Seq(
  StructField("pid", StringType, true),
  StructField("diag1", StringType, true),
  StructField("diag2", StringType, true),
  StructField("allowed", IntegerType, true),
  StructField("allowed1", IntegerType, true)))

val claimsData1 = Seq(
  ("PID1", "diag1", "diag2", 100, 200),
  ("PID1", "diag2", "diag3", 300, 600),
  ("PID1", "diag1", "diag5", 340, 680),
  ("PID2", "diag3", "diag4", 245, 490),
  ("PID2", "diag2", "diag1", 124, 248))

val claimRDD1 = sc.parallelize(claimsData1)
val claimRDDRow1 = claimRDD1.map(p => Row(p._1, p._2, p._3, p._4, p._5))
val claimRDD2DF1 = sqlContext.createDataFrame(claimRDDRow1, Claim1)
val exprs = Map("allowed" -> "sum", "allowed1" -> "avg")
claimRDD2DF1.groupBy("pid").agg(exprs).show(false)

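But this does not let me alias the new columns. I have a DataFrame on which I need to run several aggregations over sets of columns; these could be sum, avg, min, or max over multiple sets of columns. Is there a way to solve the problem above, or a better way to achieve this?

Thanks in advance.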

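Your code works with only a small modification. The trick is to call callUDF, which takes the aggregate function's name as a string and returns a column that can be aliased: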

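import org.apache.spark.sql.functions.{callUDF, col}

val exprs = Map("allowed" -> "sum", "allowed1" -> "avg")

// Resolve each aggregate by name and alias the result with the source column name
val aggExpr = exprs.map { case (k, v) => callUDF(v, col(k)).as(k) }.toList

claimRDD2DF1.groupBy("pid").agg(aggExpr.head, aggExpr.tail: _*)
  .show()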

Alternatively, if you can specify the aggregations as function objects, you do not need callUDF:

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{avg, col, sum}

val aggExpr = Seq(
  ("allowed", sum(_: Column)),
  ("allowed1", avg(_: Column))
).map { case (k, v) => v(col(k)).as(k) }

claimRDD2DF1.groupBy("pid").agg(aggExpr.head, aggExpr.tail: _*)
  .show()

Both versions give:

+----+-------+-----------------+
| pid|allowed|         allowed1|
+----+-------+-----------------+
|PID1|    740|493.3333333333333|
|PID2|    369|            369.0|
+----+-------+-----------------+
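
Both versions above reuse the input column name as the alias, so they cannot express two aggregates on the same column (a Map key is unique, and the aliases would collide). Below is a minimal sketch of one way around that, assuming Spark 2.x or later with a local SparkSession. It uses a Seq of (column, function) pairs and functions.expr instead of callUDF, which I believe is deprecated in recent Spark releases. The names claims, spec and result, and the column_function alias scheme, are my own choices for illustration and not from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session; in spark-shell the existing `spark` could be used instead.
val spark = SparkSession.builder().master("local[*]").appName("aliased-agg").getOrCreate()
import spark.implicits._

// Same rows as claimsData1 in the question.
val claims = Seq(
  ("PID1", "diag1", "diag2", 100, 200),
  ("PID1", "diag2", "diag3", 300, 600),
  ("PID1", "diag1", "diag5", 340, 680),
  ("PID2", "diag3", "diag4", 245, 490),
  ("PID2", "diag2", "diag1", 124, 248)
).toDF("pid", "diag1", "diag2", "allowed", "allowed1")

// (column, aggregate) pairs; a Seq lets the same column appear more than once.
val spec = Seq(
  "allowed"  -> "sum",
  "allowed"  -> "avg",
  "allowed1" -> "min",
  "allowed1" -> "max")

// Turn each pair into e.g. sum(allowed) AS allowed_sum.
// Caveat: expr parses SQL text, so unusual column names would need backticks.
val aggExprs = spec.map { case (c, f) => expr(s"$f($c)").as(s"${c}_$f") }

val result = claims.groupBy("pid").agg(aggExprs.head, aggExprs.tail: _*).orderBy("pid")
result.show(false)
```

The resulting columns should be pid, allowed_sum, allowed_avg, allowed1_min and allowed1_max, in the order given in spec.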

You can define a list of aggregate functions with aliases, as shown below, and then use them:

import org.apache.spark.sql.functions._

// You need to know which columns each aggregate function applies to
val colsToSum = claimRDD2DF1.columns.filter(_.startsWith("a"))
val colsToAvg = List("allowed", "allowed1")

// Define each aggregate with its alias for the listed columns
val sumList = colsToSum.map(name => sum(name).as(name + "_sum"))
val avgList = colsToAvg.map(name => avg(name).as(name + "_avg"))

// Combine into one list of aggregate expressions
val exp = sumList ++ avgList

// Apply all of them in a single groupBy
claimRDD2DF1.groupBy("pid").agg(exp.head, exp.tail: _*).show(false)

This will give you:

+----+-----------+------------+------------------+-----------------+
|pid |allowed_sum|allowed1_sum|allowed_avg       |allowed1_avg     |
+----+-----------+------------+------------------+-----------------+
|PID1|740        |1480        |246.66666666666666|493.3333333333333|
|PID2|369        |738         |184.5             |369.0            |
+----+-----------+------------+------------------+-----------------+
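
As a sanity check on the figures in the table above, the same sums and averages can be recomputed with plain Scala collections, with no Spark involved. This is only a sketch for cross-checking the arithmetic; rows and summary are names I made up for the illustration.

```scala
// Same rows as claimsData1, reduced to (pid, allowed, allowed1).
val rows = Seq(
  ("PID1", 100, 200), ("PID1", 300, 600), ("PID1", 340, 680),
  ("PID2", 245, 490), ("PID2", 124, 248))

// Group by pid, then compute the sum and mean of each numeric column.
// Tuple layout: (allowed_sum, allowed1_sum, allowed_avg, allowed1_avg)
val summary = rows.groupBy(_._1).map { case (pid, rs) =>
  val allowed  = rs.map(_._2)
  val allowed1 = rs.map(_._3)
  pid -> (allowed.sum, allowed1.sum,
          allowed.sum.toDouble / rs.size, allowed1.sum.toDouble / rs.size)
}

summary.toSeq.sortBy(_._1).foreach(println)
```

For PID1 this works out to 100 + 300 + 340 = 740 and 200 + 600 + 680 = 1480, with averages 740 / 3 and 1480 / 3, which agrees with the Spark output.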

