Dynamically select multiple columns while joining different Dataframes in Scala Spark
Perform multiple aggregations on different columns in the same dataframe with alias, Spark Scala
This question is based on Sumit's answer at the link above.
Here are the details:
val Claim1 = StructType(Seq(
  StructField("pid", StringType, true),
  StructField("diag1", StringType, true),
  StructField("diag2", StringType, true),
  StructField("allowed", IntegerType, true),
  StructField("allowed1", IntegerType, true)))

val claimsData1 = Seq(
  ("PID1", "diag1", "diag2", 100, 200),
  ("PID1", "diag2", "diag3", 300, 600),
  ("PID1", "diag1", "diag5", 340, 680),
  ("PID2", "diag3", "diag4", 245, 490),
  ("PID2", "diag2", "diag1", 124, 248))

val claimRDD1 = sc.parallelize(claimsData1)
val claimRDDRow1 = claimRDD1.map(p => Row(p._1, p._2, p._3, p._4, p._5))
val claimRDD2DF1 = sqlContext.createDataFrame(claimRDDRow1, Claim1)

val exprs = Map("allowed" -> "sum", "allowed1" -> "avg")
claimRDD2DF1.groupBy("pid").agg(exprs).show(false)
However, this form does not let me alias the new columns. I have a dataframe on which I need to perform multiple aggregations over a set of columns (it could be sum, avg, min, or max over several groups of columns). Is there a way to solve the aliasing problem above, or a better approach to achieve this?
Thanks in advance.
Your code works with only a small modification. The trick is to call callUDF, which takes the aggregate function name as a string and allows aliasing:
val exprs = Map("allowed" -> "sum", "allowed1" -> "avg")
val aggExpr = exprs.map { case (k, v) => callUDF(v, col(k)).as(k) }.toList
claimRDD2DF1.groupBy("pid").agg(aggExpr.head, aggExpr.tail: _*).show()
Alternatively, if you can specify the aggregations as function objects, you do not need callUDF:
val aggExpr = Seq(
  ("allowed", sum(_: Column)),
  ("allowed1", avg(_: Column))
).map { case (k, v) => v(col(k)).as(k) }

claimRDD2DF1.groupBy("pid").agg(aggExpr.head, aggExpr.tail: _*).show()
Both versions give:
+----+-------+-----------------+
| pid|allowed| allowed1|
+----+-------+-----------------+
|PID1| 740|493.3333333333333|
|PID2| 369| 369.0|
+----+-------+-----------------+
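Since the question also asks for min and max over several groups of columns, the function-object approach above extends naturally by pairing each function with the columns it applies to. One possible generalization (the column groupings and the alias suffixes are illustrative assumptions):

```scala
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, sum, avg, min, max}

// Pair each aggregation function with the columns it should apply to,
// plus a suffix used to alias the resulting columns
val aggSpec: Seq[(Column => Column, Seq[String], String)] = Seq(
  (sum(_: Column), Seq("allowed", "allowed1"), "sum"),
  (avg(_: Column), Seq("allowed"),             "avg"),
  (min(_: Column), Seq("allowed1"),            "min"),
  (max(_: Column), Seq("allowed1"),            "max"))

// Flatten into one list of aliased aggregation expressions
val aggExprs = aggSpec.flatMap { case (f, cols, suffix) =>
  cols.map(c => f(col(c)).as(s"${c}_$suffix"))
}

claimRDD2DF1.groupBy("pid").agg(aggExprs.head, aggExprs.tail: _*).show(false)
```

The same spec structure works with callUDF by replacing the function objects with function-name strings.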
You can define a list of agg functions with alias as shown below and use them:
import org.apache.spark.sql.functions._
// You need to know the list of columns for each aggregation function
val colsToSum = claimRDD2DF1.columns.filter(_.startsWith("a"))
val colsToAvg = List("allowed", "allowed1")
// Define each function with its alias over its list of columns
val sumList = colsToSum.map(name => sum(name).as(name + "_sum"))
val avgList = colsToAvg.map(name => avg(name).as(name + "_avg"))
// Combine into a final list of aggregation expressions
val exp = sumList ++ avgList
// Apply all of them in a single groupBy
claimRDD2DF1.groupBy("pid").agg(exp.head, exp.tail: _*).show(false)
This gives you:
+----+-----------+------------+------------------+-----------------+
|pid |allowed_sum|allowed1_sum|allowed_avg |allowed1_avg |
+----+-----------+------------+------------------+-----------------+
|PID1|740 |1480 |246.66666666666666|493.3333333333333|
|PID2|369 |738 |184.5 |369.0 |
+----+-----------+------------+------------------+-----------------+
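As a footnote, the original Map-based agg(exprs) form from the question can also be kept as-is, with the generated column names (which come out like sum(allowed) and avg(allowed1)) renamed afterwards. A minimal sketch, assuming the output column order is pid followed by the aggregates in map order:

```scala
// Rename the auto-generated aggregate columns in one pass with toDF;
// the new names must match the output schema's column order
val aggregated = claimRDD2DF1.groupBy("pid").agg(exprs)
val renamed = aggregated.toDF("pid", "allowed_sum", "allowed1_avg")
renamed.show(false)
```

This is more fragile than aliasing at aggregation time, since it depends on the output column order, so the approaches in the answers above are generally preferable.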