Split Spark dataframe and calculate average based on one column value
I have two dataframes. The first dataframe, classRecord, has 10 different entries like the following:
Class, Calculation
first, Average
Second, Sum
Third, Average
The second dataframe, studentRecord, has around 50k entries like the following:
Name, height, Camp, Class
Shae, 152, yellow, first
Joe, 140, yellow, first
Mike, 149, white, first
Anne, 142, red, first
Tim, 154, red, Second
Jake, 153, white, Second
Sherley, 153, white, Second
From the second dataframe, I want to apply each class's calculation (first class: average, second class: sum, and so on) to the heights, computed separately for each camp (e.g. if the class is first, the average for the yellow, white and red camps respectively), and so on. I tried the following code:
//function to calculate the average height per name
//(written for the spark-shell; needs import org.apache.spark.rdd.RDD and spark.implicits._ in scope)
def averageOnName(splitFrame: org.apache.spark.sql.DataFrame): Array[(String, Double)] = {
  val pairedRDD: RDD[(String, Double)] =
    splitFrame.select($"Name", $"height".cast("double")).as[(String, Double)].rdd
  pairedRDD.mapValues(x => (x, 1))
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
    .mapValues(y => 1.0 * y._1 / y._2)
    .collect()
}
//required schema for further modifications
val schema = StructType(
  StructField("name", StringType, false) ::
  StructField("avg", DoubleType, false) :: Nil)
// for-each loop over each class type (collected to the driver so sc/sqlContext can be used inside)
classRecord.rdd.collect.foreach { row =>
  //filter students based on camps
  var campYellow = studentRecord.filter($"Camp" === "yellow")
  var campWhite = studentRecord.filter($"Camp" === "white")
  var campRed = studentRecord.filter($"Camp" === "red")
  // since I know the calculation for the first class is average, showing the calculation only for class first
  val avgcampYellow = averageOnName(campYellow)
  val avgcampWhite = averageOnName(campWhite)
  val avgcampRed = averageOnName(campRed)
  //conversion of each result to an RDD of Rows, then to a dataframe
  val rddYellow = sc.parallelize(avgcampYellow).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  var dfYellow = sqlContext.createDataFrame(rddYellow, schema)
  val rddWhite = sc.parallelize(avgcampWhite).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  var dfWhite = sqlContext.createDataFrame(rddWhite, schema)
  //union of the yellow and white camp data
  var dfYellWhite = dfYellow.union(dfWhite)
  val rddRed = sc.parallelize(avgcampRed).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  var dfRed = sqlContext.createDataFrame(rddRed, schema)
  //union with the yellow and white camp data
  var dfYellWhiteRed = dfYellWhite.union(dfRed)
  // other modifications and final result to hive
}
This is where I am stuck. Any help is appreciated.
You can simply compute both the average and the sum for all combinations of Class/Camp first, and then parse your classRecord dataframe afterwards to extract what you need. This is easy to do with the groupBy() method, aggregating the values.
Using your example dataframe:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._
studentRecord.show()
+-------+------+------+------+
| Name|height| Camp| Class|
+-------+------+------+------+
| Shae| 152|yellow| first|
| Joe| 140|yellow| first|
| Mike| 149| white| first|
| Anne| 142| red| first|
| Tim| 154| red|Second|
| Jake| 153| white|Second|
|Sherley| 153| white|Second|
+-------+------+------+------+
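(For reference, in case you want to reproduce the example: the two dataframes used above could be built from the question's sample data. A minimal sketch, assuming spark.implicits._ is in scope as above:)

// Hypothetical construction of the question's sample dataframes via toDF
val classRecord = Seq(
  ("first", "Average"),
  ("Second", "Sum"),
  ("Third", "Average")
).toDF("Class", "Calculation")
val studentRecord = Seq(
  ("Shae", 152, "yellow", "first"),
  ("Joe", 140, "yellow", "first"),
  ("Mike", 149, "white", "first"),
  ("Anne", 142, "red", "first"),
  ("Tim", 154, "red", "Second"),
  ("Jake", 153, "white", "Second"),
  ("Sherley", 153, "white", "Second")
).toDF("Name", "height", "Camp", "Class")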
import org.apache.spark.sql.functions.{sum, avg, collect_list}

val df = studentRecord.groupBy("Class", "Camp")
  .agg(
    sum($"height").as("Sum"),
    avg($"height").as("Average"),
    collect_list($"Name").as("Names")
  )
df.show()
+------+------+---+-------+---------------+
| Class| Camp|Sum|Average| Names|
+------+------+---+-------+---------------+
| first| white|149| 149.0| [Mike]|
| first| red|142| 142.0| [Anne]|
|Second| red|154| 154.0| [Tim]|
|Second| white|306| 153.0|[Jake, Sherley]|
| first|yellow|292| 146.0| [Shae, Joe]|
+------+------+---+-------+---------------+
Once this is done, you can simply check your classRecord dataframe for which rows you need afterwards. An example of how it can look (which can be changed to fit your actual needs):
// Collects the classRecord dataframe to the driver as an Array[(String, String)]
import org.apache.spark.sql.Row
val classRecs = classRecord.collect().map { case Row(clas: String, calc: String) => (clas, calc) }

for (classRec <- classRecs) {
  val clas = classRec._1
  val calc = classRec._2
  // Matches which calculation you want to do (only Average and Sum appear in classRecord)
  val df2 = calc match {
    case "Average" => df.filter($"Class" === clas).select("Class", "Camp", "Average")
    case "Sum"     => df.filter($"Class" === clas).select("Class", "Camp", "Sum")
  }
  // Do something with df2
}
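As a side note, going beyond the original answer: if you would rather end up with a single dataframe than one df2 per class, a sketch of the same idea without the loop is to join the aggregated df with classRecord and pick the requested column with when() (this assumes the column names shown above, and that Calculation is always "Average" or "Sum"):

// Sketch: join on Class, then choose Sum or Average per the Calculation column
import org.apache.spark.sql.functions.when

val result = df.join(classRecord, Seq("Class"))
  .withColumn("Value", when($"Calculation" === "Average", $"Average").otherwise($"Sum"))
  .select("Class", "Camp", "Value")
result.show()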
Hope it helps!