
Split Spark dataframe and calculate average based on one column value

I have two dataframes. The first, classRecord, has 10 entries like the following:

Class, Calculation
first, Average
Second, Sum
Third, Average

The second dataframe, studentRecord, has around 50K entries like the following:

Name, height, Camp, Class
Shae, 152, yellow, first
Joe, 140, yellow, first
Mike, 149, white, first
Anne, 142, red, first
Tim, 154, red, Second
Jake, 153, white, Second
Sherley, 153, white, Second

From the second dataframe, based on the class type, I would like to perform a calculation on height (average for class first, sum for class Second, etc.), grouped by camp separately (if the class is first, the average of yellow, of white, and so on, each separately). I tried the following code:

// Function to calculate the per-name average height
def averageOnName(splitFrame: org.apache.spark.sql.DataFrame): Array[(String, Double)] = {
  val pairedRDD: org.apache.spark.rdd.RDD[(String, Double)] =
    splitFrame.select($"Name", $"height".cast("double")).as[(String, Double)].rdd
  pairedRDD
    .mapValues(x => (x, 1))
    .reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2))
    .mapValues(y => y._1 / y._2)
    .collect
}

// Required schema for further modifications
val schema = StructType(
  StructField("name", StringType, false) ::
  StructField("avg", DoubleType, false) :: Nil)

// For-each loop over each class type (collect to the driver first, since
// sc and sqlContext cannot be used inside an executor-side foreach)
classRecord.collect.foreach { row =>
  // Filter students based on camps
  val campYellow = studentRecord.filter($"Camp" === "yellow")
  val campWhite  = studentRecord.filter($"Camp" === "white")
  val campRed    = studentRecord.filter($"Camp" === "red")

  // Since the calculation for class first is average, showing only the calculation for class first
  val avgcampYellow = averageOnName(campYellow)
  val avgcampWhite  = averageOnName(campWhite)
  val avgcampRed    = averageOnName(campRed)

  // Convert each result back to a dataframe, then union them all
  val rddYellow = sc.parallelize(avgcampYellow).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  val dfYellow = sqlContext.createDataFrame(rddYellow, schema)
  val rddWhite = sc.parallelize(avgcampWhite).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  val dfWhite = sqlContext.createDataFrame(rddWhite, schema)
  val dfYellWhite = dfYellow.union(dfWhite)
  val rddRed = sc.parallelize(avgcampRed).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  val dfRed = sqlContext.createDataFrame(rddRed, schema)
  val dfYellWhiteRed = dfYellWhite.union(dfRed)
  // Other modifications and final result to Hive
}

Here I am struggling with:

  1. Hardcoding yellow, red and white; there may be additional camp types as well.
  2. The dataframe is currently filtered many times, which could be improved.
  3. I'm not able to figure out how to calculate differently according to the class calculation type (i.e. use sum/average depending on the class type).

Any help is appreciated.

You could simply do the average and sum calculations for all combinations of Class/Camp, then parse the classRecord dataframe separately and extract what you need. You can do this easily in Spark with the groupBy() method, aggregating the values.

Using your example dataframe:

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

studentRecord.show()

+-------+------+------+------+
|   Name|height|  Camp| Class|
+-------+------+------+------+
|   Shae|   152|yellow| first|
|    Joe|   140|yellow| first|
|   Mike|   149| white| first|
|   Anne|   142|   red| first|
|    Tim|   154|   red|Second|
|   Jake|   153| white|Second|
|Sherley|   153| white|Second|
+-------+------+------+------+

val df = studentRecord.groupBy("Class", "Camp")
  .agg(
    sum($"height").as("Sum"), 
    avg($"height").as("Average"), 
    collect_list($"Name").as("Names")
  )
df.show()

+------+------+---+-------+---------------+
| Class|  Camp|Sum|Average|          Names|
+------+------+---+-------+---------------+
| first| white|149|  149.0|         [Mike]|
| first|   red|142|  142.0|         [Anne]|
|Second|   red|154|  154.0|          [Tim]|
|Second| white|306|  153.0|[Jake, Sherley]|
| first|yellow|292|  146.0|    [Shae, Joe]|
+------+------+---+-------+---------------+

After doing this, you can simply check your first classRecord dataframe for which rows you need. An example of what it can look like, which you can adapt to your actual needs:

// Collects the dataframe as an Array[(String, String)]
val classRecs = classRecord.collect().map{case Row(clas: String, calc: String) => (clas, calc)}

for (classRec <- classRecs){
  val clas = classRec._1
  val calc = classRec._2

  // Matches which calculation you want to do
  val df2 = calc match {
    case "Average" => df.filter($"Class" === clas).select("Class", "Camp", "Average")
    case "Sum" => df.filter($"Class" === clas).select("Class", "Camp", "Sum")
  }

// Do something with df2
}
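As an alternative that also avoids the hardcoded calculation types in the match (points 1 and 3 of the question), you could join classRecord onto the aggregated results and pick the matching column with when/otherwise, so everything stays inside one query plan with no collect. A minimal self-contained sketch under assumed names: the object name CalcByClassType and the "Value" output column are made up for illustration, the data is the toy data from the question, and a local SparkSession stands in for your environment:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, sum, when}

object CalcByClassType {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.master("local[*]").appName("calc-by-class").getOrCreate()
    import spark.implicits._

    val classRecord = Seq(("first", "Average"), ("Second", "Sum"))
      .toDF("Class", "Calculation")
    val studentRecord = Seq(
      ("Shae", 152, "yellow", "first"), ("Joe", 140, "yellow", "first"),
      ("Mike", 149, "white", "first"), ("Anne", 142, "red", "first"),
      ("Tim", 154, "red", "Second"), ("Jake", 153, "white", "Second"),
      ("Sherley", 153, "white", "Second")
    ).toDF("Name", "height", "Camp", "Class")

    // Aggregate both ways per Class/Camp, join in each class's calculation
    // type, and keep whichever column that type asks for.
    val result = studentRecord
      .groupBy("Class", "Camp")
      .agg(sum($"height").as("Sum"), avg($"height").as("Average"))
      .join(classRecord, Seq("Class"))
      .withColumn("Value",
        when($"Calculation" === "Average", $"Average").otherwise($"Sum"))
      .select("Class", "Camp", "Value")

    result.show()
    spark.stop()
  }
}
```

Because the choice between sum and average is made column-wise inside the query, new camp values need no code changes; a Calculation value other than "Average" would fall through to the Sum branch, so add further when clauses if you have more calculation types.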

Hope it helps!
