
Split Spark dataframe and calculate average based on one column value

I have two dataframes. The first dataframe, classRecord, has 10 different entries like the following:

Class, Calculation
first, Average
Second, Sum
Third, Average

The second dataframe, studentRecord, has around 50K entries like the following:

Name, height, Camp, Class
Shae, 152, yellow, first
Joe, 140, yellow, first
Mike, 149, white, first
Anne, 142, red, first
Tim, 154, red, Second
Jake, 153, white, Second
Sherley, 153, white, Second

From the second dataframe, based on the class type, I would like to perform the calculation on height (for class first: average, for class Second: sum, etc.), per camp separately (i.e. if the class is first, the average of yellow, white and so on, each separately). I tried the following code:

import org.apache.spark.rdd.RDD

// function to calculate the average height per name
def averageOnName(splitFrame: org.apache.spark.sql.DataFrame): Array[(String, Double)] = {
  val pairedRDD: RDD[(String, Double)] = splitFrame.select($"Name", $"height".cast("double")).as[(String, Double)].rdd
  // sum the heights and count the entries per key, then divide to get the average
  val avg_by_key = pairedRDD.mapValues(x => (x, 1)).reduceByKey((x, y) => (x._1 + y._1, x._2 + y._2)).mapValues(y => 1.0 * y._1 / y._2).collect
  avg_by_key
}

import org.apache.spark.sql.types.{StructType, StructField, StringType, DoubleType}

// required schema for further modifications
val schema = StructType(
  StructField("name", StringType, false) ::
  StructField("avg", DoubleType, false) :: Nil)

// for-each loop over each class type
classRecord.rdd.foreach { row =>
  // filter students based on camps
  var campYellow = studentRecord.filter($"Camp" === "yellow")
  var campWhite  = studentRecord.filter($"Camp" === "white")
  var campRed    = studentRecord.filter($"Camp" === "red")

  // since I know that the calculation for class first is average, I am showing the calculation only for class first
  val avgcampYellow = averageOnName(campYellow)
  val avgcampWhite  = averageOnName(campWhite)
  val avgcampRed    = averageOnName(campRed)

  // convert each result array back to an RDD of Rows, then to a dataframe
  val rddYellow = sc.parallelize(avgcampYellow).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  var dfYellow = sqlContext.createDataFrame(rddYellow, schema)
  val rddWhite = sc.parallelize(avgcampWhite).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  var dfWhite = sqlContext.createDataFrame(rddWhite, schema)
  // union of yellow and white camp data
  var dfYellWhite = dfYellow.union(dfWhite)
  val rddRed = sc.parallelize(avgcampRed).map(x => org.apache.spark.sql.Row(x._1, x._2.asInstanceOf[Number].doubleValue()))
  var dfRed = sqlContext.createDataFrame(rddRed, schema)
  // union of yellow, white and red camp data
  var dfYellWhiteRed = dfYellWhite.union(dfRed)
  // other modifications and write of the final result to Hive
}

Here I am struggling with:

  1. Hardcoding yellow, red and white, there may be additional camp types as well.
  2. The dataframe is currently being filtered many times which could be improved.
  3. I'm not able to figure out how to calculate differently according to the class's calculation type (i.e. use sum/average depending on the class type).

Any help is appreciated.

You could simply do both the average and the sum calculations for all combinations of Class/Camp, and then go through the classRecord dataframe separately to extract what you need. You can do this easily in Spark by using the groupBy() method and aggregating the values.

Using your example dataframe:

val spark = SparkSession.builder.getOrCreate()
import spark.implicits._

studentRecord.show()

+-------+------+------+------+
|   Name|height|  Camp| Class|
+-------+------+------+------+
|   Shae|   152|yellow| first|
|    Joe|   140|yellow| first|
|   Mike|   149| white| first|
|   Anne|   142|   red| first|
|    Tim|   154|   red|Second|
|   Jake|   153| white|Second|
|Sherley|   153| white|Second|
+-------+------+------+------+

val df = studentRecord.groupBy("Class", "Camp")
  .agg(
    sum($"height").as("Sum"), 
    avg($"height").as("Average"), 
    collect_list($"Name").as("Names")
  )
df.show()

+------+------+---+-------+---------------+
| Class|  Camp|Sum|Average|          Names|
+------+------+---+-------+---------------+
| first| white|149|  149.0|         [Mike]|
| first|   red|142|  142.0|         [Anne]|
|Second|   red|154|  154.0|          [Tim]|
|Second| white|306|  153.0|[Jake, Sherley]|
| first|yellow|292|  146.0|    [Shae, Joe]|
+------+------+---+-------+---------------+
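Note that this also takes care of point 1: groupBy picks up whatever camp values actually exist in the data, so nothing is hardcoded. If you ever need the list of camps explicitly, a small sketch (not part of the solution above) could be:

// Collect the distinct camp names from the data instead of hardcoding them
val camps = studentRecord.select("Camp").distinct().as[String].collect()
// e.g. Array(yellow, white, red) -- order is not guaranteed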

After doing this, you can simply check your classRecord dataframe for the rows you need. Here is an example of what it can look like; change it to fit your actual needs:

// Collects the dataframe as an Array[(String, String)]
val classRecs = classRecord.collect().map{case Row(clas: String, calc: String) => (clas, calc)}

for (classRec <- classRecs){
  val clas = classRec._1
  val calc = classRec._2

  // Matches which calculation you want to do
  val df2 = calc match {
    case "Average" => df.filter($"Class" === clas).select("Class", "Camp", "Average")
    case "Sum" => df.filter($"Class" === clas).select("Class", "Camp", "Sum")
  }

  // Do something with df2
}
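
As one possible way to fill in the // Do something with df2 part: the question mentions writing the final result to Hive, so a minimal sketch for inside the loop (with a hypothetical table name results_by_class) could be:

// Rename the selected metric column ("Average" or "Sum") to a common name
// so every per-class result has the same schema, then append to a Hive table.
// "results_by_class" is just an assumed example table name.
val result = df2.withColumnRenamed(calc, "Value")
result.write.mode("append").saveAsTable("results_by_class")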

Hope it helps!
