Custom aggregations for Spark dataframes

I was wondering if there is some way to specify a custom aggregation function for Spark DataFrames. If I have a table with two columns, id and value, I would like to groupBy id and aggregate the values into a list per id, like so:

from:

john | tomato
john | carrot
bill | apple
john | banana
bill | taco

to:

john | tomato, carrot, banana
bill | apple, taco

Is this possible with DataFrames? I am asking about DataFrames because I am reading the data from an ORC file and it is loaded as a DataFrame, and I would think it is inefficient to convert it to an RDD.
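For reference, the load looks roughly like this (assuming a Spark 2.x SparkSession named spark; the path is just a placeholder):

val df = spark.read.orc("/path/to/data.orc")
// df: org.apache.spark.sql.DataFrame = [id: string, value: string]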

I'd simply go with the following:

import org.apache.spark.sql.functions.collect_list
import spark.implicits._ // for .toDF and the $"..." column syntax (already in scope in spark-shell)
val df = Seq(("john", "tomato"), ("john", "carrot"), 
             ("bill", "apple"), ("john", "banana"), 
             ("bill", "taco")).toDF("id", "value")
// df: org.apache.spark.sql.DataFrame = [id: string, value: string]

val aggDf = df.groupBy($"id").agg(collect_list($"value").as("values"))
// aggDf: org.apache.spark.sql.DataFrame = [id: string, values: array<string>]

aggDf.show(false)
// +----+------------------------+
// |id  |values                  |
// +----+------------------------+
// |john|[tomato, carrot, banana]|
// |bill|[apple, taco]           |
// +----+------------------------+

You won't even need to touch the underlying RDD.
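If you want the comma-separated string from the question rather than an array column, concat_ws (also in org.apache.spark.sql.functions) can join it; a minimal sketch on top of aggDf:

import org.apache.spark.sql.functions.concat_ws

aggDf.select($"id", concat_ws(", ", $"values").as("values")).show(false)
// +----+----------------------+
// |id  |values                |
// +----+----------------------+
// |john|tomato, carrot, banana|
// |bill|apple, taco           |
// +----+----------------------+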

Reverting to RDD operations tends to work best for problems like this:

scala> val df = sc.parallelize(Seq(("john", "tomato"),
           ("john", "carrot"), ("bill", "apple"), 
           ("john", "bannana"), ("bill", "taco")))
           .toDF("name", "food")
df: org.apache.spark.sql.DataFrame = [name: string, food: string]

scala> df.show
+----+------+
|name|  food|
+----+------+
|john|tomato|
|john|carrot|
|bill| apple|
|john|banana|
|bill|  taco|
+----+------+

scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val aggregated = df.rdd
           .map{ case Row(k: String, v: String) => (k, List(v)) }
           .reduceByKey{_ ++ _}
           .toDF("name", "foods")
aggregated: org.apache.spark.sql.DataFrame = [name: string, foods: array<string>]

scala> aggregated.collect.foreach{println}
[john,WrappedArray(tomato, carrot, banana)]
[bill,WrappedArray(apple, taco)]

As for efficiency, DataFrames are backed by RDDs under the hood, so calling .rdd itself costs very little; the real trade-off is that you leave the Catalyst optimizer and have to deserialize the internal rows yourself.
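One caveat about the reduceByKey{_ ++ _} step: concatenating immutable lists over and over gets expensive for large groups. If you do stay at the RDD level, aggregateByKey can build each list incrementally instead; a sketch against the same df (element order within a group is not guaranteed either way, and aggregated2 is just an illustrative name):

scala> val aggregated2 = df.rdd
           .map{ case Row(k: String, v: String) => (k, v) }
           .aggregateByKey(List.empty[String])(
             (acc, v) => v :: acc, // add one value within a partition
             (l, r) => l ::: r)    // merge partial lists across partitions
           .toDF("name", "foods")
aggregated2: org.apache.spark.sql.DataFrame = [name: string, foods: array<string>]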
