I was wondering if there is some way to specify a custom aggregation function for Spark DataFrames. If I have a table with two columns, id and value, I would like to groupBy id and aggregate the values into a list for each id, like so:
from:
john | tomato
john | carrot
bill | apple
john | banana
bill | taco
to:
john | tomato, carrot, banana
bill | apple, taco
Is this possible with DataFrames? I am asking about DataFrames because I am reading the data from an ORC file and it is loaded as a DataFrame. I would think it is inefficient to convert it to an RDD.
I'd simply go with the following:
import org.apache.spark.sql.functions.collect_list

val df = Seq(("john", "tomato"), ("john", "carrot"),
             ("bill", "apple"), ("john", "banana"),
             ("bill", "taco")).toDF("id", "value")
// df: org.apache.spark.sql.DataFrame = [id: string, value: string]

val aggDf = df.groupBy($"id").agg(collect_list($"value").as("values"))
// aggDf: org.apache.spark.sql.DataFrame = [id: string, values: array<string>]
aggDf.show(false)
// +----+------------------------+
// |id |values |
// +----+------------------------+
// |john|[tomato, carrot, banana]|
// |bill|[apple, taco] |
// +----+------------------------+
You won't even need to call the underlying rdd.
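If you want the comma-separated strings shown in the question rather than an array column, you can wrap collect_list in concat_ws — a minimal sketch, assuming the same df as above (joinedDf is just a name I picked):

```scala
import org.apache.spark.sql.functions.{collect_list, concat_ws}

// Join each group's collected values into a single comma-separated string.
// Assumes the same `df` with columns "id" and "value" built above.
val joinedDf = df.groupBy($"id")
  .agg(concat_ws(", ", collect_list($"value")).as("values"))

joinedDf.show(false)
// e.g. john -> "tomato, carrot, banana" (within-group order is not guaranteed)
```

Note that collect_list makes no ordering promises, so the elements may come back in any order once the data spans multiple partitions.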
Reverting to RDD operations tends to work best for problems like this:
scala> import org.apache.spark.sql.Row
import org.apache.spark.sql.Row

scala> val df = sc.parallelize(Seq(("john", "tomato"),
           ("john", "carrot"), ("bill", "apple"),
           ("john", "banana"), ("bill", "taco")))
           .toDF("name", "food")
df: org.apache.spark.sql.DataFrame = [name: string, food: string]

scala> df.show
+----+------+
|name|  food|
+----+------+
|john|tomato|
|john|carrot|
|bill| apple|
|john|banana|
|bill|  taco|
+----+------+

scala> val aggregated = df.rdd
         .map{ case Row(k: String, v: String) => (k, List(v)) }
         .reduceByKey{_ ++ _}
         .toDF("name", "foods")
aggregated: org.apache.spark.sql.DataFrame = [name: string, foods: array<string>]

scala> aggregated.collect.foreach{println}
[john,WrappedArray(tomato, carrot, banana)]
[bill,WrappedArray(apple, taco)]
As for efficiency, I believe DataFrames are RDDs under the hood, so a conversion like .rdd has very little cost.
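Since the question literally asks about custom aggregation functions: when the built-in functions aren't enough, the typed Dataset API offers Aggregator. Below is a sketch under the assumption of Spark 2.0+; the object name CollectValues is mine, and the kryo encoders are one simple (if not the most efficient) way to encode the list buffer:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

// Custom aggregator: fold every value of a group into a List[String].
// Type parameters: IN = an (id, value) row, BUF = the running list, OUT = the final list.
object CollectValues extends Aggregator[(String, String), List[String], List[String]] {
  def zero: List[String] = Nil
  def reduce(buf: List[String], in: (String, String)): List[String] = in._2 :: buf
  def merge(b1: List[String], b2: List[String]): List[String] = b1 ++ b2
  def finish(buf: List[String]): List[String] = buf.reverse
  def bufferEncoder: Encoder[List[String]] = Encoders.kryo[List[String]]
  def outputEncoder: Encoder[List[String]] = Encoders.kryo[List[String]]
}

// Usage with the df ("id", "value") from the first answer:
val byId = df.as[(String, String)]
  .groupByKey(_._1)
  .agg(CollectValues.toColumn.name("values"))
```

For this particular problem collect_list is the simpler choice; an Aggregator pays off once the per-group logic is genuinely custom (running state, conditional merging, and so on).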