I have 200 million rows in about 1,000 groups, looking like this:
Group X Y Z Q W
group1 0.054464866 0.002248819 0.299069804 0.763352879 0.395905106
group2 0.9986218 0.023649037 0.50762069 0.212225807 0.619571705
group1 0.839928517 0.290339179 0.050407454 0.75837838 0.495466007
group1 0.021003132 0.663366686 0.687928832 0.239132224 0.020848608
group1 0.393843426 0.006299292 0.141103438 0.858481036 0.715860852
group2 0.045960198 0.014858905 0.672267793 0.59750871 0.893646818
I want to run the same function (say linear regression of X on [X, Z, Q, W]) for each of the groups. I could have done Window.partition etc. but I have my own function. At the moment, I do the following:
df.select("Group").distinct.collect.toList.foreach { group =>
  // collect returns Row objects, so pull the group name out of the Row
  val dfGroup = df.filter(col("Group") === group.getString(0))
  dfGroup.withColumn("res", myUdf(col("X"), col("Y"), col("Z"), col("Q"), col("W")))
}
Is there a better way to do this?
You have at least two options, depending on whether you prefer the DataFrame API or the Dataset API. With a DataFrame, you can group and aggregate:
df
  .groupBy("group")
  .agg(myUdaf(col("col1"), col("col2")))
where myUdaf is a user-defined aggregate function (UDAF). An example of how to implement a UDAF can be found here: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html
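As a rough illustration of what such an aggregate can look like, here is a minimal sketch built on Spark's Aggregator class and applied through the typed groupByKey(...).agg(...) path. The case classes Obs and Sums, the object SlopeOfYOnX, and the choice of "slope of y on x" as the per-group function are all assumptions made for the example, not part of the question. On Spark 3.0 and later, an Aggregator can also be wrapped with functions.udaf for use on a DataFrame; that variant is not shown here.

```scala
import org.apache.spark.sql.{Encoder, Encoders, SparkSession}
import org.apache.spark.sql.expressions.Aggregator

// Hypothetical row type; field names mirror the columns in the question.
case class Obs(group: String, x: Double, y: Double, z: Double, q: Double, w: Double)

// Running sums needed for a least-squares slope of y on x.
case class Sums(n: Long, sx: Double, sy: Double, sxx: Double, sxy: Double)

// Stand-in for "my own function": slope of y on x within one group.
object SlopeOfYOnX extends Aggregator[Obs, Sums, Double] {
  def zero: Sums = Sums(0L, 0.0, 0.0, 0.0, 0.0)

  // Fold one row into the partial sums.
  def reduce(b: Sums, o: Obs): Sums =
    Sums(b.n + 1, b.sx + o.x, b.sy + o.y, b.sxx + o.x * o.x, b.sxy + o.x * o.y)

  // Combine partial sums computed on different partitions.
  def merge(a: Sums, b: Sums): Sums =
    Sums(a.n + b.n, a.sx + b.sx, a.sy + b.sy, a.sxx + b.sxx, a.sxy + b.sxy)

  def finish(s: Sums): Double = {
    val denom = s.n * s.sxx - s.sx * s.sx
    if (denom == 0.0) Double.NaN else (s.n * s.sxy - s.sx * s.sy) / denom
  }

  def bufferEncoder: Encoder[Sums] = Encoders.product[Sums]
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

object AggregatorExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("per-group-aggregator").getOrCreate()
    import spark.implicits._

    // Made-up sample rows, only to exercise the aggregator.
    val ds = Seq(
      Obs("group1", 1.0, 3.0, 0.0, 0.0, 0.0),
      Obs("group1", 2.0, 5.0, 0.0, 0.0, 0.0),
      Obs("group2", 1.0, 1.0, 0.0, 0.0, 0.0),
      Obs("group2", 3.0, 2.0, 0.0, 0.0, 0.0)
    ).toDS()

    // One output row per group: (group, slope).
    ds.groupByKey(_.group).agg(SlopeOfYOnX.toColumn.name("slope")).show()
    spark.stop()
  }
}
```

Because the work is split into reduce and merge, Spark can combine partial results per partition before shuffling, so a whole group never has to be held in memory at once.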
With a Dataset, you can use the groupByKey and mapGroups transformations from the Dataset API:
ds
  .groupByKey(_.group)
  .mapGroups { case (group, values) =>
    (group, aggregator(values))
  }
where aggregator is a Scala function responsible for aggregating a collection of objects (mapGroups passes each group's rows as an Iterator).
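To make the shape of such a function concrete, here is a minimal sketch in plain Scala, with no Spark dependency so it can be tried on its own. The Obs case class and the least-squares fit of y on x are assumptions chosen for illustration; the real aggregator would be whatever per-group function you already have. The main method uses an ordinary groupBy on a local Seq purely as a stand-in for ds.groupByKey(_.group).mapGroups { ... }.

```scala
// Hypothetical row type; field names mirror the columns in the question.
final case class Obs(group: String, x: Double, y: Double, z: Double, q: Double, w: Double)

object GroupAggregator {
  // Stand-in for `aggregator`: least-squares fit of y on x,
  // returned as (intercept, slope). The iterator is traversed exactly once.
  def aggregator(values: Iterator[Obs]): (Double, Double) = {
    var n = 0L
    var sx = 0.0
    var sy = 0.0
    var sxx = 0.0
    var sxy = 0.0
    values.foreach { o =>
      n += 1
      sx += o.x
      sy += o.y
      sxx += o.x * o.x
      sxy += o.x * o.y
    }
    val denom = n * sxx - sx * sx
    if (denom == 0.0) (Double.NaN, Double.NaN) // empty group or constant x
    else {
      val slope = (n * sxy - sx * sy) / denom
      ((sy - slope * sx) / n, slope)
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up sample rows, only to exercise the function.
    val rows = Seq(
      Obs("group1", 1.0, 3.0, 0.0, 0.0, 0.0),
      Obs("group1", 2.0, 5.0, 0.0, 0.0, 0.0),
      Obs("group2", 1.0, 1.0, 0.0, 0.0, 0.0),
      Obs("group2", 3.0, 2.0, 0.0, 0.0, 0.0)
    )
    // Local stand-in for ds.groupByKey(_.group).mapGroups { ... }
    rows.groupBy(_.group).foreach { case (group, values) =>
      println(s"$group -> ${aggregator(values.iterator)}")
    }
  }
}
```

Accumulating in a single pass matters because the Iterator handed to mapGroups can only be consumed once; a function that needs several passes has to copy the group into a collection first.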
If you don't need to aggregate, you can just map over values using the map transformation, for example:
values.map(v => fun(...))
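One way to turn that per-row mapping into a Dataset with one output row per input row is flatMapGroups, which flattens the iterator returned for each group. The sketch below assumes hypothetical Obs and Scored case classes and uses "deviation of y from the group mean" as a stand-in for fun; none of these names come from the question.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical input and output row types.
case class Obs(group: String, x: Double, y: Double)
case class Scored(group: String, x: Double, y: Double, res: Double)

object PerRowWithinGroup {
  // Stand-in for `fun`: deviation of y from the group's mean of y.
  // The iterator is materialised because the rows are needed twice:
  // once to compute the mean and once to emit the output rows.
  def scoreGroup(group: String, values: Iterator[Obs]): Iterator[Scored] = {
    val rows = values.toVector
    val meanY = rows.map(_.y).sum / rows.size
    rows.iterator.map(o => Scored(group, o.x, o.y, o.y - meanY))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("per-row-within-group").getOrCreate()
    import spark.implicits._

    // Made-up sample rows, only to exercise the function.
    val ds = Seq(
      Obs("group1", 1.0, 2.0),
      Obs("group1", 2.0, 4.0),
      Obs("group2", 5.0, 1.0)
    ).toDS()

    // Each group is processed once; every input row yields one output row.
    ds.groupByKey(_.group)
      .flatMapGroups { (group, values) => scoreGroup(group, values) }
      .show()
    spark.stop()
  }
}
```

Materialising a group with toVector means each group must fit in the memory of a single task; with roughly 200 million rows spread over about 1,000 groups that is plausible, but it is worth checking for skewed groups.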