
Speed up spark dataframe groupBy

I am fairly inexperienced in Spark, and need help with groupBy and aggregate functions on a dataframe. Consider the following dataframe:

val df = (Seq((1, "a", "1"),
              (1,"b", "3"),
              (1,"c", "6"),
              (2, "a", "9"),
              (2,"c", "10"),
              (1,"b","8" ),
              (2, "c", "3"),
              (3,"r", "19")).toDF("col1", "col2", "col3"))

df.show()
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   1|   a|   1|
|   1|   b|   3|
|   1|   c|   6|
|   2|   a|   9|
|   2|   c|  10|
|   1|   b|   8|
|   2|   c|   3|
|   3|   r|  19|
+----+----+----+

I need to group by col1 and by col2 separately and calculate the mean of col3 for each, which I can do using:

val col1df = df.groupBy("col1").agg(round(mean("col3"),2).alias("mean_col1"))
val col2df = df.groupBy("col2").agg(round(mean("col3"),2).alias("mean_col2"))

However, on a large dataframe with a few million rows and tens of thousands of unique values in the columns to group by, it takes a very long time. Besides, I have many more columns to group by, and the total time becomes insanely long, which I am looking to reduce. Is there a better way to do the groupBy followed by the aggregation?

You could use ideas from Multiple Aggregations; it might do everything in one shuffle operation, which is the most expensive operation.

Example:

val df = (Seq((1, "a", "1"),
              (1,"b", "3"),
              (1,"c", "6"),
              (2, "a", "9"),
              (2,"c", "10"),
              (1,"b","8" ),
              (2, "c", "3"),
              (3,"r", "19")).toDF("col1", "col2", "col3"))

df.createOrReplaceTempView("data")

val grpRes = spark.sql("""select grouping_id() as gid, col1, col2, round(mean(col3), 2) as res 
                          from data group by col1, col2 grouping sets ((col1), (col2)) """)

grpRes.show(100, false)

Output:

+---+----+----+----+
|gid|col1|col2|res |
+---+----+----+----+
|1  |3   |null|19.0|
|2  |null|b   |5.5 |
|2  |null|c   |6.33|
|1  |1   |null|4.5 |
|2  |null|a   |5.0 |
|1  |2   |null|7.33|
|2  |null|r   |19.0|
+---+----+----+----+

gid is a bit funny to use, as it has some binary calculations underneath: grouping_id() is a bit mask whose bits are 1 for the grouping columns that were aggregated away (and therefore appear as null in the output). But if your grouping columns cannot themselves contain nulls, then you can use it to select the correct groups.
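For example, once you know which gid corresponds to which grouping set, you can split grpRes back into the two per-column results from the question. The following is a minimal sketch assuming the grpRes dataframe built above; the names col1Means and col2Means are just illustrative:

import org.apache.spark.sql.functions.col

// gid = 1: rows from the (col1) grouping set, where col2 was aggregated away
val col1Means = grpRes.filter(col("gid") === 1)
                      .select(col("col1"), col("res").alias("mean_col1"))

// gid = 2: rows from the (col2) grouping set, where col1 was aggregated away
val col2Means = grpRes.filter(col("gid") === 2)
                      .select(col("col2"), col("res").alias("mean_col2"))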

Execution Plan:

scala> grpRes.explain
== Physical Plan ==
*(2) HashAggregate(keys=[col1#111, col2#112, spark_grouping_id#108], functions=[avg(cast(col3#9 as double))])
+- Exchange hashpartitioning(col1#111, col2#112, spark_grouping_id#108, 200)
   +- *(1) HashAggregate(keys=[col1#111, col2#112, spark_grouping_id#108], functions=[partial_avg(cast(col3#9 as double))])
      +- *(1) Expand [List(col3#9, col1#109, null, 1), List(col3#9, null, col2#110, 2)], [col3#9, col1#111, col2#112, spark_grouping_id#108]
         +- LocalTableScan [col3#9, col1#109, col2#110]

As you can see, there is a single Exchange operation, the expensive shuffle.
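Since the question mentions many more columns to group by, the same pattern extends by adding one grouping set per column; the shuffle count stays at one, although the Expand step in the plan above emits one copy of each row per grouping set. A rough sketch, where any columns beyond col1 and col2 are placeholder names rather than part of the original data:

// Placeholder list of grouping columns; replace with the real column names.
val groupCols = Seq("col1", "col2" /*, more columns here */)

val query = s"""
  |select grouping_id() as gid, ${groupCols.mkString(", ")},
  |       round(mean(col3), 2) as res
  |from data
  |group by ${groupCols.mkString(", ")}
  |grouping sets (${groupCols.map(c => s"($c)").mkString(", ")})
  |""".stripMargin

val allMeans = spark.sql(query)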
