
Aggregating arrays element-wise

Pretty new to Spark/Scala. I am wondering if there is an easy way to aggregate an Array[Double] in a column-wise fashion. Here is an example:

c1   c2   c3
-------------------------
1     1   [1.0, 1.0, 3.4]
1     2   [1.0, 0.0, 4.3]
2     1   [0.0, 0.0, 0.0]
2     3   [1.2, 1.1, 1.1]

Then, upon aggregation, I would end up with a table that looks like:

c1   c3prime
-------------
1     [2.0, 1.0, 7.7]
2     [1.2, 1.1, 1.1]

Looking at UDAF now, but was wondering if I need to code at all?

Thanks for your consideration.

Assuming the array values of c3 are all of the same size, you can sum the column element-wise by means of a UDF like the one below:

val df = Seq(
  (1, 1, Seq(1.0, 1.0, 3.4)),
  (1, 2, Seq(1.0, 0.0, 4.3)),
  (2, 1, Seq(0.0, 0.0, 0.0)),
  (2, 3, Seq(1.2, 1.1, 1.1))
).toDF("c1", "c2", "c3")

// UDF that sums a list of equal-length arrays element-wise
def elementSum = udf(
  (a: Seq[Seq[Double]]) => {
    // start from a zero vector the same length as the first array
    val zeroSeq = Seq.fill[Double](a(0).size)(0.0)
    // fold the arrays together, adding them position by position
    a.foldLeft(zeroSeq)(
      (acc, x) => (acc zip x).map{ case (u, v) => u + v }
    )
  }
)

val df2 = df.groupBy("c1").agg(
  elementSum(collect_list("c3")).as("c3prime")
)

df2.show(truncate=false)
// +---+-----------------------------+
// |c1 |c3prime                      |
// +---+-----------------------------+
// |1  |[2.0, 1.0, 7.699999999999999]|
// |2  |[1.2, 1.1, 1.1]              |
// +---+-----------------------------+
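
If you are on Spark 2.4 or later, a possible alternative to the Scala UDF (my own sketch, not part of the original answer) is to fold the collected arrays with the built-in higher-order SQL functions aggregate, zip_with and slice via expr. This assumes every group is non-empty and all arrays have the same length:

import org.apache.spark.sql.functions._

val df4 = df
  .groupBy("c1")
  .agg(collect_list($"c3").as("arrs"))
  .select(
    $"c1",
    // seed with the first array, then add the remaining arrays element-wise
    expr("""
      aggregate(
        slice(arrs, 2, size(arrs)),
        arrs[0],
        (acc, x) -> zip_with(acc, x, (u, v) -> u + v)
      )
    """).as("c3prime")
  )

df4.show(truncate=false)

The result should match the df2 output above.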

Here's one without a UDF. It utilizes Spark's Window functions. Not sure how efficient it is, since it involves multiple groupBys.

df.show

// +---+---+---------------+
// | c1| c2|             c3|
// +---+---+---------------+
// |  1|  1|[1.0, 1.0, 3.4]|
// |  1|  2|[1.0, 0.0, 4.3]|
// |  2|  1|[0.0, 0.0, 0.0]|
// |  2|  3|[1.2, 1.1, 1.1]|
// +---+---+---------------+

import org.apache.spark.sql.expressions.Window

val window = Window.partitionBy($"c1", $"c2").orderBy($"c1", $"c2")

df.withColumn("c3", explode($"c3") )                // one row per array element
  .withColumn("rn", row_number() over window)       // element position within each original row
  .groupBy($"c1", $"rn").agg(sum($"c3").as("c3") )  // sum per group and position
  .orderBy($"c1", $"rn")
  .groupBy($"c1")
  .agg(collect_list($"c3").as("c3prime") ).show

// +---+--------------------+
// | c1|             c3prime|
// +---+--------------------+
// |  1|[2.0, 1.0, 7.6999...|
// |  2|     [1.2, 1.1, 1.1]|
// +---+--------------------+
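
A small variant of the above (my own sketch, not from the original answer): posexplode exposes each element's position directly, so the Window/row_number step can be dropped. It relies on the same ordering assumption for collect_list as the original answer:

import org.apache.spark.sql.functions._

df.select($"c1", posexplode($"c3"))      // yields columns `pos` and `col`
  .groupBy($"c1", $"pos")
  .agg(sum($"col").as("c3"))             // sum per group and position
  .orderBy($"c1", $"pos")
  .groupBy($"c1")
  .agg(collect_list($"c3").as("c3prime"))
  .show(truncate=false)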

You can combine some inbuilt functions such as groupBy, agg, sum, array, and alias (as) to get the desired final dataframe.

import org.apache.spark.sql.functions._
df.groupBy("c1")
  .agg(sum($"c3"(0)).as("c3_1"), sum($"c3"(1)).as("c3_2"), sum($"c3"(2)).as("c3_3"))
  .select($"c1", array("c3_1","c3_2","c3_3").as("c3prime"))

I hope the answer is helpful.
