How to use variable arguments (: _*) in a udf in Scala / Spark?

Question

I have a DataFrame with a variable number of columns. Every column is of type Int, and I want to compute the sum of all columns for each row. I thought of using : _* for this. Here is my code:

    val arr = Array(1,4,3,2,5,7,3,5,4,18)
    val input=new ArrayBuffer[(Int,Int)]()
    for(i<-0 until 10){
      input.append((i,arr(i%10)))
    }

    var df=sc.parallelize(input,3).toDF("value1","value2")
    val cols=new ArrayBuffer[Column]()
    val colNames=df.columns
    for(name<-colNames){
      cols.append(col(name))
    }
    val func = udf((s: Int*) => s.sum)
    df.withColumn("sum",func(cols:_*)).show()

But I get an error:

Error:(101, 27) ')' expected but identifier found.
  val func = udf((s: Int*) => s.sum) 

How can I use : _* in a udf? My expected result is:

+------+------+---+
|value1|value2|sum|
+------+------+---+
|     0|     1|  1|
|     1|     4|  5|
|     2|     3|  5|
|     3|     2|  5|
|     4|     5|  9|
|     5|     7| 12|
|     6|     3|  9|
|     7|     5| 12|
|     8|     4| 12|
|     9|    18| 27|
+------+------+---+

Answer 1

Spark UDFs do not support variable-length arguments. Here is a solution to your problem that sums the columns directly, without a UDF.

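import org.apache.spark.sql.functions.col
import spark.implicits._

// zipWithIndex yields (value, index) pairs, hence the column order below
val input = Array(1,4,3,2,5,7,3,5,4,18).zipWithIndex

val df = spark.sparkContext.parallelize(input, 3).toDF("value2", "value1")

// Add up all columns with the Column + operator; no UDF is needed
df.withColumn("total", df.columns.map(col(_)).reduce(_ + _)).show()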

Output:

+------+------+-----+
|value2|value1|total|
+------+------+-----+
|     1|     0|    1|
|     4|     1|    5|
|     3|     2|    5|
|     2|     3|    5|
|     5|     4|    9|
|     7|     5|   12|
|     3|     6|    9|
|     5|     7|   12|
|     4|     8|   12|
|    18|     9|   27|
+------+------+-----+

Hope this helps.
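As background for the compile error in the question, here is a minimal plain-Scala sketch (no Spark involved). It illustrates that a repeated parameter such as Int* can be declared on a method but not on a function literal, which is why udf((s: Int*) => s.sum) is rejected by the parser. The names VarargsSketch, sumAll and sumSeq are made up for this illustration.

```scala
object VarargsSketch {
  // A method may declare a repeated (varargs) parameter.
  def sumAll(xs: Int*): Int = xs.sum

  // A function literal may not: writing (xs: Int*) => xs.sum does not parse,
  // so a function value takes a Seq[Int] instead.
  val sumSeq: Seq[Int] => Int = _.sum

  def main(args: Array[String]): Unit = {
    val values = Seq(1, 4, 3)
    println(sumAll(1, 4, 3))     // individual arguments
    println(sumAll(values: _*))  // : _* expands a Seq into the varargs slot
    println(sumSeq(values))      // the function value receives the Seq as-is
  }
}
```

Since udf takes a function value, the Seq-based form is the one that carries over to Spark.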

Answer 2

This may be what you are looking for:

val func = udf((s: Seq[Int]) => s.sum)
df.withColumn("sum", func(array(cols: _*))).show()

where array is org.apache.spark.sql.functions.array, which is documented as follows:

Creates a new array column. The input columns must all have the same data type.
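For readers who want to run this end to end, the following is a sketch of the same idea as a standalone program. It assumes Spark SQL is on the classpath and uses a local SparkSession; the object name ArraySumSketch, the helper withRowSum and the three sample rows are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, udf}

object ArraySumSketch {
  // The UDF receives one array column, which arrives as a Seq[Int].
  val sumAll = udf((xs: Seq[Int]) => xs.sum)

  // Pack every column of df into a single array column and sum it per row.
  // Assumes all columns are of type Int, as in the question.
  def withRowSum(df: DataFrame): DataFrame =
    df.withColumn("sum", sumAll(array(df.columns.map(col): _*)))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // local mode, assumed for this sketch
      .appName("array-sum-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((0, 1), (1, 4), (2, 3)).toDF("value1", "value2")
    withRowSum(df).show()

    spark.stop()
  }
}
```

Here : _* is applied at the call to array, which does accept a variable number of columns, so the UDF itself only ever sees a single argument.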

Answer 3

You can try VectorAssembler:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.functions.udf
import breeze.linalg.DenseVector

val assembler = new VectorAssembler().
  setInputCols(Array("your column name")).
  setOutputCol("allNum")

val assembledDF = assembler.transform(df)

assembledDF.show

+------+------+----------+                                                      
|value1|value2|    allNum|
+------+------+----------+
|     0|     1| [0.0,1.0]|
|     1|     4| [1.0,4.0]|
|     2|     3| [2.0,3.0]|
|     3|     2| [3.0,2.0]|
|     4|     5| [4.0,5.0]|
|     5|     7| [5.0,7.0]|
|     6|     3| [6.0,3.0]|
|     7|     5| [7.0,5.0]|
|     8|     4| [8.0,4.0]|
|     9|    18|[9.0,18.0]|
+------+------+----------+

def yourSumUDF = udf((allNum:Vector) => new DenseVector(allNum.toArray).sum)
assembledDF.withColumn("sum", yourSumUDF($"allNum")).show

+------+------+----------+----+                      
|value1|value2|    allNum| sum|
+------+------+----------+----+
|     0|     1| [0.0,1.0]| 1.0|
|     1|     4| [1.0,4.0]| 5.0|
|     2|     3| [2.0,3.0]| 5.0|
|     3|     2| [3.0,2.0]| 5.0|
|     4|     5| [4.0,5.0]| 9.0|
|     5|     7| [5.0,7.0]|12.0|
|     6|     3| [6.0,3.0]| 9.0|
|     7|     5| [7.0,5.0]|12.0|
|     8|     4| [8.0,4.0]|12.0|
|     9|    18|[9.0,18.0]|27.0|
+------+------+----------+----+
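As a variation on the approach above, the sketch below sums the assembled vector with plain Scala collections instead of going through Breeze. It assumes spark-mllib is on the classpath and runs against a local SparkSession; VectorSumSketch, withVectorSum and the sample rows are hypothetical names and data chosen for illustration.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

object VectorSumSketch {
  // Sum the entries of an ML vector; the result is a Double.
  val sumVector = udf((v: Vector) => v.toArray.sum)

  // Assemble all columns of df into one vector column, then sum it per row.
  def withVectorSum(df: DataFrame): DataFrame = {
    val assembler = new VectorAssembler()
      .setInputCols(df.columns)   // every column of df, assumed numeric
      .setOutputCol("allNum")
    assembler.transform(df).withColumn("sum", sumVector(col("allNum")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")          // local mode, assumed for this sketch
      .appName("vector-sum-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((0, 1), (1, 4), (2, 3)).toDF("value1", "value2")
    withVectorSum(df).show()

    spark.stop()
  }
}
```

Note that VectorAssembler converts the Int inputs to Double, so the sum column is a Double here, as in the output shown above.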

3 answers

Answer 1
2 accepted 2017-07-25 02:13:57

Answer 2
2 2017-07-25 07:43:08

Answer 3
1 2017-07-25 05:59:16
