
How to use variable arguments :_* in a UDF with Scala/Spark?

I have a DataFrame where the number of columns is variable. Every column's type is Int, and I want the sum of all columns. I thought of using :_*, and this is my code:

    import scala.collection.mutable.ArrayBuffer
    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.{col, udf}
    import spark.implicits._

    val arr = Array(1, 4, 3, 2, 5, 7, 3, 5, 4, 18)
    val input = new ArrayBuffer[(Int, Int)]()
    for (i <- 0 until 10) {
      input.append((i, arr(i % 10)))
    }

    val df = sc.parallelize(input, 3).toDF("value1", "value2")
    val cols = new ArrayBuffer[Column]()
    for (name <- df.columns) {
      cols.append(col(name))
    }
    val func = udf((s: Int*) => s.sum)
    df.withColumn("sum", func(cols: _*)).show()

But I get an error:

Error:(101, 27) ')' expected but identifier found.
  val func = udf((s: Int*) => s.sum) 

How do I use :_* in a UDF? My expected result is:

+------+------+---+
|value1|value2|sum|
+------+------+---+
|     0|     1|  1|
|     1|     4|  5|
|     2|     3|  5|
|     3|     2|  5|
|     4|     5|  9|
|     5|     7| 12|
|     6|     3|  9|
|     7|     5| 12|
|     8|     4| 12|
|     9|    18| 27|
+------+------+---+

Spark UDFs do not support variable-length arguments. Here is a solution to your problem:

import org.apache.spark.sql.functions.col
import spark.implicits._

val input = Array(1, 4, 3, 2, 5, 7, 3, 5, 4, 18).zipWithIndex

val df = spark.sparkContext.parallelize(input, 3).toDF("value2", "value1")

df.withColumn("total", df.columns.map(col).reduce(_ + _)).show()

Output:

+------+------+-----+
|value2|value1|total|
+------+------+-----+
|     1|     0|    1|
|     4|     1|    5|
|     3|     2|    5|
|     2|     3|    5|
|     5|     4|    9|
|     7|     5|   12|
|     3|     6|    9|
|     5|     7|   12|
|     4|     8|   12|
|    18|     9|   27|
+------+------+-----+

Hope this helps.
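For context on why the original attempt fails: a Scala function literal cannot declare a varargs parameter, so (s: Int*) => s.sum is a syntax error before Spark is even involved. A varargs method, however, eta-expands to a function taking a Seq. A plain-Scala sketch (the method name here is illustrative):

```scala
// A function literal cannot take varargs: udf((s: Int*) => s.sum) is
// rejected by the parser, which is the "')' expected" error above.
// A *method* may take varargs, and eta-expanding it yields a function
// whose single parameter is a Seq[Int]:
def sumAll(s: Int*): Int = s.sum            // illustrative method name

val asFunction: Seq[Int] => Int = sumAll _  // eta-expansion: Int* becomes Seq[Int]

println(asFunction(Seq(1, 4, 3)))           // prints 8
```

This is why the accepted workaround passes a single array column to a Seq-taking UDF instead of trying to pass the columns as varargs.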

This may be what you expect:

val func = udf((s: Seq[Int]) => s.sum)
df.withColumn("sum", func(array(cols: _*))).show()

where array is org.apache.spark.sql.functions.array, which:

Creates a new array column. The input columns must all have the same data type.

You can also try VectorAssembler:

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Vector
import breeze.linalg.DenseVector

val assembler = new VectorAssembler().
  setInputCols(Array("value1", "value2")).
  setOutputCol("allNum")

val assembledDF = assembler.transform(df)

assembledDF.show

+------+------+----------+                                                      
|value1|value2|    allNum|
+------+------+----------+
|     0|     1| [0.0,1.0]|
|     1|     4| [1.0,4.0]|
|     2|     3| [2.0,3.0]|
|     3|     2| [3.0,2.0]|
|     4|     5| [4.0,5.0]|
|     5|     7| [5.0,7.0]|
|     6|     3| [6.0,3.0]|
|     7|     5| [7.0,5.0]|
|     8|     4| [8.0,4.0]|
|     9|    18|[9.0,18.0]|
+------+------+----------+

def yourSumUDF = udf((allNum:Vector) => new DenseVector(allNum.toArray).sum)
assembledDF.withColumn("sum", yourSumUDF($"allNum")).show

+------+------+----------+----+                      
|value1|value2|    allNum| sum|
+------+------+----------+----+
|     0|     1| [0.0,1.0]| 1.0|
|     1|     4| [1.0,4.0]| 5.0|
|     2|     3| [2.0,3.0]| 5.0|
|     3|     2| [3.0,2.0]| 5.0|
|     4|     5| [4.0,5.0]| 9.0|
|     5|     7| [5.0,7.0]|12.0|
|     6|     3| [6.0,3.0]| 9.0|
|     7|     5| [7.0,5.0]|12.0|
|     8|     4| [8.0,4.0]|12.0|
|     9|    18|[9.0,18.0]|27.0|
+------+------+----------+----+
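On Spark 2.4+, the same per-row sum can also be expressed without any UDF, using the built-in aggregate higher-order function over an array column. A sketch, assuming a DataFrame df whose columns are all Int, as in the question:

```scala
import org.apache.spark.sql.functions.{array, col, expr}

// Collect every column into one array column, then fold it with the
// SQL higher-order function aggregate(...) (available since Spark 2.4):
val withSum = df
  .withColumn("all", array(df.columns.map(col): _*))
  .withColumn("sum", expr("aggregate(all, 0, (acc, x) -> acc + x)"))
  .drop("all")

withSum.show()
```

Staying with built-in functions avoids the serialization overhead of a UDF and lets Catalyst optimize the expression.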
