
Scala Spark: how to add a value to a column

My goal is to add a configurable constant value to a given column of a DataFrame.

val df = Seq(("A", 1), ("B", 2), ("C", 3)).toDF("col1", "col2")

+----+----+
|col1|col2|
+----+----+
|   A|   1|
|   B|   2|
|   C|   3|
+----+----+

To do so, I can define a UDF with a hard-coded number, as follows:

import org.apache.spark.sql.functions.udf

val add100 = udf((x: Int) => x + 100)
df.withColumn("col3", add100($"col2")).show()

+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   1| 101|
|   B|   2| 102|
|   C|   3| 103|
+----+----+----+    

My question is, what's the best way to make the number (100 above) configurable?

I have tried the following, and it seems to work, but I was wondering whether there is a better way to achieve the same result.

import org.apache.spark.sql.functions.{lit, udf}

val addP = udf((x: Int, p: Int) => x + p)
df.withColumn("col4", addP($"col2", lit(100))).show()

+----+----+----+
|col1|col2|col4|
+----+----+----+
|   A|   1| 101|
|   B|   2| 102|
|   C|   3| 103|
+----+----+----+

We don't need a udf here:

df.withColumn("col3", df("col2") + 100).show
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   1| 101|
|   B|   2| 102|
|   C|   3| 103|
+----+----+----+
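Since `Column`'s `+` accepts plain Scala values and wraps them in a literal automatically, the configurable constant can simply be an ordinary variable. A minimal self-contained sketch, assuming a standalone app rather than spark-shell (`amount` and the app name are placeholders, not from the original question):

```scala
import org.apache.spark.sql.SparkSession

object AddConstant {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("add-constant")   // placeholder name
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("A", 1), ("B", 2), ("C", 3)).toDF("col1", "col2")

    // Hypothetical configurable value, e.g. read from app config.
    // Column.+ converts it to a literal; no udf is needed.
    val amount: Int = 100
    df.withColumn("col3", $"col2" + amount).show()

    spark.stop()
  }
}
```

The same pattern works for any numeric column arithmetic (`-`, `*`, `/`), so the constant stays in one place in your code instead of being baked into a udf.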

You may define a curried function: pull the extra parameter out and return a udf that takes only columns as parameters:

val addP = (p: Int) => udf( (x: Int) => x + p ) 
// addP: Int => org.apache.spark.sql.expressions.UserDefinedFunction = <function1>

df.withColumn("col3", addP(100)($"col2")).show
+----+----+----+
|col1|col2|col3|
+----+----+----+
|   A|   1| 101|
|   B|   2| 102|
|   C|   3| 103|
+----+----+----+
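The currying trick itself is plain Scala: applying the outer parameter fixes `p` in a closure, and the inner `Int => Int` function is what `udf` then lifts. A Spark-free sketch of the same pattern:

```scala
// Curried form: applying the first argument returns an Int => Int
// that has captured p in its closure.
val addP: Int => Int => Int = (p: Int) => (x: Int) => x + p

val add100 = addP(100) // fix the configurable constant once
println(add100(1))     // prints 101
println(add100(3))     // prints 103
```

Because each call to `addP(n)` builds a fresh function, you can register several differently-configured udfs from the same definition.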
