I have a DataFrame with columns A and B of type String. Assume the following DataFrame:
+---+---+
|A  |B  |
+---+---+
|1a |1b |
|2a |2b |
+---+---+
I want to add a third column, C, that holds a map built from columns A and B:
+---+---+----------------+
|A  |B  |C               |
+---+---+----------------+
|1a |1b |{A->1a, B->1b}  |
|2a |2b |{A->2a, B->2b}  |
+---+---+----------------+
I'm attempting to do it the following way. I have a UDF that takes a DataFrame and returns a map:
val test = udf((dataFrame: DataFrame) => {
  val result = new mutable.HashMap[String, String]
  dataFrame.columns.foreach(col => {
    result.put(col, dataFrame(col).asInstanceOf[String])
  })
  result
})
I'm calling this UDF as follows, which throws a RuntimeException because I'm trying to pass a Dataset as a literal:
df.withColumn("C", Helper.test(lit(df.select(df.columns.head, df.columns.tail: _*))))
I don't want to pass df('a') and df('b') to my helper UDF, because I want it to work with a generic list of columns that I select. Any pointers?
map way
You can just use the built-in map function:
import org.apache.spark.sql.functions._
val columns = df.columns
df.withColumn("C", map(columns.flatMap(x => Array(lit(x), col(x))): _*)).show(false)
which should give you
+---+---+---------------------+
|A |B |C |
+---+---+---------------------+
|1a |1b |Map(A -> 1a, B -> 1b)|
|2a |2b |Map(A -> 2a, B -> 2b)|
+---+---+---------------------+
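The map function expects its arguments as alternating keys and values (key1, value1, key2, value2, ...), which is why flatMap is used to expand each column name into two entries. The following plain-Scala sketch (no Spark required) illustrates only that interleaving. The object name InterleaveSketch and the string placeholders standing in for lit(x) and col(x) are assumptions made for illustration, not part of the answer above.

```scala
// Plain-Scala illustration of how the argument list for map(...) is built.
// Strings stand in for the Spark Column objects lit(x) and col(x).
object InterleaveSketch {
  // Stand-in for df.columns
  val columns: Array[String] = Array("A", "B")

  // Each column name expands to two entries: its key, then its value column.
  val args: Array[String] =
    columns.flatMap(x => Array(s"lit($x)", s"col($x)"))

  def main(unused: Array[String]): Unit =
    println(args.mkString(", ")) // prints: lit(A), col(A), lit(B), col(B)
}
```

Because the list is derived from whatever names are in columns, the same expression works for any subset of columns you choose to put there.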
Udf way
Or you can define your own udf as:
//collecting column names to be used in the udf
val columns = df.columns
//defining udf function
import org.apache.spark.sql.functions._
def createMapUdf = udf((names: Seq[String], values: Seq[String])=> names.zip(values).toMap)
//calling udf function
df.withColumn("C", createMapUdf(array(columns.map(x => lit(x)): _*), array(col("A"), col("B")))).show(false)
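To see what the body of the UDF does for a single row, here is a minimal plain-Scala sketch of the same zip-then-toMap logic applied to ordinary collections. The object name ZipToMapSketch and the function name createMap are hypothetical and used only for this illustration.

```scala
// Plain-Scala version of the UDF body: pair names with values, then build a Map.
object ZipToMapSketch {
  // Same logic as the lambda passed to udf(...), written as an ordinary function.
  def createMap(names: Seq[String], values: Seq[String]): Map[String, String] =
    names.zip(values).toMap

  def main(unused: Array[String]): Unit =
    // One row of the example data: A = "1a", B = "1b"
    println(createMap(Seq("A", "B"), Seq("1a", "1b"))) // prints: Map(A -> 1a, B -> 1b)
}
```

Note that zip stops at the shorter of the two sequences, so if the names and values arrays differ in length, the extra entries are silently dropped.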
I hope the answer is helpful
@ Ramesh Maharjan - Your answers are already great; my answer just makes your UDF answer dynamic as well, using string interpolation. Column D is the one built dynamically, from the column names rather than hardcoded columns:
df.withColumn("C", createMapUdf(array(columns.map(x => lit(x)): _*),
array(col("A"), col("B"))))
.withColumn("D", createMapUdf(array(columns.map(x => lit(x)): _*),
array(columns.map(x => col(s"$x") ): _* ))).show()