
Convert each row of a DataFrame to a map

I have a DataFrame with columns A and B, both of type String. Assume the following DataFrame:

+---+---+
|A  |B  |
+---+---+
|1a |1b |
|2a |2b |
+---+---+

I want to add a third column, C, that holds a map built from columns A and B:

+---+---+--------------+
|A  |B  |C             |
+---+---+--------------+
|1a |1b |{A->1a, B->1b}|
|2a |2b |{A->2a, B->2b}|
+---+---+--------------+

I'm attempting to do it the following way. I have a UDF that takes a DataFrame and returns a map:

val test = udf((dataFrame: DataFrame) => {
  val result = new mutable.HashMap[String, String]
  dataFrame.columns.foreach(col => {
    result.put(col, dataFrame(col).asInstanceOf[String])
  })
  result
})

I call this UDF as follows, which throws a RuntimeException because I'm trying to pass a Dataset as a literal:

df.withColumn("C", Helper.test(lit(df.select(df.columns.head, df.columns.tail: _*))))

I don't want to pass df('a') and df('b') to my helper UDF explicitly, because I want it to work with a generic list of columns that I can select. Any pointers?

map way

You can just use the built-in map function:

import org.apache.spark.sql.functions._
val columns = df.columns
df.withColumn("C", map(columns.flatMap(x => Array(lit(x), col(x))): _*)).show(false)

which should give you

+---+---+---------------------+
|A  |B  |C                    |
+---+---+---------------------+
|1a |1b |Map(A -> 1a, B -> 1b)|
|2a |2b |Map(A -> 2a, B -> 2b)|
+---+---+---------------------+
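
The flatMap in the snippet above is there because map expects its arguments as alternating key and value columns (key1, value1, key2, value2, ...). As a rough illustration only, the following plain Scala sketch (no Spark involved) uses strings as stand-ins for the lit(x) and col(x) Column expressions to show the order in which they are produced. The object name InterleaveSketch and the helper interleave are hypothetical, not part of Spark or of the answer.

```scala
object InterleaveSketch {
  // Hypothetical helper: builds the alternating key/value argument list,
  // with plain strings standing in for lit(name) and col(name).
  def interleave(columns: Seq[String]): Seq[String] =
    columns.flatMap(name => Seq(s"lit($name)", s"col($name)"))

  def main(args: Array[String]): Unit = {
    // Print the argument order that would be passed to map for columns A and B.
    println(interleave(Seq("A", "B")).mkString(", "))
  }
}
```

Each column name contributes a literal key followed by that column's value, which is the pairing map relies on.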

UDF way

Or you can define your UDF as

// collect the column names to be used in the UDF
val columns = df.columns
// define the UDF: pair each name with the value at the same position
import org.apache.spark.sql.functions._
def createMapUdf = udf((names: Seq[String], values: Seq[String]) => names.zip(values).toMap)
// call the UDF (the value columns A and B are hardcoded here)
df.withColumn("C", createMapUdf(array(columns.map(x => lit(x)): _*), array(col("A"), col("B")))).show(false)
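
The body of that UDF is ordinary Scala, so its behavior can be checked without Spark. The sketch below is an illustration under assumptions: toRowMap is a hypothetical stand-in for the UDF body, and ZipToMapSketch is just a wrapper object. It also shows one thing worth keeping in mind: zip stops at the shorter sequence, so if the names array lists more columns than the values array, the extra names are silently dropped.

```scala
object ZipToMapSketch {
  // Hypothetical stand-in for the UDF body: pairs names with values by position.
  def toRowMap(names: Seq[String], values: Seq[String]): Map[String, String] =
    names.zip(values).toMap

  def main(args: Array[String]): Unit = {
    // Same length: every name gets its value.
    println(toRowMap(Seq("A", "B"), Seq("1a", "1b")))
    // More names than values: zip stops at the shorter sequence, so "C" is dropped.
    println(toRowMap(Seq("A", "B", "C"), Seq("1a", "1b")))
  }
}
```

This is why the names and the values passed to the UDF should be built from the same list of columns.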

I hope the answer is helpful

@Ramesh Maharjan - Your answers are already great; my answer just makes your UDF answer dynamic as well, using string interpolation.

Column D gives the same result as column C, but its value columns are built dynamically from the column list.

df.withColumn("C", createMapUdf(array(columns.map(x => lit(x)): _*),
    array(col("A"), col("B"))))
  .withColumn("D", createMapUdf(array(columns.map(x => lit(x)): _*),
    array(columns.map(x => col(s"$x")): _*)))
  .show()
