Add a new column of Map datatype to a Spark DataFrame in Scala

I am able to create a new DataFrame that has a column of Map datatype.

val inputDF2 = Seq(
(1, "Visa", 1, Map[String, Int]()), 
(2, "MC", 2, Map[String, Int]())).toDF("id", "card_type", "number_of_cards", "card_type_details")
scala> inputDF2.show(false)
+---+---------+---------------+-----------------+
|id |card_type|number_of_cards|card_type_details|
+---+---------+---------------+-----------------+
|1  |Visa     |1              |[]               |
|2  |MC       |2              |[]               |
+---+---------+---------------+-----------------+

Now I want to create a new column with the same type as card_type_details. I am trying to add it with Spark's withColumn method.

inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").show(false)

+---------+---------+---------------+---------------------+-----+
|person_id|card_type|number_of_cards|card_type_details    |tmp  |
+---------+---------+---------------+---------------------+-----+
|1        |Visa     |1              |[]                   |null |
|2        |MC       |2              |[]                   |null |
+---------+---------+---------------+---------------------+-----+ 

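When I check the schema, both columns have the same type, but their values differ: card_type_details holds an empty map, while tmp holds null.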

scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").printSchema
root
 |-- id: integer (nullable = false)
 |-- card_type: string (nullable = true)
 |-- number_of_cards: integer (nullable = false)
 |-- card_type_details: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = false)
 |-- tmp: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

I am not sure whether I am adding the new column correctly. The problem appears when I call .isEmpty on the tmp column: I get a NullPointerException.

scala> def checkValue = udf((card_type_details: Map[String, Int]) => {
     | var output_map = Map[String, Int]()
     | if (card_type_details.isEmpty) { output_map += 0.toString -> 1 }
     | else {output_map = card_type_details }
     | output_map
     | })
checkValue: org.apache.spark.sql.expressions.UserDefinedFunction

scala> inputDF2.withColumn("value", checkValue(col("card_type_details"))).show(false)
+---+---------+---------------+-----------------+--------+
|id |card_type|number_of_cards|card_type_details|value   |
+---+---------+---------------+-----------------+--------+
|1  |Visa     |1              |[]               |[0 -> 1]|
|2  |MC       |2              |[]               |[0 -> 1]|
+---+---------+---------------+-----------------+--------+

scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>")
.withColumn("value", checkValue(col("tmp"))).show(false)

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$checkValue$1: (map<string,int>) => map<string,int>)

Caused by: java.lang.NullPointerException
  at $anonfun$checkValue$1.apply(<console>:28)
  at $anonfun$checkValue$1.apply(<console>:26)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:108)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:107)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)

How can I add a new column that has the same value as the card_type_details column?

To add a tmp column with the same value as card_type_details, you only need to do the following:

inputDF2.withColumn("tmp", col("card_type_details"))

If you instead want to add a column holding an empty map, and so avoid the NullPointerException, the solution is:

inputDF2.withColumn("tmp", typedLit(Map.empty[String, Int]))
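
For reference, here is a minimal sketch that puts the pieces together. It assumes a local SparkSession; the object name, the app name, and the helper fillDefault are made up for illustration and do not come from the question. The key and value types passed to typedLit are chosen as String and Int so that tmp matches card_type_details. The sketch also shows a null-safe variant of the UDF, which is one possible way to keep a null map (as produced by lit(null) cast ...) from throwing.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, typedLit, udf}

object EmptyMapColumnSketch {

  // Hypothetical helper: treats a null map the same way as an empty one.
  def fillDefault(m: Map[String, Int]): Map[String, Int] =
    if (m == null || m.isEmpty) Map("0" -> 1) else m

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; master and app name are assumptions.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("empty-map-sketch")
      .getOrCreate()
    import spark.implicits._

    val inputDF2 = Seq(
      (1, "Visa", 1, Map[String, Int]()),
      (2, "MC", 2, Map[String, Int]())
    ).toDF("id", "card_type", "number_of_cards", "card_type_details")

    // Null-safe counterpart of checkValue from the question.
    val checkValueSafe = udf(fillDefault _)

    // tmp holds an empty map (not null), typed like card_type_details.
    val withEmpty = inputDF2.withColumn("tmp", typedLit(Map.empty[String, Int]))
    withEmpty.withColumn("value", checkValueSafe(col("tmp"))).show(false)

    // tmp is null here, as in the question; the null check inside
    // fillDefault keeps the UDF from throwing.
    val withNull = inputDF2.withColumn("tmp", lit(null).cast("map<string,int>"))
    withNull.withColumn("value", checkValueSafe(col("tmp"))).show(false)

    spark.stop()
  }
}
```

With typedLit the original checkValue UDF should also work unchanged, because .isEmpty is then called on a real (empty) map. The null check is only needed if the column can still contain nulls.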
