
Add new column of Map Datatype to Spark Dataframe in Scala

I am able to create a new dataframe that has a column of Map datatype.

val inputDF2 = Seq(
(1, "Visa", 1, Map[String, Int]()), 
(2, "MC", 2, Map[String, Int]())).toDF("id", "card_type", "number_of_cards", "card_type_details")
scala> inputDF2.show(false)
+---+---------+---------------+-----------------+
|id |card_type|number_of_cards|card_type_details|
+---+---------+---------------+-----------------+
|1  |Visa     |1              |[]               |
|2  |MC       |2              |[]               |
+---+---------+---------------+-----------------+

Now I want to create a new column of the same type as card_type_details. I am trying to use Spark's withColumn method to add this new column.

inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").show(false)

+---------+---------+---------------+---------------------+-----+
|person_id|card_type|number_of_cards|card_type_details    |tmp  |
+---------+---------+---------------+---------------------+-----+
|1        |Visa     |1              |[]                   |null |
|2        |MC       |2              |[]                   |null |
+---------+---------+---------------+---------------------+-----+ 

When I check the schemas of the two columns they look the same (apart from the valueContainsNull flag), but the values are different.

scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>").printSchema
root
 |-- id: integer (nullable = false)
 |-- card_type: string (nullable = true)
 |-- number_of_cards: integer (nullable = false)
 |-- card_type_details: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = false)
 |-- tmp: map (nullable = true)
 |    |-- key: string
 |    |-- value: integer (valueContainsNull = true)

I am not sure whether I am adding the new column correctly. The problem arises when I apply the .isEmpty method on the tmp column: I get a null pointer exception.

scala> def checkValue = udf((card_type_details: Map[String, Int]) => {
     | var output_map = Map[String, Int]()
     | if (card_type_details.isEmpty) { output_map += 0.toString -> 1 }
     | else {output_map = card_type_details }
     | output_map
     | })
checkValue: org.apache.spark.sql.expressions.UserDefinedFunction
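For reference, the body of the UDF above boils down to this plain-Scala function (the name `checkValueLogic` is mine, for illustration; no Spark is needed to check the intended behaviour):

```scala
// Plain-Scala restatement of the UDF body above: an empty map is
// replaced by the default entry "0" -> 1, a non-empty map passes
// through unchanged.
def checkValueLogic(cardTypeDetails: Map[String, Int]): Map[String, Int] =
  if (cardTypeDetails.isEmpty) Map(0.toString -> 1)
  else cardTypeDetails
```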

scala> inputDF2.withColumn("value", checkValue(col("card_type_details"))).show(false)
+---+---------+---------------+-----------------+--------+
|id |card_type|number_of_cards|card_type_details|value   |
+---+---------+---------------+-----------------+--------+
|1  |Visa     |1              |[]               |[0 -> 1]|
|2  |MC       |2              |[]               |[0 -> 1]|
+---+---------+---------------+-----------------+--------+

scala> inputDF2.withColumn("tmp", lit(null) cast "map<String, Int>")
.withColumn("value", checkValue(col("tmp"))).show(false)

org.apache.spark.SparkException: Failed to execute user defined function($anonfun$checkValue$1: (map<string,int>) => map<string,int>)

Caused by: java.lang.NullPointerException
  at $anonfun$checkValue$1.apply(<console>:28)
  at $anonfun$checkValue$1.apply(<console>:26)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:108)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF$$anonfun$2.apply(ScalaUDF.scala:107)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1063)

How do I add a new column that has the same values as the card_type_details column?

To add a tmp column with the same values as card_type_details, you only need to do:

inputDF2.withColumn("tmp", col("card_type_details"))

If you intend to add the column with an empty map and avoid the NullPointerException, the solution is:

inputDF2.withColumn("tmp", typedLit(Map.empty[String, Int]))
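Alternatively, the UDF itself can be made null-safe. The NullPointerException happens because a SQL NULL map reaches the UDF as a Scala null, and `null.isEmpty` throws. A sketch of the fix in plain Scala (the helper names are mine): wrapping the argument in Option makes a null map fall through to the same default as an empty map.

```scala
import scala.util.Try

// Original logic: throws NullPointerException when m is null,
// because .isEmpty is invoked on a null reference.
def checkValueUnsafe(m: Map[String, Int]): Map[String, Int] =
  if (m.isEmpty) Map("0" -> 1) else m

// Null-safe variant: Option(null) is None, so both a null map and
// an empty map yield the default entry "0" -> 1.
def checkValueSafe(m: Map[String, Int]): Map[String, Int] =
  Option(m).filter(_.nonEmpty).getOrElse(Map("0" -> 1))
```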


