Adding multiple columns to a Spark Dataset while iterating over its records

Question

Spark 2.1.x here. I have a bunch of JSON files (all with the same schema) that I am reading into a single Spark Dataset like so:

val ds = spark.read.json("some/path/to/lots/of/json/*.json")

I can then print the schema of ds and see that everything was read correctly:

ds.printSchema()

// Outputs:
root
 |-- fizz: boolean (nullable = true)
 |-- moniker: string (nullable = true)
 |-- buzz: string (nullable = true)
 |-- foo: string (nullable = true)
 |-- bar: string (nullable = true)

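Note the moniker string column. I now want to:

Add three new columns to this dataset and/or its schema: (a) a date/time column called special_date, (b) a UUID column called special_uuid, and (c) a string column called special_phrase; and then
Iterate over all the records in ds and, for each record, pass its moniker value to three functions: (a) deriveSpecialDate(val moniker : String) : Date, (b) deriveSpecialUuid(val moniker : String) : UUID, and (c) deriveSpecialPhrase(val moniker : String) : String. The output of each function must then become the record's value for the corresponding column.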

My best attempt:

val ds = spark.read.json("some/path/to/lots/of/json/*.json")

ds.foreach(record => {
  val moniker : String = record.select("moniker")
  val specialDate : Date = deriveSpecialDate(moniker)
  val specialUuid : UUID = deriveSpecialUuid(moniker)
  val specialPhrase : String = deriveSpecialPhrase(moniker)

  // This doesn't work because special_* fields don't exist in the original
  // schema derived from the JSON files. We're ADDING these columns after the
  // JSON read and then populating their values dynamically.
  record.special_date = specialDate
  record.special_uuid = specialUuid
  record.special_phrase = specialPhrase
})

Any idea how I can accomplish this?

Answer 1

I would use Spark's udf (user-defined functions) to add three columns to the original dataset:

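import org.apache.spark.sql.functions.{col, udf}

// Replace each ??? with the real logic. Return types must be ones Spark SQL supports (not java.util.Date or UUID).
val deriveSpecialDate = udf((moniker: String) => (??? : java.sql.Timestamp))
val deriveSpecialUuid = udf((moniker: String) => (??? : String))
val deriveSpecialPhrase = udf((moniker: String) => (??? : String))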

After that, you can do the following:

ds.withColumn("special_date", deriveSpecialDate(col("moniker")))
  .withColumn("special_uuid", deriveSpecialUuid(col("moniker")))
  .withColumn("special_phrase", deriveSpecialPhrase(col("moniker")))

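This gives you a new DataFrame with the three additional columns. If needed, you can also convert it to a typed Dataset using the map function (or as[T] with a matching case class).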
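
As a rough end-to-end sketch of this approach, the following wires three UDFs into withColumn and then converts the result to a typed Dataset. Everything not in the question is an assumption made for illustration: the Enriched case class, the EnrichSketch object, the in-memory sample rows that stand in for the JSON read, and above all the placeholder bodies of the three UDFs, which are not the asker's real logic. The date is returned as java.sql.Timestamp and the UUID as a String because Spark SQL has no schema mapping for java.util.Date or java.util.UUID.

```scala
import java.sql.Timestamp
import java.util.UUID
import org.apache.spark.sql.{DataFrame, Dataset, Encoders, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical typed view of the enriched rows (only two of the JSON fields shown).
case class Enriched(moniker: String, foo: String,
                    special_date: Timestamp, special_uuid: String, special_phrase: String)

object EnrichSketch {
  // Placeholder derivations; the real logic belongs to the question's derive* functions.
  val deriveSpecialDate = udf((m: String) => new Timestamp(m.length.toLong * 1000L))
  val deriveSpecialUuid = udf((m: String) => UUID.nameUUIDFromBytes(m.getBytes("UTF-8")).toString)
  val deriveSpecialPhrase = udf((m: String) => s"hello, $m")

  // Adds the three columns, then maps the rows onto Enriched by column name.
  def enrich(df: DataFrame): Dataset[Enriched] =
    df.withColumn("special_date", deriveSpecialDate(col("moniker")))
      .withColumn("special_uuid", deriveSpecialUuid(col("moniker")))
      .withColumn("special_phrase", deriveSpecialPhrase(col("moniker")))
      .as[Enriched](Encoders.product[Enriched])

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("enrich-sketch").getOrCreate()
    import spark.implicits._

    // Stand-in for spark.read.json(...): a tiny in-memory DataFrame.
    val ds = Seq(("alpha", "x"), ("beta", "y")).toDF("moniker", "foo")

    enrich(ds).show(truncate = false)
    spark.stop()
  }
}
```

Note that withColumn never mutates ds; each call returns a new DataFrame, which is why the foreach attempt in the question cannot work.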

Answer 2

To create new columns, you can use withColumn. And if you already have a function, you need to register it as a UDF (user-defined function):

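// register returns a UserDefinedFunction. The wrapped functions must return types
// Spark SQL supports (e.g. java.sql.Timestamp, String), not java.util.Date or UUID.
val sd = spark.udf.register("deriveSpecialDate", deriveSpecialDate _)
val su = spark.udf.register("deriveSpecialUuid", deriveSpecialUuid _)
val sp = spark.udf.register("deriveSpecialPhrase", deriveSpecialPhrase _)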

To use these UDFs, call withColumn, which creates a new column:

import spark.implicits._ // enables the $"column" syntax

ds.withColumn("special_date", sd($"moniker"))
  .withColumn("special_uuid", su($"moniker"))
  .withColumn("special_phrase", sp($"moniker"))

This gives you the original dataset with the three newly added columns.
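
One caveat when registering existing functions this way: if the question's Date and UUID are java.util.Date and java.util.UUID (an assumption, since the question does not show its imports), Spark cannot infer a schema for those return types and the registration fails. A minimal sketch of thin adapters that could be registered instead is shown below. The SparkFriendly object, the adapter names, and the stand-in bodies of deriveSpecialDate and deriveSpecialUuid are all hypothetical.

```scala
import java.sql.Timestamp
import java.util.{Date, UUID}

object SparkFriendly {
  // Hypothetical stand-ins for the question's functions (the real bodies are unknown).
  def deriveSpecialDate(moniker: String): Date = new Date(moniker.length.toLong * 1000L)
  def deriveSpecialUuid(moniker: String): UUID = UUID.nameUUIDFromBytes(moniker.getBytes("UTF-8"))

  // Adapters that convert the results into types Spark SQL can store.
  def specialTimestamp(moniker: String): Timestamp =
    new Timestamp(deriveSpecialDate(moniker).getTime)

  def specialUuidString(moniker: String): String =
    deriveSpecialUuid(moniker).toString
}
```

The adapters can then be passed to udf.register in place of the original functions, for example `spark.udf.register("deriveSpecialUuid", SparkFriendly.specialUuidString _)`. A function that already returns String, such as deriveSpecialPhrase, needs no adapter.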

Answer 1
Score 1, accepted, 2017-07-27 14:25:48

Answer 2
Score 0, 2017-07-27 14:48:45
