Unable to split a column into multiple columns in a Spark DataFrame

Question

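I am unable to split a column into multiple columns in a Spark DataFrame, or through an RDD. I tried some other code, but it only works for a fixed number of columns. For example: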

The data types are name: string, city: list(string)

I have a text file with input data as follows:

Name, city

A, (hyd,che,pune)

B, (che,bang,del)

C, (hyd)

The required output is:

A,hyd 

A,che

A,pune

B,che

B,bang

B,del

C,hyd
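
The transformation asked for here can be sketched in plain Scala, without Spark, to make the intended per-line logic explicit. This is only an illustration: `expandLine` is a hypothetical helper name, and the sample lines are copied from the input above.

```scala
// Hypothetical helper: expands one "Name, (c1,c2,...)" line into (name, city) pairs.
def expandLine(line: String): Seq[(String, String)] = {
  // Split on the first comma only, so the commas inside the parentheses are kept.
  line.split(",", 2) match {
    case Array(name, cities) =>
      cities.trim
        .stripPrefix("(")
        .stripSuffix(")")
        .split(",")
        .map(_.trim)
        .filter(_.nonEmpty)
        .map(city => (name.trim, city))
        .toSeq
    case _ => Seq.empty // no comma on the line: nothing to expand
  }
}

val lines = Seq("A, (hyd,che,pune)", "B, (che,bang,del)", "C, (hyd)")
val pairs = lines.flatMap(expandLine)
pairs.foreach { case (name, city) => println(s"$name,$city") }
```

Because the helper returns a sequence per line, the same function could in principle be passed to `flatMap` on an `RDD[String]` or `Dataset[String]`, though that is not tested here.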

After reading the text file and converting it to a DataFrame, the DataFrame looks like this:

scala> data.show
+----------------+
|           value|
+----------------+
|      Name, city|
|A,(hyd,che,pune)|
|B,(che,bang,del)|
|         C,(hyd)|
|  D,(hyd,che,tn)|
+----------------+

Answer 1

You can use the explode function on a DataFrame:

val explodeDF = inputDF.withColumn("city", explode($"city"))
explodeDF.show()

http://sqlandhadoop.com/spark-dataframe-explode/
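
As a minimal sketch of what explode does, assuming the city column is already a real array<string> column (which is not the case for the raw text file in the question), the following builds a tiny DataFrame and explodes it. The session setup, the sample rows, and the name `explodedDF` are illustrative assumptions; in spark-shell the `spark` session already exists and those lines can be skipped.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

// Local session for the sketch only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Assumed sample data: city is an array<string> column, not a raw string.
val inputDF = Seq(
  ("A", Seq("hyd", "che", "pune")),
  ("C", Seq("hyd"))
).toDF("Name", "city")

// explode emits one output row per array element, repeating the other columns.
val explodedDF = inputDF.withColumn("city", explode($"city"))
explodedDF.show()
```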

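Now that I see you are loading each whole line as a single string, here is a solution for getting the desired output.

I defined two user-defined functions: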

// Split each input line on the first comma only, giving two elements (name, city list)
val split_to_two_strings: String => Array[String] = _.split(",", 2)
// Strip the surrounding ( and ), then split the remainder into an array of cities
val custom_conv_to_Array: String => Array[String] = _.stripPrefix("(").stripSuffix(")").split(",")

// In spark-shell, spark.implicits._ (needed for the $"..." syntax) is already imported
import org.apache.spark.sql.functions.{explode, trim, udf}
val custom_conv_to_ArrayUDF = udf(custom_conv_to_Array)
val split_to_two_stringsUDF = udf(split_to_two_strings)


val outputDF = inputDF.withColumn("tmp", split_to_two_stringsUDF($"value"))
  .select($"tmp".getItem(0).as("Name"), trim($"tmp".getItem(1)).as("city_list"))
  .withColumn("city_array", custom_conv_to_ArrayUDF($"city_list"))
  .drop($"city_list")
  .withColumn("city", explode($"city_array"))
  .drop($"city_array")

outputDF.show()
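
A possible alternative, sketched here with Spark's built-in column functions instead of UDFs, is to match each line against a regular expression and explode the split city list. The pattern `linePattern`, the sample rows, and the session setup are assumptions made for illustration; the filter also drops the "Name, city" header line, which the UDF version above would carry through as a data row.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{explode, regexp_extract, split}

// Local session for the sketch only; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("split-sketch").getOrCreate()
import spark.implicits._

// Assumed sample input: one raw line per row, mirroring data.show in the question.
val inputDF = Seq("Name, city", "A,(hyd,che,pune)", "B,(che,bang,del)", "C,(hyd)").toDF("value")

// Assumed line shape: name , ( comma-separated cities )
val linePattern = "^\\s*([^,]+?)\\s*,\\s*\\(\\s*(.*?)\\s*\\)\\s*$"

val outputDF = inputDF
  .filter($"value".rlike(linePattern)) // keeps only lines with a parenthesised list
  .select(
    regexp_extract($"value", linePattern, 1).as("Name"),
    // Split the captured list on commas and emit one row per city.
    explode(split(regexp_extract($"value", linePattern, 2), "\\s*,\\s*")).as("city")
  )

outputDF.show()
```

Built-in functions are generally easier for Spark to optimize than UDFs, which is the main reason to consider this variant.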

Hope this helps.

Solution 1
0 2019-10-23 11:43:20
