Union of two Spark DataFrames with different columns

Question

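I am trying to union two Spark DataFrames that have different sets of columns. For this, I referred to the following link:

How to perform union on two DataFrames with different amounts of columns in Spark?

My code is as follows: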

import org.apache.spark.sql.functions.{col, lit}

val cols1 = finalDF.columns.toSet
val cols2 = df.columns.toSet
val total = cols1 ++ cols2
finalDF = finalDF.select(expr(cols1, total): _*).unionAll(df.select(expr(cols2, total): _*))

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map(x => x match {
    case x if myCols.contains(x) => col(x)
    case _ => lit(null).as(x)
  })
}

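The problem I am facing is that some of the columns in both DataFrames are nested: I have columns of StructType as well as primitive types. Now suppose column A (of StructType) is in df but not in finalDF. In the expression,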

case _ => lit(null).as(x)

does not make it a StructType, which is why I cannot union them. It gives me the following error:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. NullType <> StructType(StructField(_VALUE,StringType,true), StructField(_id,LongType,true)) at the first column of the second table.

Any suggestions on what I can do here?
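The NullType in the error above comes from lit(null), which carries no type of its own. One possible way around it, which is not taken from the answers in this thread, is to cast the null to the type recorded in the other DataFrame's schema. The sketch below assumes that a column present in both DataFrames has the same type in each, and that column names contain no dots; TypedNullUnion, alignTo and unionDifferentColumns are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit}
import org.apache.spark.sql.types.DataType

object TypedNullUnion {
  // Hypothetical helper: select every column listed in `all`, in that order,
  // substituting a null cast to the expected type when `df` lacks the column.
  def alignTo(df: DataFrame, all: Seq[(String, DataType)]): DataFrame = {
    val present = df.columns.toSet
    val exprs: Seq[Column] = all.map { case (name, dataType) =>
      if (present.contains(name)) col(name)
      else lit(null).cast(dataType).as(name) // typed null instead of NullType
    }
    df.select(exprs: _*)
  }

  // Assumption: a column that exists in both frames has the same type in each.
  // If the types differ, the entry from `right` wins in the map and the
  // union may still fail or coerce.
  def unionDifferentColumns(left: DataFrame, right: DataFrame): DataFrame = {
    val all: Seq[(String, DataType)] =
      (left.schema.fields ++ right.schema.fields)
        .map(f => f.name -> f.dataType)
        .toMap
        .toSeq
        .sortBy(_._1) // fixed column order, since union matches by position
    alignTo(left, all).union(alignTo(right, all))
  }
}
```

With the names from the question, this would be called as TypedNullUnion.unionDifferentColumns(finalDF, df). Because the missing column A is created with the StructType taken from df's schema, both sides of the union should end up with compatible types.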

Answer 1

I would use built-in schema inference for this. It is more expensive, but much simpler than matching complex structures, which may conflict:

spark.read.json(df1.toJSON.union(df2.toJSON))
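As a slightly fuller sketch of the one-liner above, the helper below wraps it in a function. It assumes Spark 2.2 or later, where spark.read.json accepts a Dataset[String]; JsonUnion and unionViaJson are hypothetical names used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object JsonUnion {
  // Serialise both frames to JSON strings, union the strings, and let the
  // JSON reader infer a single merged schema from all of the rows.
  // Assumption: Spark 2.2+, where DataFrameReader.json takes a Dataset[String].
  def unionViaJson(spark: SparkSession, df1: DataFrame, df2: DataFrame): DataFrame =
    spark.read.json(df1.toJSON.union(df2.toJSON))
}
```

Because the schema is inferred again from text, the result is not guaranteed to keep the original types: integer columns typically come back as LongType, and the inferred columns are not in their original order. A column that is null in every row is dropped from the JSON and therefore from the result.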

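You could also import all files at the same time and join them with information extracted from the header, using input_file_name.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.input_file_name

val metadata: DataFrame  // Just metadata from the header
val data: DataFrame      // All files loaded together

// Tag each row with the file it was read from, then join on that file name
metadata.withColumn("file", input_file_name())
  .join(data.withColumn("file", input_file_name()), Seq("file"))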

Answer 2

df = df1.join(df2, ['each', 'shared', 'column'], how='full')

This will fill the missing data with nulls.
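A rough Scala equivalent of the PySpark line above might look like the following. It assumes the join keys are simply all column names the two DataFrames have in common; FullJoinUnion and joinOnShared are hypothetical names used for illustration.

```scala
import org.apache.spark.sql.DataFrame

object FullJoinUnion {
  // Full outer join on every column name the two frames share.
  // Columns that exist on only one side are null for rows from the other side.
  def joinOnShared(df1: DataFrame, df2: DataFrame): DataFrame = {
    val shared: Seq[String] = df1.columns.intersect(df2.columns).toSeq
    // Without any shared column the join would have no condition at all.
    require(shared.nonEmpty, "the DataFrames have no column in common")
    df1.join(df2, shared, "full")
  }
}
```

This is a join rather than a true union: two rows whose shared columns are all equal are merged into a single output row, and rows with a null in a shared column never match each other. Whether that is acceptable depends on the data.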

Answer 1: score 2, accepted, 2017-07-30 11:23:26
Answer 2: score 0, 2020-08-13 17:58:39
