
Union of two Spark dataframes with different columns

I am trying to union two Spark dataframes with different sets of columns. For this purpose, I referred to the following link:

How to perform union on two DataFrames with different amounts of columns in spark?

My code is as follows:

val cols1 = finalDF.columns.toSet
val cols2 = df.columns.toSet
val total = cols1 ++ cols2  // all columns from both dataframes
finalDF = finalDF.select(expr(cols1, total): _*)
  .unionAll(df.select(expr(cols2, total): _*))

def expr(myCols: Set[String], allCols: Set[String]) = {
  allCols.toList.map {
    case x if myCols.contains(x) => col(x)  // column exists in this dataframe
    case x => lit(null).as(x)               // missing column: null placeholder
  }
}

But the problem I am facing is that some of the columns in both dataframes are nested. I have columns of both StructType and primitive types. Now, say column A (of StructType) is in df but not in finalDF. But in expr,

case _ => lit(null).as(x)

is not making it a StructType. That's why I am not able to union them. It gives me the following error:

org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. NullType <> StructType(StructField(_VALUE,StringType,true), StructField(_id,LongType,true)) at the first column of the second table.

Any suggestions on what I can do here?
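One way around the NullType mismatch in the question's expr helper is to cast the null literal to the type the *other* dataframe declares for that column, so Union sees compatible types on both sides. A minimal, Spark-free sketch of the idea (the helper name selectExprs and the typesFromOther map are hypothetical; in real code the type strings would come from the other dataframe's schema, e.g. field.dataType.sql, and the result would be fed to df.selectExpr(exprs: _*)):

```scala
// Sketch (assumption: not from the original post). Instead of
// lit(null).as(x), emit "CAST(NULL AS <type>) AS x", taking the type from
// the dataframe that does have the column. Plain strings keep this
// runnable without Spark.
def selectExprs(myCols: Set[String],
                allCols: Seq[String],
                typesFromOther: Map[String, String]): Seq[String] =
  allCols.map {
    case c if myCols.contains(c) => c                       // column present: select as-is
    case c => s"CAST(NULL AS ${typesFromOther(c)}) AS $c"   // missing: typed null
  }

// Example: struct column "A" exists only in the other dataframe.
val exprs = selectExprs(
  myCols = Set("id", "name"),
  allCols = Seq("id", "name", "A"),
  typesFromOther = Map("A" -> "STRUCT<_VALUE: STRING, _id: BIGINT>"))
```

Because the null is cast to the full struct type, the union no longer fails with NullType <> StructType.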

I'd use built-in schema inference for this. It is way more expensive, but much simpler than matching complex structures with possible conflicts:

spark.read.json(df1.toJSON.union(df2.toJSON))

You can also import all files at the same time and join with information extracted from the header, using input_file_name.

import org.apache.spark.sql.functions.input_file_name

val metadata: DataFrame  // Just metadata from the header
val data: DataFrame      // All files loaded together

metadata.withColumn("file", input_file_name)
  .join(data.withColumn("file", input_file_name), Seq("file"))

Alternatively (PySpark syntax), a full outer join on the shared columns

df = df1.join(df2, ['each', 'shared', 'column'], how='full')

will fill missing data with nulls.
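The null-filling behaviour of a full outer join can be sketched without Spark. This toy version (a hypothetical helper, with keyed rows standing in for the shared join columns) shows how keys present on only one side pick up empty values, which Spark would surface as NULLs in the joined row:

```scala
// Toy model of full-outer-join semantics (illustrative only, not Spark API).
// The key models the shared join columns; a side with no match for a key
// contributes None, i.e. a NULL in the joined output.
def fullOuterJoin[K, V](left: Map[K, V],
                        right: Map[K, V]): Map[K, (Option[V], Option[V])] =
  (left.keySet ++ right.keySet)          // every key seen on either side
    .map(k => k -> (left.get(k), right.get(k)))
    .toMap

val joined = fullOuterJoin(Map(1 -> "a", 2 -> "b"), Map(2 -> "B", 3 -> "C"))
```

Key 1 exists only on the left and key 3 only on the right, so each gets a None on the missing side, mirroring the nulls a Spark full outer join produces.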
