Flattening a nested schema in a DataFrame gives AnalysisException: cannot resolve column name

Question

I have a DataFrame with the following schema:

 |-- str1: struct (nullable = true)
 |    |-- a1: string (nullable = true)
 |    |-- a2: string (nullable = true)
 |    |-- a3: string (nullable = true)
 |-- str2: string (nullable = true)
 |-- str3: string (nullable = true)
 |-- str4: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b1: string (nullable = true)
 |    |    |-- b2: string (nullable = true)
 |    |    |-- b3: boolean (nullable = true)
 |    |    |-- b4: struct (nullable = true)
 |    |    |    |-- c1: integer (nullable = true)
 |    |    |    |-- c2: string (nullable = true)
 |    |    |    |-- c3: integer (nullable = true)

I am trying to flatten it, and for that I used the following code:

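  import org.apache.spark.sql.Column
  import org.apache.spark.sql.types.{ArrayType, StructType}

  // Recursively walk the schema and return one Column per leaf field,
  // aliased with its full dotted path (e.g. "str1.a1").
  def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
    schema.fields.flatMap(f => {
      val colName = if (prefix == null) f.name else (prefix + "." + f.name)

      f.dataType match {
        case st: StructType => flattenSchema(st, colName)
        case at: ArrayType =>
          // Assumes every array holds structs; any other element type throws here.
          val st = at.elementType.asInstanceOf[StructType]
          flattenSchema(st, colName)
        case _ => Array(new Column(colName).as(colName))
      }
    })
  }

  val d1 = df.select(flattenSchema(df.schema): _*)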

It gives me the following output:

 |-- str1.a1: string (nullable = true)
 |-- str1.a2: string (nullable = true)
 |-- str1.a3: string (nullable = true)
 |-- str2: string (nullable = true)
 |-- str3: string (nullable = true)
 |-- str4.b1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- str4.b2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- str4.b3: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- str4.b4.c1: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- str4.b4.c2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- str4.b4.c3: array (nullable = true)
 |    |-- element: integer (containsNull = true)

The problem appears when I try to query it:

d1.select("str2").show   // works without any problem

But when I query any of the flattened nested columns

d1.select("str1.a1")

Error:

org.apache.spark.sql.AnalysisException: cannot resolve '`str1.a1`' given input columns: ....

What am I doing wrong here? Or is there another way to get the desired result?

Answer 1

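Spark does not treat a dot (.) in a plain column-name string as part of the name. The dot is used to access a sub-field of a struct column, so d1.select("str1.a1") looks for a field a1 inside a struct column str1, and d1 no longer has such a column. If you select the same column from the original DataFrame df, it works, because in df str1 is a struct.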
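Two possible ways around this, as a minimal sketch rather than a definitive fix: quote the dotted name with backticks so Spark resolves it as a single column, or alias the flattened columns with an underscore instead of a dot. The tiny `df` below is a hypothetical stand-in for the DataFrame in the question, the local `SparkSession` is an assumption (in spark-shell `spark` already exists), and the underscore separator is just one choice. The sketch also assumes the field names themselves contain no dots.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.{ArrayType, StructType}

// Assumption: a local session for illustration only.
val spark = SparkSession.builder().master("local[1]").appName("flatten-sketch").getOrCreate()
import spark.implicits._

// Hypothetical stand-in data: str1 is a struct with fields a1 and a2.
val df = Seq(("x", "y", "s")).toDF("a1", "a2", "str2")
  .select(struct(col("a1"), col("a2")).as("str1"), col("str2"))

// Option 1: keep the dotted alias, but quote it with backticks when selecting.
val d1 = df.select(col("str1.a1").as("str1.a1"), col("str2"))
d1.select("`str1.a1`").show()

// Option 2: flatten with "_" in the alias so no quoting is needed later.
def flattenSchema(schema: StructType, prefix: Option[String] = None): Array[Column] =
  schema.fields.flatMap { f =>
    // Dotted path used to read the field from the nested source column.
    val path = prefix.fold(f.name)(p => s"$p.${f.name}")
    f.dataType match {
      case st: StructType               => flattenSchema(st, Some(path))
      // Only arrays of structs are descended into; other arrays are kept as leaves.
      case ArrayType(st: StructType, _) => flattenSchema(st, Some(path))
      case _                            => Array(col(path).as(path.replace(".", "_")))
    }
  }

val d2 = df.select(flattenSchema(df.schema): _*)
d2.select("str1_a1").show()
```

Matching on `ArrayType(st: StructType, _)` instead of casting also avoids a ClassCastException for arrays whose elements are not structs. Fields taken from an array of structs still come back as arrays, as in the output shown in the question.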

Solution 1
3  Accepted  2020-02-20 11:17:43
