Flatten nested schema in DataFrame, getting AnalysisException: cannot resolve column name

I have a DataFrame with the following schema:

 |-- str1: struct (nullable = true)
 |    |-- a1: string (nullable = true)
 |    |-- a2: string (nullable = true)
 |    |-- a3: string (nullable = true)
 |-- str2: string (nullable = true)
 |-- str3: string (nullable = true)
 |-- str4: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- b1: string (nullable = true)
 |    |    |-- b2: string (nullable = true)
 |    |    |-- b3: boolean (nullable = true)
 |    |    |-- b4: struct (nullable = true)
 |    |    |    |-- c1: integer (nullable = true)
 |    |    |    |-- c2: string (nullable = true)
 |    |    |    |-- c3: integer (nullable = true)

I am trying to flatten it, and for that I am using the following code:

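  import org.apache.spark.sql.Column
  import org.apache.spark.sql.types.{ArrayType, StructType}

  // Recursively collect one Column per leaf field, aliased with its full dotted path.
  def flattenSchema(schema: StructType, prefix: String = null): Array[Column] = {
    schema.fields.flatMap(f => {
      val colName = if (prefix == null) f.name else (prefix + "." + f.name)

      f.dataType match {
        case st: StructType => flattenSchema(st, colName)
        case at: ArrayType =>
          // Assumes every array holds structs; the cast fails for arrays of primitives.
          val st = at.elementType.asInstanceOf[StructType]
          flattenSchema(st, colName)
        case _ => Array(new Column(colName).as(colName))
      }
    })
  }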


val d1 = df.select(flattenSchema(df.schema):_*)

It gives me the following output:

 |-- str1.a1: string (nullable = true)
 |-- str1.a2: string (nullable = true)
 |-- str1.a3: string (nullable = true)
 |-- str2: string (nullable = true)
 |-- str3: string (nullable = true)
 |-- str4.b1: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- str4.b2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- str4.b3: array (nullable = true)
 |    |-- element: boolean (containsNull = true)
 |-- str4.b4.c1: array (nullable = true)
 |    |-- element: integer (containsNull = true)
 |-- str4.b4.c2: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- str4.b4.c3: array (nullable = true)
 |    |-- element: integer (containsNull = true)

The problem appears when I try to query it:

d1.select("str2").show // this works without any problem

But when I query any of the flattened nested columns:

d1.select("str1.a1")

Error:

org.apache.spark.sql.AnalysisException: cannot resolve '`str1.a1`' given input columns: ....

What am I doing wrong here? Or is there any other way to achieve the expected result?

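Spark treats a dot (.) in a column reference as access to a child field of a struct column, so a flat column whose name itself contains a dot cannot be selected by its bare name. In d1, str1.a1 is no longer a field inside a struct. It is a top-level column literally named str1.a1, and Spark looks for a struct column str1 that no longer exists. The same expression works on the original DataFrame df, because there str1 is still a struct. To query d1, you can usually wrap the name in backticks, as in d1.select("`str1.a1`"), or alias the flattened columns with a separator other than a dot, such as an underscore.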
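Both workarounds can be sketched as follows. This is a minimal sketch rather than a definitive implementation: it assumes Spark SQL is on the classpath, that no field name in the original schema contains a dot, and that an underscore is an acceptable separator. The object name FlattenExample, the sample rows, and the local SparkSession settings are made up for illustration. Unlike the code in the question, this variant only descends into arrays whose elements are structs and leaves other arrays as plain columns.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, struct}
import org.apache.spark.sql.types.{ArrayType, StructType}

object FlattenExample {
  // Same traversal as in the question, but the alias joins the path parts
  // with "_" so the resulting flat column names contain no dots.
  def flattenSchema(schema: StructType, prefix: Option[String] = None): Array[Column] =
    schema.fields.flatMap { f =>
      val path = prefix.map(_ + "." + f.name).getOrElse(f.name)
      f.dataType match {
        case st: StructType               => flattenSchema(st, Some(path))
        case ArrayType(st: StructType, _) => flattenSchema(st, Some(path))
        // Leaf field (or array of non-struct elements): select by dotted path,
        // alias with underscores.
        case _ => Array(col(path).as(path.replace(".", "_")))
      }
    }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("flatten-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: one struct column (str1) and one plain column (str2).
    val df = Seq(("x", "y", "z")).toDF("a1", "a2", "str2")
      .select(struct($"a1", $"a2").as("str1"), $"str2")

    // Option 1: keep a dotted column name and escape it with backticks when querying.
    val dotted = df.select(col("str1.a1").as("str1.a1"), col("str2"))
    dotted.select("`str1.a1`").show()

    // Option 2: flatten with underscore aliases, then query by the plain name.
    val flat = df.select(flattenSchema(df.schema): _*)
    flat.printSchema()
    flat.select("str1_a1").show()

    spark.stop()
  }
}
```

The underscore alias avoids the need for backticks in every later query, at the cost of possible name clashes if the original schema already has a column such as str1_a1.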
