僅展平 Scala Spark 數據幀中的最深級別

Question

我有一個 Spark 作業，它有一個具有以下值的 DataFrame：

{
  "id": "abchchd",
  "test_id": "ndsbsb",
  "props": {
    "type": {
      "isMale": true,
      "id": "dd",
      "mcc": 1234,
      "name": "Adam"
    }
  }
}

{
  "id": "abc",
  "test_id": "asf",
  "props": {
    "type2": {
      "isMale": true,
      "id": "dd",
      "mcc": 12134,
      "name": "Perth"
    }
  }
}

我想優雅地將它展平（因為沒有未知的鍵和類型等），這樣道具仍然是一個struct但它里面的所有東西都被展平了（不管嵌套的級別如何）

所需的輸出是：

{
  "id": "abchchd",
  "test_id": "ndsbsb",
  "props": {
    "type.isMale": true,
    "type.id": "dd",
    "type.mcc": 1234,
    "type.name": "Adam"
  }
}

{
  "id": "abc",
  "test_id": "asf",
  "props": {
      "type2.isMale": true,
      "type2.id": "dd",
      "type2.mcc": 12134,
      "type2.name": "Perth"
  }
}

我使用了Automatically and Elegantly flatten DataFrame in Spark SQL 中提到的解決方案

但是，我無法保持道具場完好無損。 它也會被壓扁。 有人可以幫助我擴展此解決方案嗎？

最終的架構應該是這樣的：

root
 |-- id: string (nullable = true)
 |-- props: struct (nullable = true)
 |    |-- type.id: string (nullable = true)
 |    |-- type.isMale: boolean (nullable = true)
 |    |-- type.mcc: long (nullable = true)
 |    |-- type.name: string (nullable = true)
      |-- type2.id: string (nullable = true)
 |    |-- type2.isMale: boolean (nullable = true)
 |    |-- type2.mcc: long (nullable = true)
 |    |-- type2.name: string (nullable = true)
 |-- test_id: string (nullable = true)

Answer 1

我已經能夠使用 RDD API 實現這一點：

val jsonRDD = df.rdd.map{row =>
  def unnest(r: Row): Map[String, Any] = {
    r.schema.fields.zipWithIndex.flatMap{case (f, i) =>
      (f.name, f.dataType) match {
        case ("props", _:StructType) =>
          val propsObject = r.getAs[Row](f.name)
          Map(f.name -> propsObject.schema.fields.flatMap{propsAttr =>
            val subObject = propsObject.getAs[Row](propsAttr.name)
            subObject.schema.fields.map{subField =>
              s"${propsAttr.name}.${subField.name}" -> subObject.get(subObject.fieldIndex(subField.name))
            }
          }.toMap)
        case (fname, _: StructType) => Map(fname -> unnest(r.getAs[Row](fname)))
        case (fname, ArrayType(_: StructType,_)) => Map(fname -> r.getAs[Seq[Row]](fname).map(unnest))
        case _ => Map(f.name -> r.get(i))
      }
    }
  }.toMap

  val asMap = unnest(row)
  new ObjectMapper().registerModule(DefaultScalaModule).writeValueAsString(asMap)
}

val finalDF = spark.read.json(jsonRDD.toDS).cache

由於遞歸，解決方案應該接受深度嵌套的輸入。

有了你的數據，我們得到了：

finalDF.printSchema()
finalDF.show(false)
finalDF.select("props.*").show()

輸出：

root
 |-- id: string (nullable = true)
 |-- props: struct (nullable = true)
 |    |-- type.id: string (nullable = true)
 |    |-- type.isMale: boolean (nullable = true)
 |    |-- type.mcc: long (nullable = true)
 |    |-- type.name: string (nullable = true)
 |-- test_id: string (nullable = true)

+-------+----------------------+-------+
|id     |props                 |test_id|
+-------+----------------------+-------+
|abchchd|[dd, true, 1234, Adam]|ndsbsb |
+-------+----------------------+-------+

+-------+-----------+--------+---------+
|type.id|type.isMale|type.mcc|type.name|
+-------+-----------+--------+---------+
|     dd|       true|    1234|     Adam|
+-------+-----------+--------+---------+

但是我們也可以傳遞更多嵌套/復雜的結構，例如：

val str2 = """{"newroot":[{"mystruct":{"id":"abchchd","test_id":"ndsbsb","props":{"type":{"isMale":true,"id":"dd","mcc":1234,"name":"Adam"}}}}]}"""

...

finalDF.printSchema()
finalDF.show(false)

給出以下輸出：

root
 |-- newroot: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- mystruct: struct (nullable = true)
 |    |    |    |-- id: string (nullable = true)
 |    |    |    |-- props: struct (nullable = true)
 |    |    |    |    |-- type.id: string (nullable = true)
 |    |    |    |    |-- type.isMale: boolean (nullable = true)
 |    |    |    |    |-- type.mcc: long (nullable = true)
 |    |    |    |    |-- type.name: string (nullable = true)
 |    |    |    |-- test_id: string (nullable = true)

+---------------------------------------------+
|root                                         |
+---------------------------------------------+
|[[[abchchd, [dd, true, 1234, Adam], ndsbsb]]]|
+---------------------------------------------+

編輯：正如您所提到的，如果您有不同結構的記錄，您需要將上述subObject值包裝在一個選項中。
這是固定的unnest函數：

def unnest(r: Row): Map[String, Any] = {
  r.schema.fields.zipWithIndex.flatMap{case (f, i) =>
    (f.name, f.dataType) match {
      case ("props", _:StructType) =>
        val propsObject = r.getAs[Row](f.name)
        Map(f.name -> propsObject.schema.fields.flatMap{propsAttr =>
          val subObjectOpt = Option(propsObject.getAs[Row](propsAttr.name))
          subObjectOpt.toSeq.flatMap{subObject => subObject.schema.fields.map{subField =>
            s"${propsAttr.name}.${subField.name}" -> subObject.get(subObject.fieldIndex(subField.name))
          }}
        }.toMap)
      case (fname, _: StructType) => Map(fname -> unnest(r.getAs[Row](fname)))
      case (fname, ArrayType(_: StructType,_)) => Map(fname -> r.getAs[Seq[Row]](fname).map(unnest))
      case _ => Map(f.name -> r.get(i))
    }
  }
}.toMap

新的printSchema給出：

root
 |-- id: string (nullable = true)
 |-- props: struct (nullable = true)
 |    |-- type.id: string (nullable = true)
 |    |-- type.isMale: boolean (nullable = true)
 |    |-- type.mcc: long (nullable = true)
 |    |-- type.name: string (nullable = true)
 |    |-- type2.id: string (nullable = true)
 |    |-- type2.isMale: boolean (nullable = true)
 |    |-- type2.mcc: long (nullable = true)
 |    |-- type2.name: string (nullable = true)
 |-- test_id: string (nullable = true)

僅展平 Scala Spark 數據幀中的最深級別

問題描述

1 個解決方案

解決方案1
5 已采納 2020-01-09 21:26:19

僅展平 Scala Spark 數據幀中的最深級別

問題描述

1 個解決方案

解決方案1 5 已采納 2020-01-09 21:26:19

解決方案1
5 已采納 2020-01-09 21:26:19