How to get Schema as a Spark Dataframe from a Nested Structured Spark DataFrame
I have a sample DataFrame created with the following code:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}
import org.apache.spark.sql.functions.{struct, col}

val data = Seq(
  Row(20.0, "dog"),
  Row(3.5, "cat"),
  Row(0.000006, "ant")
)

val schema = StructType(
  List(
    StructField("weight", DoubleType, true),
    StructField("animal_type", StringType, true)
  )
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

val actualDF = df.withColumn(
  "animal_interpretation",
  struct(
    (col("weight") > 5).as("is_large_animal"),
    col("animal_type").isin("rat", "cat", "dog").as("is_mammal")
  )
)
actualDF.show(false)
+------+-----------+---------------------+
|weight|animal_type|animal_interpretation|
+------+-----------+---------------------+
|20.0 |dog |[true,true] |
|3.5 |cat |[false,true] |
|6.0E-6|ant |[false,false] |
+------+-----------+---------------------+
The schema of this Spark DataFrame can be printed with:
scala> actualDF.printSchema
root
|-- weight: double (nullable = true)
|-- animal_type: string (nullable = true)
|-- animal_interpretation: struct (nullable = false)
| |-- is_large_animal: boolean (nullable = true)
| |-- is_mammal: boolean (nullable = true)
However, I would like to get this schema in the form of a DataFrame with 3 columns - field, type, nullable. The output DataFrame of the schema would look like this:
+-------------------------------------+--------------+--------+
|field |type |nullable|
+-------------------------------------+--------------+--------+
|weight |double |true |
|animal_type |string |true |
|animal_interpretation |struct |false |
|animal_interpretation.is_large_animal|boolean |true |
|animal_interpretation.is_mammal |boolean |true |
+-------------------------------------+--------------+--------+
How can I achieve this in Spark? I am coding in Scala.
You can do something like this:
def flattenSchema(schema: StructType, prefix: String = null): Seq[(String, String, Boolean)] = {
  schema.fields.flatMap(field => {
    val col = if (prefix == null) field.name else (prefix + "." + field.name)
    field.dataType match {
      // emit a row for the struct itself, then recurse into its fields
      case st: StructType => (col, st.typeName, field.nullable) +: flattenSchema(st, col)
      case _ => Seq((col, field.dataType.simpleString, field.nullable))
    }
  })
}

// toDF on a Seq needs the implicits in scope (already imported in spark-shell):
// import spark.implicits._
flattenSchema(actualDF.schema).toDF("field", "type", "nullable").show(false)
Hope this helps!
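The recursion in flattenSchema is the heart of this answer. To make it readable in isolation, here is a Spark-free sketch of the same idea; `FieldSpec`, `Leaf`, and `Struct` are hypothetical stand-ins for Spark's `StructField`/`DataType`, not real API:

```scala
// Hypothetical schema model: a field's type is either a leaf type name or a
// struct containing child fields. Stand-ins for Spark's DataType/StructType.
sealed trait FieldType
case class Leaf(typeName: String) extends FieldType
case class Struct(fields: Seq[FieldSpec]) extends FieldType
case class FieldSpec(name: String, dataType: FieldType, nullable: Boolean)

def flatten(fields: Seq[FieldSpec], prefix: String = ""): Seq[(String, String, Boolean)] =
  fields.flatMap { f =>
    val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
    f.dataType match {
      // emit a row for the struct itself, then recurse into its children
      case Struct(children) => (path, "struct", f.nullable) +: flatten(children, path)
      case Leaf(t)          => Seq((path, t, f.nullable))
    }
  }

val sampleSchema = Seq(
  FieldSpec("weight", Leaf("double"), nullable = true),
  FieldSpec("animal_type", Leaf("string"), nullable = true),
  FieldSpec("animal_interpretation", Struct(Seq(
    FieldSpec("is_large_animal", Leaf("boolean"), nullable = true),
    FieldSpec("is_mammal", Leaf("boolean"), nullable = true)
  )), nullable = false)
)

flatten(sampleSchema).foreach(println)
```

Against the real Spark API, the only differences are matching on `StructType` instead of `Struct` and reading `field.dataType.simpleString` for leaf types.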
Here is a complete example including your code. I used a somewhat generic flattenSchema method that matches on and walks the Struct the way Shankar's does, but instead of having the method return the flattened schema, it aggregates the StructType's data types into an ArrayBuffer and returns that. I then converted the ArrayBuffer to a Sequence and finally used Spark to convert the Sequence to a DataFrame.
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}
import org.apache.spark.sql.functions.{struct, col}
import scala.collection.mutable.ArrayBuffer
val data = Seq(
  Row(20.0, "dog"),
  Row(3.5, "cat"),
  Row(0.000006, "ant")
)

val schema = StructType(
  List(
    StructField("weight", DoubleType, true),
    StructField("animal_type", StringType, true)
  )
)

val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)

val actualDF = df.withColumn(
  "animal_interpretation",
  struct(
    (col("weight") > 5).as("is_large_animal"),
    col("animal_type").isin("rat", "cat", "dog").as("is_mammal")
  )
)
val fieldStructs = new ArrayBuffer[(String, String, Boolean)]()

def flattenSchema(schema: StructType, fieldStructs: ArrayBuffer[(String, String, Boolean)], prefix: String = null): ArrayBuffer[(String, String, Boolean)] = {
  schema.fields.foreach(field => {
    val col = if (prefix == null) field.name else (prefix + "." + field.name)
    field.dataType match {
      case st: StructType =>
        // record the struct itself, then recurse into its fields
        fieldStructs += ((col, field.dataType.typeName, field.nullable))
        flattenSchema(st, fieldStructs, col)
      case _ =>
        fieldStructs += ((col, field.dataType.simpleString, field.nullable))
    }
  })
  fieldStructs
}
val foo = flattenSchema(actualDF.schema, fieldStructs).toSeq.toDF("field", "type", "nullable")
foo.show(false)
If you run the above, you should get the following:
+-------------------------------------+-------+--------+
|field |type |nullable|
+-------------------------------------+-------+--------+
|weight |double |true |
|animal_type |string |true |
|animal_interpretation |struct |false |
|animal_interpretation.is_large_animal|boolean|true |
|animal_interpretation.is_mammal |boolean|true |
+-------------------------------------+-------+--------+
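The accumulator version generalizes to arbitrarily deep nesting, since each StructType match recurses with the extended prefix. A minimal Spark-free sketch of that pattern (`Node` and `collect` are hypothetical names, not Spark API):

```scala
import scala.collection.mutable.ArrayBuffer

// Hypothetical schema node: a field with a type name, nullability, and
// optional struct children. Stand-in for Spark's StructField/StructType.
case class Node(name: String, typeName: String, nullable: Boolean, children: Seq[Node] = Nil)

def collect(nodes: Seq[Node], buf: ArrayBuffer[(String, String, Boolean)], prefix: String = ""): ArrayBuffer[(String, String, Boolean)] = {
  nodes.foreach { n =>
    val path = if (prefix.isEmpty) n.name else s"$prefix.${n.name}"
    buf += ((path, n.typeName, n.nullable))                   // row for this field
    if (n.children.nonEmpty) collect(n.children, buf, path)   // recurse into struct children
  }
  buf
}

// Three levels of nesting: a.b.c
val deep = Seq(
  Node("a", "struct", nullable = false, Seq(
    Node("b", "struct", nullable = true, Seq(
      Node("c", "string", nullable = true)
    ))
  ))
)

collect(deep, ArrayBuffer.empty[(String, String, Boolean)]).foreach(println)
```

Each level of nesting simply extends the dotted path, so the same three-column rows come out no matter how deep the structs go.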