
Spark get column names of nested json

I'm trying to get column names from a nested JSON via DataFrames. The schema is given below:

root
 |-- body: struct (nullable = true)
 |    |-- Sw1: string (nullable = true)
 |    |-- Sw2: string (nullable = true)
 |    |-- Sw3: string (nullable = true)
 |    |-- Sw420: string (nullable = true)
 |-- headers: struct (nullable = true)
 |    |-- endDate: string (nullable = true)
 |    |-- file: string (nullable = true)
 |    |-- startDate: string (nullable = true)

I can get the column names "body" and "headers" with df.columns, but when I try to get the column names inside body (e.g. Sw1, Sw2, ...) with df.select("body").columns, it always gives me just the body column.
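
For illustration, a minimal sketch of the observed behaviour, given the schema above:

df.columns                 // Array(body, headers)
df.select("body").columns  // Array(body) -- still just the top-level struct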

Any suggestion? :)

If the question is how to find the nested column names, you can do this by inspecting the schema of the DataFrame. The schema is represented as a StructType, which can contain fields of other DataType objects (including other nested structs). If you want to discover all the fields you'll have to walk this tree recursively. For example:

import org.apache.spark.sql.types._
def findFields(path: String, dt: DataType): Unit = dt match {
  case s: StructType =>
    // recurse into each field of the struct, extending the dotted path
    s.fields.foreach(f => findFields(path + "." + f.name, f.dataType))
  case other =>
    // leaf field: print its full path and type
    println(s"$path: $other")
}

This walks the tree and prints out all the leaf fields and their type:

val df = sqlContext.read.json(sc.parallelize("""{"a": {"b": 1}}""" :: Nil))
findFields("", df.schema)

prints: .a.b: LongType
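
On Spark 2.x and later, the same walk works from a SparkSession instead of sqlContext (a minimal sketch, assuming a spark session is in scope):

import spark.implicits._
val df = spark.read.json(Seq("""{"a": {"b": 1}}""").toDS)
findFields("", df.schema)  // prints .a.b: LongType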

If the nested JSON has an array of StructTypes, then the following code can be used (it is an extension of the code given by Michael Armbrust above):

import org.apache.spark.sql.types._

def findFields(path: String, dt: DataType): Unit = dt match {
  case s: StructType =>
    s.fields.foreach(f => findFields(path + "." + f.name, f.dataType))
  case a: ArrayType =>
    // descend into the array's element type, keeping the same path
    findFields(path, a.elementType)
  case other =>
    println(s"$path: $other")
}
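
For example, running it on sample JSON with an array of structs (sample data made up for illustration, assuming a Spark 2.x spark session as above):

import spark.implicits._
val df = spark.read.json(Seq("""{"a": [{"b": 1}]}""").toDS)
findFields("", df.schema)  // prints .a.b: LongType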

To get the nested column names, use code like the following.

From the main method, call it like this:

findFields(df, df.schema)

Method:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// df is unused here but kept to match the call above
def findFields(df: DataFrame, dt: DataType): Unit = {
  // each top-level field is expected to be a struct
  val fields = dt.asInstanceOf[StructType].fields
  for (value <- fields) {
    // value.dataType is the nested StructType (no need for the productElement hack)
    val colNames = value.dataType.asInstanceOf[StructType].fields
    for (f <- colNames)
      println("Inner Columns of " + value.name + " -->>" + f.name)
  }
}

Note: this will only work when all top-level columns are of struct type.
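
For the schema in the question, this would print something like:

Inner Columns of body -->>Sw1
Inner Columns of body -->>Sw2
Inner Columns of body -->>Sw3
Inner Columns of body -->>Sw420
Inner Columns of headers -->>endDate
Inner Columns of headers -->>file
Inner Columns of headers -->>startDate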

It's simple: df.select("body.Sw1", "body.Sw2")
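
Alternatively, you can expand the whole struct with df.select("body.*"), or build the column list from the schema. A minimal sketch of the latter (assuming the imports shown):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

val bodyCols = df.schema("body").dataType
  .asInstanceOf[StructType]
  .fieldNames
  .map(name => col(s"body.$name"))
df.select(bodyCols: _*).columns  // Array(Sw1, Sw2, Sw3, Sw420)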
