
Spark get column names of nested json

I'm trying to get column names from a nested JSON via DataFrames. The schema is given below:

root
 |-- body: struct (nullable = true)
 |    |-- Sw1: string (nullable = true)
 |    |-- Sw2: string (nullable = true)
 |    |-- Sw3: string (nullable = true)
 |    |-- Sw420: string (nullable = true)
 |-- headers: struct (nullable = true)
 |    |-- endDate: string (nullable = true)
 |    |-- file: string (nullable = true)
 |    |-- startDate: string (nullable = true)

I can get the column names "body" and "headers" with df.columns, but when I try to get the column names inside body (e.g. Sw1, Sw2, ...) with df.select("body").columns, it always gives me just the body column.
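
For illustration, a minimal sketch of the observed behaviour, given the schema above:

df.columns                 // Array(body, headers)
df.select("body").columns  // Array(body) -- still just the top-level struct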

Any suggestion? :)

If the question is how to find the nested column names, you can do this by inspecting the schema of the DataFrame. The schema is represented as a StructType, which can contain fields of other DataType objects (including other nested structs). If you want to discover all the fields you'll have to walk this tree recursively. For example:

import org.apache.spark.sql.types._
def findFields(path: String, dt: DataType): Unit = dt match {
  case s: StructType =>
    // recurse into each field of the struct, extending the dotted path
    s.fields.foreach(f => findFields(path + "." + f.name, f.dataType))
  case other =>
    // leaf field: print its full path and type
    println(s"$path: $other")
}

This walks the tree and prints out all the leaf fields and their type:

val df = sqlContext.read.json(sc.parallelize("""{"a": {"b": 1}}""" :: Nil))
findFields("", df.schema)

prints: .a.b: LongType
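
On Spark 2.x and later, the same walk works from a SparkSession instead of sqlContext (a minimal sketch, assuming a spark session is in scope):

import spark.implicits._
val df = spark.read.json(Seq("""{"a": {"b": 1}}""").toDS)
findFields("", df.schema)  // prints .a.b: LongType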

If the nested JSON has an array of StructTypes, then the following code can be used (it is an extension of the code given by Michael Armbrust above):

import org.apache.spark.sql.types._

def findFields(path: String, dt: DataType): Unit = dt match {
  case s: StructType =>
    s.fields.foreach(f => findFields(path + "." + f.name, f.dataType))
  case a: ArrayType =>
    // descend into the array's element type, keeping the same path
    findFields(path, a.elementType)
  case other =>
    println(s"$path: $other")
}
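
For example, running it on sample JSON with an array of structs (sample data made up for illustration, assuming a Spark 2.x spark session as above):

import spark.implicits._
val df = spark.read.json(Seq("""{"a": [{"b": 1}]}""").toDS)
findFields("", df.schema)  // prints .a.b: LongType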

To get the nested column names, use code like the following.

From the main method, call it like this:

findFields(df, df.schema)

Method:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// df is unused here but kept to match the call above
def findFields(df: DataFrame, dt: DataType): Unit = {
  // each top-level field is expected to be a struct
  val fields = dt.asInstanceOf[StructType].fields
  for (value <- fields) {
    // value.dataType is the nested StructType (no need for the productElement hack)
    val colNames = value.dataType.asInstanceOf[StructType].fields
    for (f <- colNames)
      println("Inner Columns of " + value.name + " -->>" + f.name)
  }
}

Note: this will only work when all top-level columns are of struct type.
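
For the schema in the question, this would print something like:

Inner Columns of body -->>Sw1
Inner Columns of body -->>Sw2
Inner Columns of body -->>Sw3
Inner Columns of body -->>Sw420
Inner Columns of headers -->>endDate
Inner Columns of headers -->>file
Inner Columns of headers -->>startDate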

It's simple: df.select("body.Sw1", "body.Sw2")
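
Alternatively, you can expand the whole struct with df.select("body.*"), or build the column list from the schema. A minimal sketch of the latter (assuming the imports shown):

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

val bodyCols = df.schema("body").dataType
  .asInstanceOf[StructType]
  .fieldNames
  .map(name => col(s"body.$name"))
df.select(bodyCols: _*).columns  // Array(Sw1, Sw2, Sw3, Sw420)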
