
Parsing JSON file of specific format 'Struct of Array of Structs' into spark dataframe

My JSON:

{"apps": {"app": [{"id": "id1","user": "hdfs"}, {"id": "id2","user": "yarn"}]}}

Schema:

root 
|-- apps: struct (nullable = true) 
| |-- app: array (nullable = true) 
| | |-- element: struct (containsNull = true) 
| | | |-- id: string (nullable = true) 
| | | |-- user: string (nullable = true)

My code:

StructType schema = new StructType()
                .add("apps",(new StructType()
                .add("app",(new StructType()))
                .add("element",new StructType().add("id",new StringType()).add("user",new StringType())
                        )));
Dataset<Row> df = sparkSession.read().schema(schema).json("<path_to_json>");

It gives me this error:

Exception in thread "main" scala.MatchError: org.apache.spark.sql.types.StringType@1fca53a7 (of class org.apache.spark.sql.types.StringType)

df.show() should show me:

id  user
id1 hdfs
id2 yarn

You do not need to provide a schema when reading the data; Spark can infer it automatically. However, some manipulation is necessary to get the output you want.

First, read the data:

Dataset<Row> df = sparkSession.read().json("<path_to_json>");

Use explode to put each array element on its own row, then use select to unpack the struct fields into separate columns.

// Java form; needs static imports of col and explode from org.apache.spark.sql.functions
df.withColumn("app", explode(col("apps.app")))
  .select("app.*");

This should give you a dataframe in the expected format.
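To make the two steps concrete outside of Spark, here is a plain-Java sketch of the same row transformation. It is only an analogy, not Spark code: the class `ExplodeSketch` and the helpers `explodeApps`, `sample`, and `struct` are hypothetical names made up for this illustration, and the nested maps stand in for the struct and array columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names come from Spark.
final class ExplodeSketch {

    // Mimics explode("apps.app") followed by select("app.*"):
    // one output row per array element, one column per struct field.
    static List<Map<String, String>> explodeApps(
            Map<String, Map<String, List<Map<String, String>>>> record) {
        List<Map<String, String>> rows = new ArrayList<>();
        // "explode": walk the apps.app array and emit one row per element
        for (Map<String, String> app : record.get("apps").get("app")) {
            // "select app.*": the struct's fields become top-level columns
            rows.add(new LinkedHashMap<>(app));
        }
        return rows;
    }

    // Builds one {"id": ..., "user": ...} struct with a stable field order.
    static Map<String, String> struct(String id, String user) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("id", id);
        fields.put("user", user);
        return fields;
    }

    // The JSON record from the question, built by hand as nested maps.
    static Map<String, Map<String, List<Map<String, String>>>> sample() {
        List<Map<String, String>> app = new ArrayList<>();
        app.add(struct("id1", "hdfs"));
        app.add(struct("id2", "yarn"));
        Map<String, List<Map<String, String>>> apps = new LinkedHashMap<>();
        apps.put("app", app);
        Map<String, Map<String, List<Map<String, String>>>> record = new LinkedHashMap<>();
        record.put("apps", apps);
        return record;
    }

    public static void main(String[] args) {
        for (Map<String, String> row : explodeApps(sample())) {
            System.out.println(row.get("id") + " " + row.get("user"));
        }
    }
}
```

Running `main` prints `id1 hdfs` and then `id2 yarn`, which are the two rows the question expects from df.show().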

@saidu's answer is correct. Although Spark can infer the schema automatically, it is advisable to provide the schema explicitly. Inference works in this scenario because both fields are strings. Take an example where the id values are integers: with schema inference, Spark would treat the column as long.

I had a similar issue, and using an auto-inferred schema was not a solution (inferior performance). Apparently, the error happens because you are using new StringType() to construct your native types. Instead, you should use the public static members and factory methods of the DataTypes class:

StructType schema = new StructType()
  .add("apps", new StructType()
    // createArrayType builds the array type; its elements may be null by default
    .add("app", DataTypes.createArrayType(new StructType()
      .add("id", DataTypes.StringType)
      .add("user", DataTypes.StringType))
  ));

Dataset<Row> df = sparkSession
  .read()
  .schema(schema)
  .json("<path_to_json>");
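
As for why `new StringType()` compiles but fails at runtime: one plausible reading, which is an assumption about Spark internals and not something stated in the answers here, is that Spark dispatches on the shared type objects such as `DataTypes.StringType`, so a freshly constructed instance matches no branch. The plain-Java sketch below imitates that behaviour with made-up stand-ins; `SingletonMatchSketch`, `TextType`, and `describe` are hypothetical and are not Spark classes.

```java
// Hypothetical stand-ins for illustration; these are NOT Spark classes.
final class SingletonMatchSketch {

    static class TextType {
        // Shared instance, playing the role of DataTypes.StringType
        static final TextType INSTANCE = new TextType();
        // equals() is not overridden, so a second instance never equals INSTANCE
    }

    // Dispatches on the shared instance, in the spirit of a Scala
    // `case StringType => ...` branch in a pattern match.
    static String describe(Object type) {
        if (TextType.INSTANCE.equals(type)) {
            return "string";
        }
        // Plays the role of scala.MatchError: no branch matched
        throw new IllegalStateException("no match for " + type);
    }

    public static void main(String[] args) {
        // The shared instance is recognised
        System.out.println(describe(TextType.INSTANCE));
        try {
            // A fresh instance has the right class but is a different object
            describe(new TextType());
        } catch (IllegalStateException e) {
            System.out.println("fresh instance rejected: " + e.getMessage());
        }
    }
}
```

If that reading holds, it would also explain why the fix is simply to pass the shared `DataTypes.StringType` instead of building a new object.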


 