
Reading a JSON file using Apache Spark

I am trying to read a JSON file using Spark v2.0.0. With simple data the code works really well, but with slightly more complex data, when I print df.show() the data is not shown correctly.

Here is my code:

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.show();

Here is my sample data:

{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

And my output looks like:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|       "glossary": {|
|        "title": ...|
|           "GlossDiv": {|
|            "titl...|
|               "GlossList": {|
|                "...|
|                 ...|
|                   "SortAs": "S...|
|                   "GlossTerm":...|
|                   "Acronym": "...|
|                   "Abbrev": "I...|
|                   "GlossDef": {|
|                 ...|
|                       "GlossSeeAl...|
|                 ...|
|                   "GlossSee": ...|
|                   }|
|                   }|
|                   }|
+--------------------+
only showing top 20 rows

You will need to format the JSON onto a single line if you want to read it this way. This is multi-line JSON, so it is not being read and loaded properly (Spark expects one object per row).

Quoting the JSON API:

Loads a JSON file (one object per line) and returns the result as a DataFrame.

{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}}}}

I just tried it in the shell; it should work the same way from your code (I got the same corrupt-record error when I read a multi-line JSON).

scala> val df = spark.read.json("C:/DevelopmentTools/data.json")
df: org.apache.spark.sql.DataFrame = [glossary: struct<GlossDiv: struct<GlossList: struct<GlossEntry: struct<Abbrev: string, Acronym: string ... 5 more fields>>, title: string>, title: string>]

scala>

Edit:

You can get the values out of that data frame using any action, for example:

scala> df.select(df("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm")).show()
+--------------------+
|           GlossTerm|
+--------------------+
|Standard Generali...|
+--------------------+


scala>

You should be able to do this from your code as well.

Just make sure your JSON is on one line. You are reading nested JSON, so if you have already done this, the JSON was loaded successfully and you are just displaying it in the wrong way. Because it is nested, you cannot show it directly; for example, if you want the title field of GlossDiv, you can show it as follows:

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.select("glossary.GlossDiv.title").show();
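
With the sample data above, that select should print something like:

+-----+
|title|
+-----+
|    S|
+-----+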

Try:

session.read().json(
        new JavaSparkContext(session.sparkContext())
            .wholeTextFiles("...")
            .values());

This thread is a little old; I just want to elaborate on what @user6022341 has suggested. I ended up using it in one of my projects:

To process a multi-line JSON file, the wholeTextFiles(String path) transformation is the only solution in Spark if the file is one big JSON object. This transformation loads the entire file content as a single string. So, if the hdfs://a-hdfs-path directory had two files, part-00000 and part-00001, calling sparkContext.wholeTextFiles("hdfs://a-hdfs-path") would return a JavaPairRDD whose key is the file name and whose value is the content of that file. This may not be the best solution and may hurt performance for bigger files. A sketch of this approach in Java is shown below.
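
A minimal Java sketch of this approach, assuming the single-object sample.json from the question (DataFrameReader.json(JavaRDD<String>) parses each element of the RDD as one complete JSON document):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());

// wholeTextFiles yields (fileName, fileContent) pairs; keep only the content
JavaRDD<String> jsonStrings = jsc.wholeTextFiles("/Users/hadoop/Desktop/sample.json").values();

// Each string in the RDD is parsed as one complete JSON document
Dataset<Row> df = session.read().json(jsonStrings);
df.show();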

But if the multi-line JSON file has multiple JSON objects split across lines, you could probably use hadoop.Configuration; some sample code is shown here. I haven't tested this myself.

If you have to read a multi-line CSV file, you can do it with Spark 2.2 (PySpark syntax):

 spark.read.csv(file, multiLine=True) 

https://issues.apache.org/jira/browse/SPARK-19610

https://issues.apache.org/jira/browse/SPARK-20980
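
Note that SPARK-20980 (linked above) renamed the wholeFile option to multiLine, and in Spark 2.2+ the same option also works for JSON, so the original file should be readable directly. A minimal Java sketch, assuming the sample.json path from the question:

// multiLine lets Spark parse one JSON document spanning multiple lines
Dataset<Row> df = session.read()
        .option("multiLine", true)
        .json("/Users/hadoop/Desktop/sample.json");
df.show();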

Hope this helps other folks looking for similar info.

Another way to read a JSON file using Java in Spark, similar to what was mentioned above:

SparkSession spark = SparkSession.builder().appName("ProcessJSONData")
                        .master("local").getOrCreate();

String path = "C:/XX/XX/myData.json";

// Encoders are created for Java bean class
Encoder<FruitJson> fruitEncoder = Encoders.bean(FruitJson.class);

Dataset<FruitJson> fruitDS = spark.read().json(path).as(fruitEncoder);

fruitDS.show();
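
FruitJson is not defined above; it stands for a plain Java bean whose properties match the JSON schema. A hypothetical example for records like {"name":"apple","weight":150}:

// Hypothetical bean; field names must match the JSON keys
public class FruitJson implements java.io.Serializable {
    private String name;
    private Long weight;

    // Encoders.bean maps columns to properties via these getters/setters
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Long getWeight() { return weight; }
    public void setWeight(Long weight) { this.weight = weight; }
}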
