
Reading JSON file using Apache Spark

I am trying to read a JSON file using Spark v2.0.0. For simple data the code works really well, but with slightly more complex data, when I print df.show(), the data is not displayed correctly.

Here is my code:

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.show();

Here is my sample data:

{
    "glossary": {
        "title": "example glossary",
        "GlossDiv": {
            "title": "S",
            "GlossList": {
                "GlossEntry": {
                    "ID": "SGML",
                    "SortAs": "SGML",
                    "GlossTerm": "Standard Generalized Markup Language",
                    "Acronym": "SGML",
                    "Abbrev": "ISO 8879:1986",
                    "GlossDef": {
                        "para": "A meta-markup language, used to create markup languages such as DocBook.",
                        "GlossSeeAlso": ["GML", "XML"]
                    },
                    "GlossSee": "markup"
                }
            }
        }
    }
}

And my output looks like this:

+--------------------+
|     _corrupt_record|
+--------------------+
|                   {|
|       "glossary": {|
|        "title": ...|
|           "GlossDiv": {|
|            "titl...|
|               "GlossList": {|
|                "...|
|                 ...|
|                   "SortAs": "S...|
|                   "GlossTerm":...|
|                   "Acronym": "...|
|                   "Abbrev": "I...|
|                   "GlossDef": {|
|                 ...|
|                       "GlossSeeAl...|
|                 ...|
|                   "GlossSee": ...|
|                   }|
|                   }|
|                   }|
+--------------------+
only showing top 20 rows

You will need to format the JSON onto one line if you have to read this JSON. It is a multi-line JSON and hence is not being read and loaded properly (one object, one row).

Quoting the JSON API:

Loads a JSON file (one object per line) and returns the result as a DataFrame.

{"glossary":{"title":"example glossary","GlossDiv":{"title":"S","GlossList":{"GlossEntry":{"ID":"SGML","SortAs":"SGML","GlossTerm":"Standard Generalized Markup Language","Acronym":"SGML","Abbrev":"ISO 8879:1986","GlossDef":{"para":"A meta-markup language, used to create markup languages such as DocBook.","GlossSeeAlso":["GML","XML"]},"GlossSee":"markup"}}}}}

I just tried it in the shell; it should work from code the same way as well (I had the same corrupt-record error when I read a multi-line JSON):

scala> val df = spark.read.json("C:/DevelopmentTools/data.json")
df: org.apache.spark.sql.DataFrame = [glossary: struct<GlossDiv: struct<GlossList: struct<GlossEntry: struct<Abbrev: string, Acronym: string ... 5 more fields>>, title: string>, title: string>]

scala>

Edit:

You can get the values out of that data frame using any action, for example:

scala> df.select(df("glossary.GlossDiv.GlossList.GlossEntry.GlossTerm")).show()
+--------------------+
|           GlossTerm|
+--------------------+
|Standard Generali...|
+--------------------+


scala>

You should be able to do it from your code as well.

Just make sure your JSON is on one line, since you are reading nested JSON. If you have already done this, then you have loaded the JSON successfully; you are just displaying it the wrong way. Because it is nested JSON, you cannot show it directly. For example, if you want the title data of GlossDiv, you can show it as follows:

SparkSession session = SparkSession.builder().master("local").appName("jsonreader").getOrCreate();
Dataset<Row> list = session.read().json("/Users/hadoop/Desktop/sample.json");
list.select("glossary.GlossDiv.title").show();
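
With the sample data above, that select should print something like:

+-----+
|title|
+-----+
|    S|
+-----+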

Try:

// Scala: wholeTextFiles yields (path, content) pairs; keep only the content
session.read.json(session.sparkContext.wholeTextFiles("...").values)

This thread is a little old; I just want to elaborate on what @user6022341 has suggested. I ended up using it in one of my projects:

To process a multi-line JSON file, the wholeTextFiles(String path) transformation is the only solution in Spark if the file is one big JSON object. This transformation loads the entire file content as a single string. So, if the hdfs://a-hdfs-path directory had two files, part-00000 and part-00001, calling sparkContext.wholeTextFiles("hdfs://a-hdfs-path") would make Spark return a JavaPairRDD whose key is the file name and whose value is the content of the file. This may not be the best solution and may hurt performance for bigger files.
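
As a minimal Java sketch of this approach (assuming Spark 2.0, where DataFrameReader.json still accepts a JavaRDD<String>; the path is the asker's sample path):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession session = SparkSession.builder()
        .master("local").appName("jsonreader").getOrCreate();
JavaSparkContext jsc = new JavaSparkContext(session.sparkContext());

// wholeTextFiles returns (fileName, fileContent) pairs; keep only the content
JavaRDD<String> json = jsc
        .wholeTextFiles("/Users/hadoop/Desktop/sample.json")
        .values();

// Each element of the RDD is one complete JSON document
Dataset<Row> df = session.read().json(json);
df.show(false);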

But if the multi-line JSON file has multiple JSON objects split across multiple lines, then you could probably use hadoop.Configuration; some sample code is shown here. I haven't tested this out myself.

If you had to read a multi-line CSV file, you could do this with Spark 2.2:

 spark.read.csv(file, multiLine=True) 

https://issues.apache.org/jira/browse/SPARK-19610

https://issues.apache.org/jira/browse/SPARK-20980
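
Since SPARK-20980 renamed the wholeFile option to multiLine for both JSON and CSV, in Spark 2.2+ you should also be able to read the original multi-line sample file directly, without flattening it; a sketch in Java (not available in the asker's 2.0.0):

// Spark 2.2+ only: read a JSON document that spans multiple lines
Dataset<Row> df = session.read()
        .option("multiLine", true)
        .json("/Users/hadoop/Desktop/sample.json");
df.show(false);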

Hope this helps other folks looking for similar info.

Another way to read a JSON file using Java in Spark, similar to the ones mentioned above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoder;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder().appName("ProcessJSONData")
        .master("local").getOrCreate();

String path = "C:/XX/XX/myData.json";

// An encoder is created for the Java bean class
Encoder<FruitJson> fruitEncoder = Encoders.bean(FruitJson.class);

// Map each JSON object to a typed FruitJson instance
Dataset<FruitJson> fruitDS = spark.read().json(path).as(fruitEncoder);

fruitDS.show();
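
Here FruitJson stands for a plain Java bean whose properties match the JSON keys; a minimal hypothetical example (the name/price fields are illustrative assumptions, not from the original post):

// Hypothetical bean: field names must match the keys in myData.json
public class FruitJson implements java.io.Serializable {
    private String name;
    private Double price;

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Double getPrice() { return price; }
    public void setPrice(Double price) { this.price = price; }
}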
