
Reading pretty-printed JSON files in Apache Spark

I have a lot of JSON files in my S3 bucket, and I want to be able to read and query them. The problem is that they are pretty-printed: each file contains one massive dictionary, but it is spread over many lines instead of sitting on a single line. As per this thread, each dictionary in a JSON file should be on one line, which is a limitation of Apache Spark, and my files are not structured that way.

My JSON schema looks like this -

{
    "dataset": [
        {
            "key1": [
                {
                    "range": "range1",
                    "value": 0.0
                },
                {
                    "range": "range2",
                    "value": 0.23
                }
            ]
        }, {..}, {..}
    ],
    "last_refreshed_time": "2016/09/08 15:05:31"
}

Here are my questions -

  1. Can I avoid converting these files to match the schema required by Apache Spark (one dictionary per line in a file) and still be able to read them?

  2. If not, what's the best way to do it in Python? I have a bunch of these files for each day in the bucket. The bucket is partitioned by day.

  3. Is there any tool better suited than Apache Spark for querying these files? I'm on the AWS stack, so I can try out any suggested tool from a Zeppelin notebook.

You could use sc.wholeTextFiles(), which reads each file as a single (path, content) record, so a pretty-printed file is no longer split across lines. Here is a related post.
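A minimal sketch of that approach in PySpark, assuming an sqlContext is available and using a placeholder bucket path: parse each whole file with json.loads, flatten the "dataset" array from the schema above into individual records, and hand those back to Spark as one-line JSON strings.

import json

# wholeTextFiles yields (path, file_content) pairs, so each
# pretty-printed file arrives as one complete string.
raw = sc.wholeTextFiles("s3a://your-bucket/2016/09/08/")  # placeholder path

# Parse every file, then flatten the "dataset" array into one record per entry.
records = (raw
           .map(lambda kv: json.loads(kv[1]))
           .flatMap(lambda doc: doc["dataset"]))

# Re-serialize each record as a one-line JSON string so Spark can infer a schema.
df = sqlContext.read.json(records.map(json.dumps))
df.printSchema()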

Alternatively, you could reformat your JSON with a simple function and load the generated file.

import json

def reformat_json(input_path, output_path):
    # Parse the whole pretty-printed file into one Python object.
    with open(input_path, 'r') as handle:
        jarr = json.load(handle)

    # Write one compact JSON object per line, the layout Spark expects.
    # For the schema above, iterate jarr["dataset"] rather than jarr itself.
    with open(output_path, 'w') as out:
        for entry in jarr:
            out.write(json.dumps(entry) + "\n")
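For example (hypothetical paths; note that open() reads from the local filesystem, so files from S3 would need to be downloaded first):

reformat_json("raw/2016-09-08/data.json", "flat/2016-09-08/data.json")
df = sqlContext.read.json("flat/2016-09-08/")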
