
Putting jsons to dataframes automatically in spark

I have multiple JSON records in a .gz file. I parse them into RDDs and then into DataFrames, but the way I do it is not ideal:

rdd = sc.textFile(path).map(json.loads).map(get_values)

where the get_values function looks something like this:

def get_values(data):
    # Fall back to None when a key is missing from this record.
    try:
        time = data['time']
    except Exception:
        time = None
    try:
        place = data['data']['place']
    except Exception:
        place = None
    return time, place

The problem is that the JSON records differ: some contain several categories and some do not, so it is difficult to write this function by hand in a way that makes the DataFrame contain all the keys. Are there any approaches or functions that automate this process?

You can read JSON files directly into a DataFrame with:

df = spark.read.json(path)

Spark automatically tries to infer the schema, which you can inspect with:

df.printSchema()

If a single JSON document spans several lines, set the multiLine option to true, for example spark.read.option("multiLine", True).json(path).

You can learn more about reading JSON files with Spark in the official documentation.
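
As a rough illustration of the steps above, the sketch below wraps the read in a small helper. The helper name read_json_df, its multiline parameter, and the path in the usage note are assumptions made for this example; spark is expected to be an existing SparkSession.

```python
def read_json_df(spark, path, multiline=False):
    """Read JSON (plain or .gz) into a DataFrame and let Spark infer the schema.

    spark     -- an existing SparkSession (assumed to be created elsewhere)
    path      -- file, directory, or glob; .gz files are decompressed by extension
    multiline -- set True when one JSON document spans several lines
    """
    # "multiLine" is the JSON reader option mentioned above.
    reader = spark.read.option("multiLine", multiline)
    return reader.json(path)
```

With a session at hand, df = read_json_df(spark, "events/*.json.gz") followed by df.printSchema() shows the inferred columns. Keys that are missing from some records show up as nulls, and nested keys can be selected with dotted names such as df.select("time", "data.place").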


Parsing json from string in Spark

As general advice, when parsing JSON from a string in Spark, avoid combining map with json.loads (or similar functions).

There is a faster solution already available in Spark: the from_json function.
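
A minimal sketch of that approach follows, assuming a DataFrame whose string column (named value here, as produced by spark.read.text) holds one JSON document per row. The helper name, the column name, and the schema string are assumptions for illustration; the schema mirrors the time and data.place keys from the question, typed as strings for simplicity.

```python
# DDL-style schema for the two keys used in the question (types are assumed).
EVENT_SCHEMA = "time STRING, data STRUCT<place: STRING>"


def parse_json_column(df, json_col="value", schema=EVENT_SCHEMA):
    """Parse a JSON string column with from_json and expose time and place."""
    # Imported here so this module can be loaded without pyspark installed.
    from pyspark.sql.functions import col, from_json

    parsed = df.select(from_json(col(json_col), schema).alias("j"))
    # Keys absent from a record come back as null instead of raising.
    return parsed.select("j.time", "j.data.place")
```

Unlike spark.read.json, from_json needs the schema up front, so this variant fits best when the set of keys is known or has been inferred once from a sample.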

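Since your input data is a dictionary, I guess you could use this function, which collects every leaf value from a nested dictionary:

from functools import reduce

def get_values(data):
    # Materialize the view as a list so it can be concatenated later.
    values = list(data.values())
    # Keep unpacking until no nested dicts remain.
    while any(isinstance(value, dict) for value in values):
        not_dicts = [value for value in values if not isinstance(value, dict)]
        dicts = [value for value in values if isinstance(value, dict)]
        # list() is needed on Python 3, where dict.values() returns a view.
        values = not_dicts + reduce(lambda l1, l2: l1 + l2, [list(dict_.values()) for dict_ in dicts])
    return values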

d = {1:1,
     2:2,
     3:{'a':4,
        'b': 5,
        'c': {'z': 6}
       }
    }
get_values(d)

[1, 2, 4, 5, 6]
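
Because get_values returns only the values, the link between a value and its key is lost, which makes it hard to line records up as columns. A possible variation, sketched below, keeps the full key path instead. The name flatten, the dot separator, and the sample record are made up for this example and are not part of the answer above.

```python
def flatten(data, parent_key="", sep="."):
    """Return a flat dict mapping dotted key paths to leaf values."""
    flat = {}
    for key, value in data.items():
        path = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the path built so far.
            flat.update(flatten(value, path, sep))
        else:
            flat[path] = value
    return flat


# Hypothetical record shaped like the one in the question.
record = {"time": "12:00", "data": {"place": "Oslo", "geo": {"lat": 59.9}}}
print(flatten(record))
# {'time': '12:00', 'data.place': 'Oslo', 'data.geo.lat': 59.9}
```

Records flattened this way can still differ in their key sets; taking the union of keys across records gives the full list of columns.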
