I have multiple JSONs in a .gz file. I try to parse them into RDDs and then into DataFrames, but my approach is not ideal:
rdd = sc.textFile(path).map(json.loads).map(get_values)
where get_values function is something like this:
def get_values(data):
    try:
        time = data['time']
    except Exception:
        time = None
    try:
        place = data['data']['place']
    except Exception:
        place = None
    return time, place
The problem is that the JSONs differ: some of them contain certain categories and some do not, so it is difficult to write this function "by hand" so that the resulting DataFrame contains all the keys. The question is: are there any approaches/functions to automate this process?
You can read JSON files with:
df = spark.read.json(path)
Spark automatically tries to infer the schema, and you can inspect it with
df.printSchema()
If you have multi-line JSON, use the option multiLine set to true.
You can learn more about reading JSON files with Spark in the official documentation.
As general advice, when parsing JSON from a string in Spark, avoid using map with json.loads (or similar functions). Spark already provides a faster solution: the from_json function.
Since your input data is a dictionary, I guess you could use this function:
from functools import reduce

def get_values(data):
    # wrap in list() so the dict views can be filtered and concatenated in Python 3
    values = list(data.values())
    while any([isinstance(value, dict) for value in values]):
        not_dicts = list(filter(lambda value: not isinstance(value, dict), values))
        dicts = list(filter(lambda value: isinstance(value, dict), values))
        values = not_dicts + reduce(lambda l1, l2: l1 + l2,
                                    [list(dict_.values()) for dict_ in dicts])
    return values
d = {1: 1,
     2: 2,
     3: {'a': 4,
         'b': 5,
         'c': {'z': 6}}}

get_values(d)
# [1, 2, 4, 5, 6]
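If you also need the key names, for example to use them as DataFrame column names, a small variation of the same idea returns a flat dict mapping dotted key paths to leaf values. This is just a sketch; flatten is a hypothetical helper name, not a Spark function:

```python
def flatten(data, prefix=''):
    """Recursively flatten a nested dict into {'a.b.c': value} form."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into nested dicts
        else:
            flat[path] = value                 # leaf value: record its full path
    return flat

d = {'time': 1, 'data': {'place': 'NY', 'extra': {'z': 6}}}
flatten(d)
# {'time': 1, 'data.place': 'NY', 'data.extra.z': 6}
```

Because every record flattens to the same kind of dict, the union of all key paths across records gives you the full column set, and the dicts can be passed to spark.createDataFrame directly.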