I have multiple JSONs in a .gz file. I try to parse them into RDDs and then into DataFrames, but my approach is not ideal:
rdd = sc.textFile(path).map(json.loads).map(get_values)
where get_values function is something like this:
def get_values(data):
    try:
        time = data['time']
    except Exception:
        time = None
    try:
        place = data['data']['place']
    except Exception:
        place = None
    return time, place
The problem is that the JSONs differ: some of them contain certain categories and some do not, so it is difficult to write this function "by hand" so that the resulting DataFrame contains all the keys. The question is: are there any approaches/functions to automate this process?
You can read JSON files with:
df = spark.read.json(path)
Spark automatically tries to infer the schema, and you can inspect it with
df.printSchema()
If you have multi-line JSON, use the option multiLine set to true.
You can learn more about reading JSON files with Spark in the official documentation.
As general advice, when parsing JSON from a string in Spark, avoid using map with json.loads (or similar functions). Spark already provides a faster solution: the from_json function.
Since your input data is a dictionary, I guess you could use this function:
from functools import reduce

def get_values(data):
    # wrap in list() so the dict views can be filtered and concatenated in Python 3
    values = list(data.values())
    while any([isinstance(value, dict) for value in values]):
        not_dicts = list(filter(lambda value: not isinstance(value, dict), values))
        dicts = list(filter(lambda value: isinstance(value, dict), values))
        values = not_dicts + reduce(lambda l1, l2: l1 + l2,
                                    [list(dict_.values()) for dict_ in dicts])
    return values
d = {1: 1,
     2: 2,
     3: {'a': 4,
         'b': 5,
         'c': {'z': 6}}}

get_values(d)
# [1, 2, 4, 5, 6]
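If you also need the key names, for example to use them as DataFrame column names, a small variation of the same idea returns a flat dict mapping dotted key paths to leaf values. This is just a sketch; flatten is a hypothetical helper name, not a Spark function:

```python
def flatten(data, prefix=''):
    """Recursively flatten a nested dict into {'a.b.c': value} form."""
    flat = {}
    for key, value in data.items():
        path = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into nested dicts
        else:
            flat[path] = value                 # leaf value: record its full path
    return flat

d = {'time': 1, 'data': {'place': 'NY', 'extra': {'z': 6}}}
flatten(d)
# {'time': 1, 'data.place': 'NY', 'data.extra.z': 6}
```

Because every record flattens to the same kind of dict, the union of all key paths across records gives you the full column set, and the dicts can be passed to spark.createDataFrame directly.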