I have several JSON files with nested data. Using Python, I was able to convert the simple ones to CSV with pandas:
import pandas as pd
df = pd.read_json('data.json')
df.to_csv('data.csv', index=False, header=True)
However, this only works for simple JSON files. Mine are more complex, with nested objects and arrays, and the nested data ends up merged into single columns. For example, take this sample data:
data.json
[
{
"id": 1,
"name": {
"english": "Bulbasaur",
"french": "Bulbizarre"
},
"type": [
"Grass",
"Poison"
],
"base": {
"HP": 45,
"Attack": 49,
"Defense": 49
}
},
{
"id": 2,
"name": {
"english": "Ivysaur",
"french": "Herbizarre"
},
"type": [
"Grass",
"Poison"
],
"base": {
"HP": 60,
"Attack": 62,
"Defense": 63
}
}
]
The result ends up like the following:
You can see that any nested object or array below the first level is written out as a raw literal (e.g. {'english': 'Bulbasaur', 'french': 'Bulbizarre'}). Ideally, each nested element should be broken out into its own column, named after the element:
On top of that, the other JSON files have different element names and order. Therefore, the script should catch all of the different element names and then convert them into CSV columns.
How can I achieve this?
Check out flatten_json:
import pandas as pd
from flatten_json import flatten
dic = [
{
"id": 1,
"name": {
"english": "Bulbasaur",
"french": "Bulbizarre"
},
"type": [
"Grass",
"Poison"
],
"base": {
"HP": 45,
"Attack": 49,
"Defense": 49
}
},
{
"id": 2,
"name": {
"english": "Ivysaur",
"french": "Herbizarre"
},
"type": [
"Grass",
"Poison"
],
"base": {
"HP": 60,
"Attack": 62,
"Defense": 63
}
}
]
dic_flattened = (flatten(d, '.') for d in dic)
df = pd.DataFrame(dic_flattened)
Output:
id name.english name.french type.0 type.1 base.HP base.Attack base.Defense
0 1 Bulbasaur Bulbizarre Grass Poison 45 49 49
1 2 Ivysaur Herbizarre Grass Poison 60 62 63
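If installing a third-party package is not an option, the same idea can be sketched in plain Python. The helper below, flatten_record, is a hypothetical name and not part of flatten_json or pandas; it is a minimal recursive sketch that assumes the input contains only dicts, lists, and scalar values, and it produces the same style of dotted keys as the output above.

```python
def flatten_record(value, parent_key="", sep="."):
    """Flatten nested dicts/lists into a single-level dict with dotted keys.

    Hypothetical helper for illustration; empty dicts and lists
    produce no keys at all in this sketch.
    """
    items = {}
    if isinstance(value, dict):
        children = value.items()
    elif isinstance(value, list):
        # List positions become key parts: type.0, type.1, ...
        children = enumerate(value)
    else:
        # Scalar leaf: store it under the key built so far.
        items[parent_key] = value
        return items
    for key, child in children:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        items.update(flatten_record(child, new_key, sep))
    return items


record = {
    "id": 1,
    "name": {"english": "Bulbasaur", "french": "Bulbizarre"},
    "type": ["Grass", "Poison"],
    "base": {"HP": 45, "Attack": 49, "Defense": 49},
}
flat = flatten_record(record)
print(flat)
```

A list of such flattened dicts can be passed to pd.DataFrame in the same way as the generator above.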
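Using json_normalize will get you almost there, but splitting the list into columns takes an extra step:
import pandas as pd
df = pd.json_normalize(data)  # data is the sample list from the question
# Name the list-derived columns type.1, type.2, ...
f = lambda x: 'type.{}'.format(x + 1)
df = df.join(pd.DataFrame(df.pop('type').values.tolist()).rename(columns=f))
print(df)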
Output
id name.english name.french ... base.Defense type.1 type.2
0 1 Bulbasaur Bulbizarre ... 49 Grass Poison
1 2 Ivysaur Herbizarre ... 63 Grass Poison
[2 rows x 8 columns]
I'd suggest a for loop coupled with a defaultdict. For iterations that do not involve aggregations, it is usually easier and faster to stay out of pandas until the final output:
from collections import defaultdict
import pandas as pd

df = defaultdict(list)
box = []
for entry in data:  # data is the sample data you shared
    for key, value in entry.items():
        if key == "id":
            temp = [(key, value)]
        elif isinstance(value, dict):
            # Nested object: one column per child key
            temp = [(f"{key}.{k}", v) for k, v in value.items()]
        else:
            # List: one column per position, numbered from 1
            temp = [(f"{key}.{k}", v) for k, v in enumerate(value, 1)]
        box.extend(temp)

# Group the collected (column, value) pairs by column name
for k, v in box:
    df[k].append(v)
df
defaultdict(list,
{'id': [1, 2],
'name.english': ['Bulbasaur', 'Ivysaur'],
'name.french': ['Bulbizarre', 'Herbizarre'],
'type.1': ['Grass', 'Grass'],
'type.2': ['Poison', 'Poison'],
'base.HP': [45, 60],
'base.Attack': [49, 62],
'base.Defense': [49, 63]})
Create the DataFrame:
pd.DataFrame(df)
id name.english name.french type.1 type.2 base.HP base.Attack base.Defense
0 1 Bulbasaur Bulbizarre Grass Poison 45 49 49
1 2 Ivysaur Herbizarre Grass Poison 60 62 63
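The question also mentions that other files use different element names and ordering. One way to handle that, sketched here with only the standard library, is to collect the union of all keys across the flattened rows and let csv.DictWriter fill the gaps. The function name write_rows_as_csv and the "species" field are assumptions made up for this illustration; the rows are assumed to be already flat (for example, produced by any of the approaches above).

```python
import csv
import io


def write_rows_as_csv(rows, out):
    """Write flat dicts that may have different keys to a CSV stream.

    Hypothetical helper: columns are the union of all keys, in
    first-seen order; missing values are written as empty strings.
    """
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return fieldnames


# Two already-flattened rows with different keys ("species" is invented).
rows = [
    {"id": 1, "name.english": "Bulbasaur", "type.0": "Grass"},
    {"id": 2, "name.english": "Ivysaur", "species": "Seed"},
]
buffer = io.StringIO()
columns = write_rows_as_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of a buffer, pass a file opened with open('data.csv', 'w', newline='') as the out argument. With pandas, pd.DataFrame(rows) performs a similar alignment and fills missing cells with NaN.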