
Convert JSON to CSV with complex arrays in Python

I have several JSON files with nested data. Using Python and pandas, I was able to convert a simple file to CSV:

import pandas as pd

df = pd.read_json(r'data.json')
df.to_csv(r'data.csv', index=False, header=True)

However, this only works for simple JSON files. Mine are more complex: they contain nested objects and arrays, and the nested data ends up merged into a single column. For example, take this sample data:

data.json

[
  {
    "id": 1,
    "name": {
      "english": "Bulbasaur",
      "french": "Bulbizarre"
    },
    "type": [
      "Grass",
      "Poison"
    ],
    "base": {
      "HP": 45,
      "Attack": 49,
      "Defense": 49
    }
  },
  {
    "id": 2,
    "name": {
      "english": "Ivysaur",
      "french": "Herbizarre"
    },
    "type": [
      "Grass",
      "Poison"
    ],
    "base": {
      "HP": 60,
      "Attack": 62,
      "Defense": 63
    }
  }
]

The result ends up like the following:

[image: CSV output]

You can see that anything nested below the first level is shown as raw JSON (e.g. {'english': 'Bulbasaur', 'french': 'Bulbizarre'}). Ideally, each child element should get its own column, named after the element:

[image: Expected output]

On top of that, the other JSON files have different element names and order. Therefore, the script should catch all of the different element names and then convert them into CSV columns.

How can I achieve this?

Check out flatten_json:

import pandas as pd
from flatten_json import flatten
dic = [
  {
    "id": 1,
    "name": {
      "english": "Bulbasaur",
      "french": "Bulbizarre"
    },
    "type": [
      "Grass",
      "Poison"
    ],
    "base": {
      "HP": 45,
      "Attack": 49,
      "Defense": 49
    }
  },
  {
    "id": 2,
    "name": {
      "english": "Ivysaur",
      "french": "Herbizarre"
    },
    "type": [
      "Grass",
      "Poison"
    ],
    "base": {
      "HP": 60,
      "Attack": 62,
      "Defense": 63
    }
  }
]

dic_flattened = (flatten(d, '.') for d in dic)
df = pd.DataFrame(dic_flattened)

Output:

   id name.english name.french type.0  type.1  base.HP  base.Attack  base.Defense
0   1    Bulbasaur  Bulbizarre  Grass  Poison       45           49            49
1   2      Ivysaur  Herbizarre  Grass  Poison       60           62            63
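
To make the flattening step less of a black box, here is a minimal pure-Python sketch of the same idea. It is not the flatten_json implementation; `flatten_record` is a hypothetical helper written for illustration, and it assumes records contain only dicts, lists, and scalar values. Empty dicts and lists produce no columns in this sketch.

```python
def flatten_record(value, parent_key="", sep="."):
    """Flatten nested dicts and lists into a single-level dict (illustrative helper)."""
    items = {}
    if isinstance(value, dict):
        children = value.items()
    elif isinstance(value, list):
        # List positions become part of the key: type.0, type.1, ...
        children = enumerate(value)
    else:
        # Scalar leaf: store it under the key built so far.
        items[parent_key] = value
        return items
    for key, child in children:
        new_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        items.update(flatten_record(child, new_key, sep))
    return items


record = {
    "id": 1,
    "name": {"english": "Bulbasaur", "french": "Bulbizarre"},
    "type": ["Grass", "Poison"],
    "base": {"HP": 45, "Attack": 49, "Defense": 49},
}
flat = flatten_record(record)
print(flat)
```

Because the helper never looks at specific key names, it should behave the same way on files whose elements are named or ordered differently.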

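Using json_normalize will get you almost there, but splitting the list into separate columns takes an extra step:

# data is the parsed JSON list from the question
df = pd.json_normalize(data)
# name the new columns type.1, type.2, ...
f = lambda x: 'type.{}'.format(x + 1)
df = df.join(pd.DataFrame(df.pop('type').values.tolist()).rename(columns=f))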

print(df)

Output:

   id name.english name.french  ...  base.Defense  type.1  type.2
0   1    Bulbasaur  Bulbizarre  ...            49   Grass  Poison
1   2      Ivysaur  Herbizarre  ...            63   Grass  Poison

[2 rows x 8 columns]
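
The list-splitting step above can also be written out in plain Python, which may help when lists differ in length between records. This is only a sketch: the second record is a hypothetical example that is not in the question's data, and padding short lists with None is an assumption chosen to resemble the missing values pandas fills in for ragged rows.

```python
records = [
    {"id": 1, "type": ["Grass", "Poison"]},
    {"id": 4, "type": ["Fire"]},  # hypothetical record with a shorter list
]

# The widest list decides how many type.N columns are needed.
width = max(len(r["type"]) for r in records)

rows = []
for r in records:
    # Copy every field except the list being split.
    row = {k: v for k, v in r.items() if k != "type"}
    for i in range(width):
        # Numbering starts at 1 to match the type.1, type.2 columns above.
        row[f"type.{i + 1}"] = r["type"][i] if i < len(r["type"]) else None
    rows.append(row)

print(rows)
```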

I'd suggest a for loop coupled with a defaultdict. For iterations that don't involve aggregation, it is usually easier and faster to stay out of pandas until the final output:

from collections import defaultdict

df = defaultdict(list)

box = []
for entry in data: # data is the sample data you shared
    for key, value in entry.items():
        if isinstance(value, dict):
            # nested object: one column per child key
            temp = [(f"{key}.{k}", v) for k, v in value.items()]
        elif isinstance(value, list):
            # array: one column per position, numbered from 1
            temp = [(f"{key}.{k}", v) for k, v in enumerate(value, 1)]
        else:
            # scalar such as "id": keep it under its own name
            temp = [(key, value)]
        box.extend(temp)

# group the collected values by column name
for k, v in box:
    df[k].append(v)


df

defaultdict(list,
            {'id': [1, 2],
             'name.english': ['Bulbasaur', 'Ivysaur'],
             'name.french': ['Bulbizarre', 'Herbizarre'],
             'type.1': ['Grass', 'Grass'],
             'type.2': ['Poison', 'Poison'],
             'base.HP': [45, 60],
             'base.Attack': [49, 62],
             'base.Defense': [49, 63]})

Create the dataframe:

pd.DataFrame(df)

   id name.english name.french type.1  type.2  base.HP  base.Attack  base.Defense
0   1    Bulbasaur  Bulbizarre  Grass  Poison       45           49            49
1   2      Ivysaur  Herbizarre  Grass  Poison       60           62            63
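
On the question's last point, files with different element names and order, one possible approach is to collect the union of all column names before writing. The sketch below uses only the standard library; the rows are hypothetical, already-flattened records, and an in-memory buffer stands in for the real output file.

```python
import csv
import io

# Hypothetical flattened rows; the second one uses different keys and order.
rows = [
    {"id": 1, "name.english": "Bulbasaur", "type.0": "Grass", "type.1": "Poison"},
    {"type.0": "Fire", "id": 4, "base.HP": 39},
]

# Collect column names in first-seen order; dict keys keep insertion order.
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

buffer = io.StringIO()  # stands in for open("data.csv", "w", newline="")
# restval fills in columns that a given row does not have.
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```

Rows that lack a column get an empty cell, so every record lines up under the same header regardless of which keys it had.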
