Creating a DataFrame from a JSON file in Python

I have some JSON made up of multiple blocks in the format below. To keep it simple I show only one block here, so in this example the DataFrame would have a single row:

{
      "A": 1,
      "B": {
        "C": [
          {
            "D": 2,
            "E": 3
          }
        ],
        "F": {
          "G": 4,
          "H": 5
        }
      }
}

I want to create a DataFrame like this:

  A B.C.D B.C.E B.F.G B.F.H
1 1   2     3     4     5

When I try this:

import json
import pandas as pd

with open('some.json') as file:
    data = json.load(file)

df = pd.json_normalize(data)

I get something like this:

  A        B.C          B.F.G  B.F.H
1 1  [{"D":2,"E":3}]      4      5

So I can take the B.C column and break it into B.C.D and B.C.E:

df2 = pd.DataFrame(df['B.C'].tolist())
# Selecting [0] is the only way I found that works when the JSON has more than one block
df3 = df2[0].apply(pd.Series)

Then I concatenate the result with the previous DataFrame (and remove the B.C column), but it feels ugly, and since I do this a LOT I was wondering whether there is a cleaner or faster way.

Thanks in advance!
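
The manual steps described above (expand B.C, then concatenate and drop the original column) can be written somewhat more compactly. This is only a sketch, assuming every B.C cell holds a non-empty list of dicts; the names `exploded`, `inner`, and `out` are illustrative and not from the original post.

```python
import pandas as pd

data = {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}], 'F': {'G': 4, 'H': 5}}}

df = pd.json_normalize(data)

# One row per element of each B.C list (assumes the lists are non-empty).
exploded = df.explode('B.C').reset_index(drop=True)

# Expand the dicts into their own columns and restore the dotted prefix.
inner = pd.json_normalize(exploded['B.C'].tolist()).add_prefix('B.C.')

# Drop the raw column and attach the expanded columns side by side.
out = pd.concat([exploded.drop(columns='B.C'), inner], axis=1)
print(out)
```

An empty B.C list would turn into NaN after `explode`, so that case would need separate handling before the second `json_normalize` call.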

I guess you could write a recursive solution to preprocess the data. There might be an existing built-in solution, but I am not aware of one. You can test the performance of the following:

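import pandas as pd

def clean(data):
    clean_data = {}

    def parse(data, key=''):
        if isinstance(data, list):
            # List elements share the parent key, so with several elements
            # later values overwrite earlier ones.
            for elem in data:
                parse(elem, key=key)
        else:
            for k, v in data.items():
                if isinstance(v, (dict, list)):
                    parse(v, key=key + '.' + k)
                else:
                    # Leaf value: store it under its dotted path.
                    clean_data[(key + '.' + k).lstrip('.')] = v

    parse(data)
    return [clean_data]  # a one-element list, i.e. one DataFrame row

data = {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}], 'F': {'G': 4, 'H': 5}}}
print(pd.DataFrame(clean(data)))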

Output:

   A  B.C.D  B.C.E  B.F.G  B.F.H
0  1      2      3      4      5
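
On the question of a built-in option: when the location of the nested list is known in advance, `pd.json_normalize` accepts `record_path`, `meta`, and `record_prefix` arguments that can produce the same dotted columns. A hedged sketch, assuming the list always lives at B -> C and the scalar fields are exactly the ones listed in `meta`:

```python
import pandas as pd

data = {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}], 'F': {'G': 4, 'H': 5}}}

df = pd.json_normalize(
    data,
    record_path=['B', 'C'],                        # where the list of records sits
    meta=['A', ['B', 'F', 'G'], ['B', 'F', 'H']],  # scalar fields to carry along
    record_prefix='B.C.',                          # prefix for the record columns
)

# Sort the columns so the layout matches the desired one.
df = df[sorted(df.columns)]
print(df)
```

Unlike the recursive approach, this needs the paths spelled out up front, so it fits best when all blocks share one known structure.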

Write yourself a recursive function:

def get_values(iterable, root=""):
    if isinstance(iterable, dict):
        for key, values in iterable.items():
            if isinstance(values, (dict, list)):
                new_root = "{}.{}".format(root, key) if root else key
                yield from get_values(values, new_root)
            else:
                absolute_key = "{}.{}".format(root, key) if root else key
                yield absolute_key, values
    elif isinstance(iterable, list):
        for dct in iterable:
            yield from get_values(dct, root)

result = list(get_values(data))
print(result)

Which yields:

[('A', 1), ('B.C.D', 2), ('B.C.E', 3), ('B.F.G', 4), ('B.F.H', 5)]

To transform it into a DataFrame, use:

import pandas as pd

result = dict(get_values(data))
df = pd.DataFrame(result, index=[0])
print(df)

Which then yields:

   A  B.C.D  B.C.E  B.F.G  B.F.H
0  1      2      3      4      5
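
Since the question mentions many such blocks, the generator can be applied to each block and the resulting dicts passed to `pd.DataFrame` as rows. A minimal sketch, assuming the blocks arrive as a Python list named `records` (a hypothetical name) and that each nested list holds a single dict, so no key is produced twice within a block:

```python
import pandas as pd

def get_values(iterable, root=""):
    """Yield (dotted_key, value) pairs for every leaf in nested dicts/lists."""
    if isinstance(iterable, dict):
        for key, values in iterable.items():
            new_root = "{}.{}".format(root, key) if root else key
            if isinstance(values, (dict, list)):
                yield from get_values(values, new_root)
            else:
                yield new_root, values
    elif isinstance(iterable, list):
        for dct in iterable:
            yield from get_values(dct, root)

# Hypothetical input: two blocks with the same structure.
records = [
    {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}], 'F': {'G': 4, 'H': 5}}},
    {'A': 10, 'B': {'C': [{'D': 20, 'E': 30}], 'F': {'G': 40, 'H': 50}}},
]

# One flat dict per block becomes one row per block.
df = pd.DataFrame([dict(get_values(rec)) for rec in records])
print(df)
```

Passing a list of dicts avoids the `index=[0]` argument, which is only needed when building a frame from a single dict of scalars.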

You should check out flatten_json. It's the best way to flatten JSON with multiple record paths:

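import json

import flatten_json
import pandas as pd

with open('1.json') as f:
    data = json.load(f)

# Nested keys are joined with '_' and list positions become part of the key
dic_flattened = flatten_json.flatten(data)
df = pd.json_normalize(dic_flattened)
print(df)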

   A  B_C_0_D  B_C_0_E  B_F_G  B_F_H
0  1        2        3      4      5
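
The column names above contain the list index (B_C_0_D) because each list position becomes part of the key, so several elements in one list do not collide. The idea can be sketched in plain Python; `flatten_indexed` is a hypothetical helper written for illustration and is not the flatten_json implementation.

```python
def flatten_indexed(obj, sep='_', prefix=''):
    """Flatten nested dicts/lists, using list positions as key parts."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = enumerate(obj)
    else:
        # Leaf value: store it under the path built so far.
        return {prefix: obj}
    flat = {}
    for key, value in items:
        path = f'{prefix}{sep}{key}' if prefix else str(key)
        flat.update(flatten_indexed(value, sep, path))
    return flat

# Two elements in B.C, to show that neither one is overwritten.
data = {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}, {'D': 6, 'E': 7}], 'F': {'G': 4, 'H': 5}}}
flat = flatten_indexed(data, sep='.')
print(flat)
```

The trade-off is wider frames: every list element adds its own set of columns instead of an extra row.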
