Creating a DataFrame from a JSON file in Python
I have some JSON with multiple blocks in this format (only one is shown here to keep it simple, so in this example the DataFrame would have only one row):
{
    "A": 1,
    "B": {
        "C": [
            {
                "D": 2,
                "E": 3
            }
        ],
        "F": {
            "G": 4,
            "H": 5
        }
    }
}
And I want to create a DataFrame like this:
   A  B.C.D  B.C.E  B.F.G  B.F.H
1  1      2      3      4      5
When I try this:
import json
import pandas as pd

with open('some.json') as file:
    data = json.load(file)
df = pd.json_normalize(data)
I get something like this:
   A              B.C  B.F.G  B.F.H
1  1  [{"D":2,"E":3}]      4      5
So... I can take the column B.C and break it into B.C.D and B.C.E:
df2 = pd.DataFrame(df['B.C'].tolist())
df3 = df2[0].apply(pd.Series)  # Taking [0] is the only way I got this to work when the JSON has more than one block
Then I concatenate the result with the previous DataFrame (and remove the B.C column), but it feels ugly, and since I'm doing this a LOT I was wondering if there's a cleaner/faster way.
Well, thanks in advance!
I guess you could write a recursive solution to preprocess the data. There might be an existing built-in solution, but I am not aware of one. You can test the performance of the following:
import pandas as pd

def clean(data):
    clean_data = {}

    def parse(data, key=''):
        if isinstance(data, list):
            # Lists do not add to the key; each element is parsed under the same prefix
            for elem in data:
                parse(elem, key=key)
        else:
            for k, v in data.items():
                if isinstance(v, (dict, list)):
                    parse(v, key=key + '.' + k)
                else:
                    clean_data[(key + '.' + k).lstrip('.')] = v

    parse(data)
    return [clean_data]

data = {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}], 'F': {'G': 4, 'H': 5}}}
print(pd.DataFrame(clean(data)))
Output:
   A  B.C.D  B.C.E  B.F.G  B.F.H
0  1      2      3      4      5
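On the built-in option: pandas' json_normalize accepts record_path and meta arguments that can expand the B.C list directly. This is only a minimal sketch, assuming the key names are known in advance and every block has the shape shown in the question; the column reordering at the end is purely cosmetic.

```python
import pandas as pd

data = {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}], 'F': {'G': 4, 'H': 5}}}

# record_path points at the list to expand into rows;
# meta lists the scalar fields to repeat alongside each expanded row.
df = pd.json_normalize(
    data,
    record_path=['B', 'C'],
    meta=['A', ['B', 'F', 'G'], ['B', 'F', 'H']],
    record_prefix='B.C.',
)

# Reorder the columns to match the layout asked for in the question.
df = df[['A', 'B.C.D', 'B.C.E', 'B.F.G', 'B.F.H']]
print(df)
```

Because the meta paths have to be spelled out, this suits a fixed schema; for arbitrary nesting a recursive flattener is probably the more practical route.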
Write yourself a recursive function:
def get_values(iterable, root=""):
    if isinstance(iterable, dict):
        for key, values in iterable.items():
            if isinstance(values, (dict, list)):
                # Nested container: extend the dotted path and recurse
                new_root = "{}.{}".format(root, key) if root else key
                yield from get_values(values, new_root)
            else:
                absolute_key = "{}.{}".format(root, key) if root else key
                yield absolute_key, values
    elif isinstance(iterable, list):
        # List elements keep the parent's path
        for dct in iterable:
            yield from get_values(dct, root)

data = {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}], 'F': {'G': 4, 'H': 5}}}
result = list(get_values(data))
print(result)
Which yields
[('A', 1), ('B.C.D', 2), ('B.C.E', 3), ('B.F.G', 4), ('B.F.H', 5)]
To transform it into a DataFrame, use:
import pandas as pd

result = dict(get_values(data))
df = pd.DataFrame(result, index=[0])
print(df)
Which then yields
   A  B.C.D  B.C.E  B.F.G  B.F.H
0  1      2      3      4      5
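Since the question mentions several blocks, the same generator can be applied once per block to get one row each. The sketch below restates the function compactly so it runs on its own; the `blocks` list and its second entry are made-up sample data, not taken from the question.

```python
import pandas as pd

def get_values(node, root=""):
    """Yield (dotted_key, value) pairs for every scalar in a nested dict/list."""
    if isinstance(node, dict):
        for key, value in node.items():
            path = f"{root}.{key}" if root else key
            if isinstance(value, (dict, list)):
                yield from get_values(value, path)
            else:
                yield path, value
    elif isinstance(node, list):
        for item in node:
            yield from get_values(item, root)

# Hypothetical input: two blocks with the same shape as the example.
blocks = [
    {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}], 'F': {'G': 4, 'H': 5}}},
    {'A': 6, 'B': {'C': [{'D': 7, 'E': 8}], 'F': {'G': 9, 'H': 10}}},
]

# One flat dict per block gives one DataFrame row per block.
df = pd.DataFrame([dict(get_values(block)) for block in blocks])
print(df)
```

Building the DataFrame from a list of dicts also means the index=[0] argument is no longer needed.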
You should check out the flatten_json package. It is a convenient way to flatten JSON with multiple record paths:
import json
import pandas as pd
import flatten_json

with open('1.json') as f:
    data = json.load(f)

# flatten() returns a single flat dict; list positions become part of the key
dic_flattened = flatten_json.flatten(data)
df = pd.json_normalize(dic_flattened)
print(df)
   A  B_C_0_D  B_C_0_E  B_F_G  B_F_H
0  1        2        3      4      5
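Note that the column names above include the list position (B_C_0_D). That matters once a list holds more than one element: the two recursive answers reuse the same key for every element, so later elements overwrite earlier ones. The following is a rough pure-Python sketch of index-keeping keys; `flatten_with_index` is a hypothetical helper written for illustration, not the flatten_json implementation, and the second element of "C" is invented sample data.

```python
def flatten_with_index(node, prefix="", sep="_"):
    """Return a flat dict; list positions become part of the key."""
    if isinstance(node, dict):
        items = node.items()
    elif isinstance(node, list):
        items = enumerate(node)
    else:
        # Scalar reached: store it under the accumulated key.
        return {prefix: node}
    flat = {}
    for key, value in items:
        new_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten_with_index(value, new_prefix, sep))
    return flat

# Hypothetical data: the example block with a second element in "C".
data = {'A': 1, 'B': {'C': [{'D': 2, 'E': 3}, {'D': 20, 'E': 30}],
                      'F': {'G': 4, 'H': 5}}}
flat = flatten_with_index(data)
print(flat)
```

Each list element gets its own set of columns here, so nothing is lost, at the cost of wider frames when lists are long.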