Flatten JSON in Pandas DataFrame with Inconsistent Format

I'm working with a column in a pandas DataFrame that I would like to flatten into columns. The issue is that some cells hold a single dictionary and others hold a list of dictionaries.

So, for instance, the value in a cell of the column may look like this:

{'a': 'x', 'b': 'y', 'c': 'z'}

or like this:

[{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}]

I've been playing around with the standard json_normalize function, but it can't seem to handle the two different types of value:

json_flattened = pd.io.json.json_normalize(df.json_cell.apply(json.loads))

This gives me AttributeError: 'list' object has no attribute 'items'

What I'm hoping to get out the other end is a DataFrame whose columns repeat once for each element in the list, with NaN where a cell has fewer elements:

.-------.---.---.---.-----.-----.------.
| index | a | b | c | a_1 | b_1 | c_1  |
:-------+---+---+---+-----+-----+------:
|     0 | x | y | z | NaN | NaN | NaN  |
:-------+---+---+---+-----+-----+------:
|     1 | x | y | z | d   | e   | f    |
'-------'---'---'---'-----'-----'------'

Because of the inconsistent format in the column, you have no choice but to loop through the column and inspect each cell.

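import numpy as np
import pandas as pd

# One cell holds a single dict, the other a list of dicts.
data = pd.DataFrame({'json': [{'a': 'x', 'b': 'y', 'c': 'z'},
                              [{'a': 'x', 'b': 'y', 'c': 'z'}, {'a': 'd', 'b': 'e', 'c': 'f'}]]})

# Normalize each cell on its own; json_normalize accepts a dict or a list of dicts.
# pd.json_normalize is the top-level name since pandas 1.0
# (older versions: from pandas.io.json import json_normalize).
dflist = []
for cell in data['json']:
    norm = pd.json_normalize(cell)
    dflist.append(norm)

# The keys record which original row each normalized frame came from.
grouped = pd.concat(dflist, keys=np.arange(len(dflist)))

# level_0 is the original row, level_1 the position within that cell's list.
df2 = grouped.reset_index()

# Keyword arguments are required by pivot in pandas 2.0 and later.
df2 = df2.pivot(index='level_0', columns='level_1')

# Flatten the (name, position) column MultiIndex into names like 'a_0'.
df2.index.name = None
df2.columns.names = [None, None]
df2.columns = df2.columns.map('{0[0]}_{0[1]}'.format)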

#output

    a_0  a_1    b_0  b_1    c_0  c_1
0   x    NaN    y    NaN    z    NaN
1   x    d      y    e      z    f

We loop through the column, reading each cell into its own DataFrame. Next we concatenate them, assigning each one a key so that rows from the same cell stay grouped. Then we pivot on the keys and the inner index. Finally we flatten the multi-level columns generated by the pivot to get the names we need.
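
If the cells are still JSON strings, as in the question, and you want the exact column names from the desired output (a, b, c, a_1, b_1, c_1), an alternative to the pivot is to build one flat dictionary per cell. The following is only a minimal sketch under those assumptions: flatten_cell and flatten_json_column are hypothetical helper names, not pandas functions, and nested dictionaries inside a record are left as they are rather than flattened further.

```python
import json

import pandas as pd


def flatten_cell(cell):
    """Turn one cell (JSON string, dict, or list of dicts) into a flat dict."""
    if isinstance(cell, str):
        cell = json.loads(cell)
    # Treat a single dict as a one-element list so both shapes share one path.
    records = [cell] if isinstance(cell, dict) else list(cell)
    flat = {}
    for position, record in enumerate(records):
        # First record keeps the bare key; later ones get _1, _2, ...
        suffix = "" if position == 0 else f"_{position}"
        for key, value in record.items():
            flat[f"{key}{suffix}"] = value
    return flat


def flatten_json_column(series):
    """Build a DataFrame with one row per cell; missing keys become NaN."""
    return pd.DataFrame([flatten_cell(cell) for cell in series], index=series.index)


df = pd.DataFrame({"json_cell": [
    '{"a": "x", "b": "y", "c": "z"}',
    '[{"a": "x", "b": "y", "c": "z"}, {"a": "d", "b": "e", "c": "f"}]',
]})
json_flattened = flatten_json_column(df["json_cell"])
print(json_flattened)
```

This still inspects every cell in a Python loop, so it is no faster than the approach above; it only avoids the concat and pivot steps and keeps the original index.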
