Pandas: How to flatten lists of dicts within a list of dicts into dataframe, throwing error if any dict in nested list is missing any specified keys?
My ultimate goal is to flatten a key within a list of dicts into a dataframe. The key's value is also a list of dicts, and that list could be empty for any given record in the top-level list. I'm trying to do this quickly, so I'm trying to use vectorized operations in pandas, including json_normalize (which I assume is better than a loop).
In the resulting dataframe, I want to keep some top-level columns while flattening all of the keys in the list of nested dicts. I also want the operation to fail if there is any dict in any nested list that does not have all of the keys I specify. But I do not want the operation to fail if the key exists but is None. (This is why I can't just check for NaN after the normalization: None is converted to NaN for float types in json_normalize, since it does not provide a dtype arg, so I wouldn't know whether a value was NaN because the key didn't exist, or because it did exist but was None.)
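To make the ambiguity concrete, here is a minimal sketch, assuming pandas 1.0 or later (where json_normalize is available as pd.json_normalize). The payments list is a made-up sample, not the real data. After normalization, a missing key and an explicit None both count as missing, so only the raw dicts can tell them apart.

```python
import pandas as pd

# Hypothetical sample: three payments that differ only in how 'amt' appears.
payments = [
    {"id": 1, "amt": 2.0},   # key present with a value
    {"id": 2, "amt": None},  # key present, value is None
    {"id": 3},               # key missing entirely
]

flat = pd.json_normalize(payments)

# After normalization both the None and the missing key are reported as NA.
amt_is_na = flat["amt"].isna().tolist()
print(amt_is_na)  # [False, True, True]

# The raw dicts still carry the distinction.
key_present = ["amt" in p for p in payments]
print(key_present)  # [True, True, False]
```

This suggests doing any required-key check on the raw dicts before normalizing rather than on the resulting dataframe.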
For example, I tried doing something like this:
import pandas

data = [
    {
        'id': 1,
        'topfield1': "1-1",
        'topfield2': "1-2",
        'topfield3': "1-3",
        'topfield4': "1-4",
        'payments': [
            {'id': 1, 'amt': 2.0, 'not_required': 'something'},
            {'id': 2, 'amt': 4.0}
        ]
    },
    {
        'id': 2,
        'topfield1': "2-1",
        'topfield2': "2-2",
        'topfield3': "2-3",
        'topfield4': "2-4",
        'payments': [
            {'id': 1}
        ]
    },
    {
        'id': 3,
        'topfield1': "3-1",
        'topfield2': "3-2",
        'topfield3': "3-3",
        'topfield4': "3-4",
        'payments': []
    }
]
# Now flatten into one row for each item in each record's 'payments' list, keeping top-level 'id', 'topfield1', 'topfield4', and raising an error if payments.id or payments.amt does not exist in a payment.
# Ideally, it would work like this:
# I want this to raise an error since data[1]['payments'][0] does not have the key 'amt'. Apparently that's not how json_normalize works: it just throws a TypeError, apparently because meta columns can't come from the record_path.
pandas.io.json.json_normalize(data, record_path='payments', meta=['id', 'topfield4', 'topfield1', ['payments', 'id'], ['payments', 'amt']])
'''expected output
KeyError: 'payments.amt' or something like that
'''
'''actual output
TypeError: list indices must be integers or slices, not str
'''
data[1]['payments'][0]['amt'] = None
# now every 'payment' has 'id' and 'amt' keys so should succeed. but this still throws TypeError.
pandas.io.json.json_normalize(data, record_path='payments', meta=['id', 'topfield4', 'topfield1', ['payments', 'id'], ['payments', 'amt']])
'''expected output
id  topfield4  topfield1  payments.id  payments.amt  payments.not_required
1   1-4        1-1        1            2.0           something
1   1-4        1-1        2            4.0           NaN
2   2-4        2-1        1            NaN           NaN
'''
'''actual output
TypeError: list indices must be integers or slices, not str
'''
But this doesn't work. Whenever I use fields from the payments objects (i.e. keys of the objects in the record_path list) as fields in the meta arg, I get TypeError: list indices must be integers or slices, not str.
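One plausible reading of this error, assuming json_normalize resolves each meta path by indexing into the top-level record one key at a time: the path ['payments', 'id'] first reaches the payments list and then tries to index that list with the string 'id'. The plain-Python sketch below reproduces the same kind of failure without pandas. The record variable is a made-up sample.

```python
# Hypothetical top-level record with a nested list of payments.
record = {"id": 1, "payments": [{"id": 1, "amt": 2.0}]}

message = None
try:
    # Following the path ['payments', 'id'] key by key from the top-level record:
    # record['payments'] is a list, so indexing it with a string fails.
    record["payments"]["id"]
except TypeError as exc:
    message = str(exc)

print(message)
```

If that reading is right, meta can only name fields that are reachable by dict keys from the top-level record, and the keys of the payment dicts have to come through record_path instead.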
It also seems like json_normalize isn't smart enough to actually follow the paths you give it in the meta arg (though I guess I could just rename the relevant top-level columns to avoid this), since:
# data as from above
pandas.io.json.json_normalize(data, record_path='payments', meta=['id'])
# fails with ValueError: Conflicting metadata name id, need distinguishing prefix
# (shouldn't it know `id` is top-level since it's not a nested path?)
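The "need distinguishing prefix" hint in that message points at the record_prefix parameter of json_normalize. The sketch below assumes pandas 1.0 or later (pd.json_normalize) and uses a trimmed, made-up sample rather than the full data above. Prefixing the record columns keeps the nested id from colliding with the top-level id.

```python
import pandas as pd

# Trimmed hypothetical sample: top-level 'id' plus nested payments with their own 'id'.
sample = [
    {"id": 1, "payments": [{"id": 1, "amt": 2.0}, {"id": 2, "amt": 4.0}]},
    {"id": 2, "payments": [{"id": 1}]},
    {"id": 3, "payments": []},  # empty list contributes no rows
]

flat = pd.json_normalize(
    sample,
    record_path="payments",
    meta=["id"],
    record_prefix="payments.",  # nested columns become payments.id, payments.amt
)
print(flat)
```

Note that the record with an empty payments list disappears from the result, and the payment without 'amt' silently becomes NaN, so this alone does not enforce the required keys.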
Is there a vectorized/fast way to accomplish what I want to do?
Thanks for any help.
EDIT 1: There could be many top-level fields, and I want a subset of them in the final dataframe.
EDIT 2: Added more fields to make a minimal reproducible example.
IIUC, we can flatten your dict with a simple comprehension, using each outer id as a key so that it ends up in the index. We then concat while reading each record's payments as a dataframe.
import pandas as pd
# Build one dataframe per record, keyed by the outer id, then stack them.
df = pd.concat(
    {d["id"]: pd.DataFrame.from_dict(d["payments"]) for d in data}, sort=False, axis=0
).reset_index(0).rename(columns={"id": "payments.id", "level_0": "id"})
print(df)
   id  payments.id  amt not_required
0   1          1.0  2.0    something
1   1          2.0  4.0          NaN
0   2          1.0  NaN          NaN
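The output above does not carry the other top-level fields and does not raise on a missing key. As a rough sketch of one way to cover both, assuming pandas 1.0 or later (pd.json_normalize): check the raw dicts first, then normalize with record_path, meta and record_prefix. The names flatten_payments, REQUIRED_KEYS and sample are hypothetical, and the check is a plain Python loop rather than a vectorized operation.

```python
import pandas as pd

REQUIRED_KEYS = {"id", "amt"}  # hypothetical: keys every payment must contain


def flatten_payments(records, meta_fields, required=REQUIRED_KEYS):
    """Raise KeyError if any payment lacks a required key, else flatten."""
    for rec_pos, record in enumerate(records):
        for pay_pos, payment in enumerate(record["payments"]):
            # Checks key presence only, so a key whose value is None passes.
            missing = set(required).difference(payment)
            if missing:
                raise KeyError(
                    f"records[{rec_pos}]['payments'][{pay_pos}] "
                    f"is missing {sorted(missing)}"
                )
    return pd.json_normalize(
        records,
        record_path="payments",
        meta=meta_fields,
        record_prefix="payments.",  # avoids the clash with the top-level 'id'
    )


# Hypothetical sample shaped like the question's data.
sample = [
    {"id": 1, "topfield1": "1-1", "topfield4": "1-4",
     "payments": [{"id": 1, "amt": 2.0, "not_required": "something"},
                  {"id": 2, "amt": 4.0}]},
    {"id": 2, "topfield1": "2-1", "topfield4": "2-4",
     "payments": [{"id": 1, "amt": None}]},  # None is allowed
    {"id": 3, "topfield1": "3-1", "topfield4": "3-4", "payments": []},
]

flat = flatten_payments(sample, ["id", "topfield4", "topfield1"])
print(flat)
```

Validating before normalizing keeps "key missing" and "key present but None" apart, which the dataframe cannot do afterwards. Whether this is fast enough depends on the data size; it makes one linear pass over the payments before normalizing.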