Pandas: How to flatten lists of dicts within a list of dicts into dataframe, throwing error if any dict in nested list is missing any specified keys?

My ultimate goal is to flatten a key within a list of dicts into a dataframe. The key's value is also a list of dicts, and that list could be empty for any given record in the top-level list. I'm trying to do this quickly, so I'm trying to use vectorized operations in pandas, including json_normalize (which I assume is better than a loop).

In the resulting dataframe, I want to keep some top-level columns while flattening all of the keys in the list of nested dicts. I also want the operation to fail if there is any dict in any nested list that does not have all of the keys I specify. But I do not want the operation to fail if the key exists but is None. (This is why I can't just check for NaN after the normalization -- None is converted to NaN for float types in json_normalize, since it does not provide a dtype arg, so I wouldn't know whether a value was NaN because the key didn't exist, or because it did exist but was None.)

For example, I tried doing something like this:

data = [
  {
    'id': 1, 
    'topfield1': "1-1",
    'topfield2': "1-2",
    'topfield3': "1-3",
    'topfield4': "1-4",
    'payments': [
      {'id': 1, 'amt': 2.0, 'not_required': 'something'},
      {'id': 2, 'amt': 4.0}
    ]
  },
  {
    'id': 2, 
    'topfield1': "2-1",
    'topfield2': "2-2",
    'topfield3': "2-3",
    'topfield4': "2-4",
    'payments': [
      {'id': 1}
    ]
  },
  {
    'id': 3, 
    'topfield1': "3-1",
    'topfield2': "3-2",
    'topfield3': "3-3",
    'topfield4': "3-4",
    'payments': []
  }
]

# now flatten into one row for each item in each record's 'payments' key, keeping top-level 'id', 'topfield1', 'topfield4' and raising error if payments.id or payments.amt does not exist in a payment

#ideally, it would work like this:

# i want this to raise an error since data[1]['payments'][0] does not have key 'amt'. apparently that's not how json_normalize works -- it just throws a TypeError because apparently meta columns can't be from the record_path
pandas.io.json.json_normalize(data, record_path='payments', meta=['id', 'topfield4', 'topfield1', ['payments', 'id'], ['payments', 'amt']]) 
'''expected output
KeyError: 'payments.amt' or something like that
'''
'''actual output
TypeError: list indices must be integers or slices, not str
'''

data[1]['payments'][0]['amt'] = None
# now every 'payment' has 'id' and 'amt' keys so should succeed. but this still throws TypeError.
pandas.io.json.json_normalize(data, record_path='payments', meta=['id', 'topfield4', 'topfield1', ['payments', 'id'], ['payments', 'amt']]) 
'''expected output
id  topfield4  topfield1  payments.id  payments.amt  payments.not_required
1      1-4     1-1             1            2.0            something
1      1-4     1-1             2            4.0              NaN
2      2-4     2-1             1            NaN              NaN
'''
'''actual output
TypeError: list indices must be integers or slices, not str
'''

But this doesn't work. Whenever I use fields from the payments objects -- i.e. keys of the objects in the record_path list -- as fields in the meta arg, I get TypeError: list indices must be integers or slices, not str.

It also seems like json_normalize isn't smart enough to actually follow the paths you give it in the meta arg, as the following shows (though I guess I could just rename the relevant top-level columns to avoid the conflict):

# data as from above
pandas.io.json.json_normalize(data, record_path='payments', meta=['id'])
# fails with ValueError: Conflicting metadata name id, need distinguishing prefix
# (shouldn't it know `id` is top-level since it's not a nested path?)
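
The conflict arises because the top-level records and the nested payments both have an `id` key, so the result would need two columns with the same name. A minimal sketch of one way around it, assuming pandas 1.0 or later (where the function is also exposed as `pandas.json_normalize`): pass `record_prefix` so the nested columns get a distinguishing prefix. The `sample` list is a trimmed-down stand-in for the data above, not the original data.

```python
import pandas as pd

# Trimmed-down stand-in for the data above (illustrative values only).
sample = [
    {"id": 1, "topfield1": "1-1", "topfield4": "1-4",
     "payments": [{"id": 1, "amt": 2.0}, {"id": 2, "amt": 4.0}]},
    {"id": 2, "topfield1": "2-1", "topfield4": "2-4",
     "payments": [{"id": 1, "amt": None}]},
    {"id": 3, "topfield1": "3-1", "topfield4": "3-4", "payments": []},
]

# record_prefix renames the nested columns, so the nested 'id' no longer
# collides with the top-level 'id' requested through meta.
flat = pd.json_normalize(
    sample,
    record_path="payments",
    meta=["id", "topfield4", "topfield1"],
    record_prefix="payments.",
)
print(flat.columns.tolist())
```

Records whose `payments` list is empty contribute no rows. This only resolves the naming clash; it does not check whether each payment has the required keys.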

Is there a vectorized/fast way to accomplish what I want to do?

Thanks for any help.

EDIT 1: There could be many top-level fields and I want a subset of them in the final dataframe.

EDIT 2: Added more fields to make a minimal reproducible example.

IIUC, we can flatten your dicts with a simple dict comprehension, using each record's outer id as the key so that it becomes the index. We then concat the results, reading each record's payments list as a dataframe.

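import pandas as pd

# One frame per record, keyed by the outer id, then stacked into a single frame.
df = pd.concat(
    {d["id"]: pd.DataFrame.from_dict(d["payments"]) for d in data}, sort=False, axis=0
).reset_index(0).rename(columns={"id": "payments.id", "level_0": "id"})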

print(df)

   id  payments.id  amt not_required
0   1          1.0  2.0    something
1   1          2.0  4.0          NaN
0   2          1.0  NaN          NaN
