
Converting list of dictionaries into Dataframe in sane way

The Zendesk API returns custom fields as a list of dictionaries, but each list belongs to a single record. I'm wondering if there is a better way to turn it all into a DataFrame. If it were a dictionary of dictionaries, json_normalize would handle it with no problem.

Caveat: not all records will have the same field IDs.

Sample data:

data = [{
  "ticket_id": 4,
  "customer_id": 8,
  "created_at": "2022-05-01",
  "custom_fields": [
    {
      "id": 15,
      "value": "website"
    },
    {
      "id": 16,
      "value": "broken"
    },
    {
      "id": 23,
      "value": None
    },
  ],
  'group_id': 42
}]

Running any form of DataFrame construction, whether from_records, from_json, or json_normalize, gives most of what I want, but with the list left in a single column:

t_df = pd.json_normalize(data)
t_df

Output:

   ticket_id  customer_id  created_at                                      custom_fields  group_id
0          4            8  2022-05-01  [{'id': 15, 'value': 'website'}, {'id': 16, 'v...        42

My current, probably ill-advised, solution is:

# sample_df is the normalized ticket frame (the pd.json_normalize(data) result above)
t_df = pd.DataFrame(sample_df.at[0, 'custom_fields']).T.reset_index(drop=True)
t_df.rename(columns=t_df.iloc[0], inplace=True)   # use the 'id' row as the column names
t_df.drop(0, inplace=True)                        # drop the now-redundant 'id' row
t_df.reset_index(drop=True, inplace=True)
pd.merge(left=sample_df, left_index=True,
         right=t_df, right_index=True).drop(columns='custom_fields')

This results in a correct record that I could append to a main dataframe:

   ticket_id  customer_id  created_at  group_id       15      16    23
0          4            8  2022-05-01        42  website  broken  None

My worry is that I need to do this for ~25,000 records, and this approach seems like it will be both slow and brittle (prone to breaking).

You should wrangle the data/dictionary first and only then construct a DataFrame with it. It will make your life easier and is faster than trying to manipulate the data with pandas after the DataFrame is created.

import pandas as pd

data = [{
  "ticket_id": 4,
  "customer_id": 8,
  "created_at": "2022-05-01",
  "custom_fields": [
    {
      "id": 15,
      "value": "website"
    },
    {
      "id": 16,
      "value": "broken"
    },
    {
      "id": 23,
      "value": None
    },
  ],
  'group_id': 42
}]

# pull the nested list out of the record, then flatten each
# {'id': ..., 'value': ...} pair into a plain key/value on the record
custom_fields = data[0].pop('custom_fields')
data[0].update({rec['id']: rec['value'] for rec in custom_fields})

t_df = pd.DataFrame(data)

Output:

>>> t_df 

   ticket_id  customer_id  created_at  group_id       15      16    23
0          4            8  2022-05-01        42  website  broken  None
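To apply this to the full export rather than a single record, the same flattening can be done per ticket before the DataFrame is built; tickets that lack a given field ID simply get NaN in that column. A minimal sketch, assuming every ticket dict nests its fields under 'custom_fields' as in the sample (the loop below is an illustration, not part of the original answer):

rows = []
for rec in data:                       # e.g. all ~25,000 ticket dicts
    rec = dict(rec)                    # shallow copy so the raw API payload isn't mutated
    custom_fields = rec.pop('custom_fields', [])
    rec.update({cf['id']: cf['value'] for cf in custom_fields})
    rows.append(rec)

t_df = pd.DataFrame(rows)              # missing field IDs become NaN in their columns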

It looks like pandas isn't automatically determining which fields are "metadata" and which are "records", so if your data structure is fixed, I would recommend hardcoding the following:

>>> t_df = pd.json_normalize(
...     data,
...     meta=["ticket_id", "customer_id", "created_at", "group_id"],
...     record_path=["custom_fields"]
... )

   id    value ticket_id customer_id  created_at group_id
0  15  website         4           8  2022-05-01       42
1  16   broken         4           8  2022-05-01       42
2  23     None         4           8  2022-05-01       42
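If you want one row per ticket (the shape shown in the question) rather than one row per custom field, the long output above can be pivoted back to wide form. A minimal sketch, assuming each ticket_id appears with a single combination of the meta columns and a pandas version where pivot accepts a list of index columns (1.1+):

wide = (
    t_df.pivot(
        index=["ticket_id", "customer_id", "created_at", "group_id"],
        columns="id",
        values="value",
    )
    .reset_index()
    .rename_axis(columns=None)   # drop the leftover "id" label on the columns axis
)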

Documentation: https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html

import pandas as pd
data = [{
  "ticket_id": 4,
  "customer_id": 8,
  "created_at": "2022-05-01",
  "custom_fields": [
    {
      "id": 15,
      "value": "website"
    },
    {
      "id": 16,
      "value": "broken"
    },
    {
      "id": 23,
      "value": None
    },
  ],
  'group_id': 42
}]
df = pd.DataFrame(data)
# expand each record's custom_fields list into one column per field id
for index in df.index:
    for i in df.loc[index, 'custom_fields']:
        df.loc[index, i['id']] = i['value']
df.drop(columns='custom_fields', inplace=True)
df
