Converting list of dictionaries into Dataframe in sane way
The Zendesk API returns custom fields as a list of dictionaries, but each list belongs to a single record. I'm wondering if there is a better way to turn it all into a DataFrame. If it were a dictionary of dictionaries, then `json_normalize` would take care of it with no problem.

Caveat: not all records will have the same field IDs.

Sample data:
data = [{
"ticket_id": 4,
"customer_id": 8,
"created_at": "2022-05-01",
"custom_fields": [
{
"id": 15,
"value": "website"
},
{
"id": 16,
"value": "broken"
},
{
"id": 23,
"value": None
},
],
'group_id': 42
}]
Running any form of `DataFrame`, `from_records`, `from_json`, or `json_normalize` gives most of what I want, but with the list left in a single column:
t_df = pd.json_normalize(data)
t_df
Output:
|   | ticket_id | customer_id | created_at | custom_fields | group_id |
|---|---|---|---|---|---|
| 0 | 4 | 8 | 2022-05-01 | [{'id': 15, 'value': 'website'}, {'id': 16, 'v... | 42 |
My current, probably ill-advised, solution is:
sample_df = pd.json_normalize(data)
t_df = pd.DataFrame(sample_df.at[0, 'custom_fields']).T.reset_index(drop=True)
t_df.rename(columns=t_df.iloc[0], inplace=True)
t_df.drop(0, inplace=True)
t_df.reset_index(drop=True, inplace=True)
pd.merge(left=sample_df, left_index=True,
         right=t_df, right_index=True).drop(columns='custom_fields')
This results in a correct record that I could append to a main dataframe:
|   | ticket_id | customer_id | created_at | group_id | 15 | 16 | 23 |
|---|---|---|---|---|---|---|---|
| 0 | 4 | 8 | 2022-05-01 | 42 | website | broken | None |
My worry is that I need to do this for ~25,000 records, and this approach seems like it will be both slow and brittle (prone to breaking).
You should wrangle the data/dictionary first and only then construct a DataFrame with it. It will make your life easier, and it is faster than trying to manipulate the data with `pandas` after the DataFrame has been created.
import pandas as pd
data = [{
"ticket_id": 4,
"customer_id": 8,
"created_at": "2022-05-01",
"custom_fields": [
{
"id": 15,
"value": "website"
},
{
"id": 16,
"value": "broken"
},
{
"id": 23,
"value": None
},
],
'group_id': 42
}]
custom_fields = data[0].pop('custom_fields')
data[0].update({rec['id']: rec['value'] for rec in custom_fields})
t_df = pd.DataFrame(data)
Output:
>>> t_df
ticket_id customer_id created_at group_id 15 16 23
0 4 8 2022-05-01 42 website broken None
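The snippet above flattens a single record in place; for the ~25,000-ticket case, the same wrangling can be looped over every record before the DataFrame is ever built. A minimal sketch (`flatten_tickets` is a hypothetical helper, not part of the answer; pandas fills missing field IDs with NaN when it aligns the columns):

```python
import pandas as pd

def flatten_tickets(records):
    """Hoist each ticket's custom_fields entries up to top-level keys."""
    flat = []
    for rec in records:
        rec = dict(rec)  # shallow copy so the original dicts stay untouched
        for field in rec.pop("custom_fields", []):
            rec[field["id"]] = field["value"]
        flat.append(rec)
    return pd.DataFrame(flat)

tickets = [
    {"ticket_id": 4, "group_id": 42,
     "custom_fields": [{"id": 15, "value": "website"},
                       {"id": 16, "value": "broken"}]},
    {"ticket_id": 5, "group_id": 42,
     "custom_fields": [{"id": 15, "value": "email"},
                       {"id": 23, "value": "slow"}]},  # field 16 absent here
]
df = flatten_tickets(tickets)
# ticket 5 has no field 16, so that cell comes out as NaN
```

Because the dictionaries are merged before construction, records with differing field IDs simply produce NaN cells instead of breaking.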
It looks like pandas isn't automatically determining which fields are "metadata" and which are "records" --> if your data is fixed, I would recommend hardcoding the following:
>>> t_df = pd.json_normalize(
... data,
... meta=["ticket_id", "customer_id", "created_at", "group_id"],
... record_path=["custom_fields"]
... )
id value ticket_id customer_id created_at group_id
0 15 website 4 8 2022-05-01 42
1 16 broken 4 8 2022-05-01 42
2 23 None 4 8 2022-05-01 42
Documentation: https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
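If one row per ticket is the goal, the long output above can be pivoted back to wide form on the field `id`. A sketch using `DataFrame.pivot` (which accepts a list of index columns in pandas ≥ 1.1):

```python
import pandas as pd

data = [{
    "ticket_id": 4,
    "customer_id": 8,
    "created_at": "2022-05-01",
    "custom_fields": [
        {"id": 15, "value": "website"},
        {"id": 16, "value": "broken"},
        {"id": 23, "value": None},
    ],
    "group_id": 42,
}]

# long form: one row per custom field
long_df = pd.json_normalize(
    data,
    meta=["ticket_id", "customer_id", "created_at", "group_id"],
    record_path=["custom_fields"],
)

# wide form: one row per ticket, one column per field id
wide_df = long_df.pivot(
    index=["ticket_id", "customer_id", "created_at", "group_id"],
    columns="id",
    values="value",
).reset_index()
```

This keeps the metadata columns intact while each distinct field `id` becomes its own column.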
import pandas as pd
data = [{
"ticket_id": 4,
"customer_id": 8,
"created_at": "2022-05-01",
"custom_fields": [
{
"id": 15,
"value": "website"
},
{
"id": 16,
"value": "broken"
},
{
"id": 23,
"value": None
},
],
'group_id': 42
}]
df = pd.DataFrame(data)
for index in df.index:
    for i in df.loc[index, 'custom_fields']:
        df.loc[index, i['id']] = i['value']
df.drop(columns='custom_fields', inplace=True)
df
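The nested `loc` loop above assigns one cell at a time, which gets slow over ~25,000 rows. A common alternative (a sketch, not part of the answer) pops the `custom_fields` column and expands every list of `{'id': ..., 'value': ...}` dicts in one pass:

```python
import pandas as pd

data = [{
    "ticket_id": 4,
    "customer_id": 8,
    "created_at": "2022-05-01",
    "custom_fields": [
        {"id": 15, "value": "website"},
        {"id": 16, "value": "broken"},
        {"id": 23, "value": None},
    ],
    "group_id": 42,
}]

df = pd.DataFrame(data)
# turn each list of field dicts into one wide row, then expand to columns
fields = df.pop("custom_fields").apply(
    lambda lst: {f["id"]: f["value"] for f in lst}
).apply(pd.Series)
df = df.join(fields)
```

`join` aligns on the index, so tickets missing a given field ID simply get NaN in that column.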