Convert json column into standard pandas dataframe
I have a pandas dataframe with a column in json format like below.
| id | date | gender | response |
|---|---|---|---|
| 1 | 1/14/2021 | M | "{'score':3,'reason':{'description':array(['a','b','c'])}" |
| 2 | 5/16/2020 | F | "{'score':4,'reason':{'description':array(['x','y','z'])}" |
I want to convert this into a dataframe by flattening the dictionary in the response column. The dictionary is stored as a string in the database.
Is there an easy way in Python to convert the response column into a dictionary object and then flatten it to a dataframe like this:
| id | date | gender | score | description |
|---|---|---|---|---|
| 1 | 1/14/2021 | M | 3 | a |
| 1 | 1/14/2021 | M | 3 | b |
| 1 | 1/14/2021 | M | 3 | c |
| 2 | 5/16/2020 | F | 4 | x |
| 2 | 5/16/2020 | F | 4 | y |
| 2 | 5/16/2020 | F | 4 | z |
Given the dataframe you provided:
import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2],
        "date": ["1/14/2021", "5/16/2020"],
        "gender": ["M", "F"],
        "response": [
            "{'score':3,'reason':{'description':array(['a','b','c'])}",
            "{'score':4,'reason':{'description':array(['x','y','z'])}",
        ],
    }
)
You can define a function to flatten the values in the response column:
def flatten(data, new_data):
    """Recursive helper function.

    Args:
        data: nested dictionary.
        new_data: empty dictionary.

    Returns:
        Flattened dictionary.
    """
    for key, value in data.items():
        # Lists are assumed to contain dictionaries
        if isinstance(value, list):
            for item in value:
                flatten(item, new_data)
        # Nested dictionaries are flattened recursively
        if isinstance(value, dict):
            flatten(value, new_data)
        # Leaf values are copied under their own key
        if isinstance(value, (str, int, ndarray)):
            new_data[key] = value
    return new_data
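To make the helper's behavior concrete, here is a minimal sketch that runs it on a single nested dictionary. The `sample` value is a hypothetical stand-in for one already-parsed entry of the response column, and the function is repeated only so that the snippet runs on its own.

```python
import numpy as np


def flatten(data, new_data):
    """Copy the leaf values of a nested dictionary into new_data."""
    for key, value in data.items():
        if isinstance(value, list):
            for item in value:
                flatten(item, new_data)
        if isinstance(value, dict):
            flatten(value, new_data)
        if isinstance(value, (str, int, np.ndarray)):
            new_data[key] = value
    return new_data


# Hypothetical sample mirroring one parsed "response" value
sample = {"score": 3, "reason": {"description": np.array(["a", "b", "c"])}}
flat = flatten(sample, {})

# The intermediate "reason" level is gone; only the leaf keys remain
print(list(flat))
print(flat["score"], flat["description"].tolist())
```

Note that intermediate keys are dropped, so two leaves with the same name at different nesting levels would overwrite each other, and the list branch only works when the list items are themselves dictionaries.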
Then proceed as follows, using NumPy ndarrays to handle the arrays and Python's built-in eval function to build dictionaries from the strings in the response column:
import numpy as np
from numpy import ndarray

# The example strings are missing a closing curly brace, hence the + "}"
# eval executes arbitrary code, so only use it on strings you trust
df["response"] = df["response"].apply(
    lambda x: flatten(eval(x.replace("array", "np.array") + "}"), {})
)

# For each row, flatten the nested dict, make a dataframe of it
# and concat it with the non-nested columns
# Then, concat all the new dataframes
new_df = pd.concat(
    [
        pd.concat(
            [
                pd.DataFrame(df.loc[idx, :]).T.drop(columns="response"),
                pd.DataFrame(df.loc[idx, "response"]).reset_index(drop=True),
            ],
            axis=1,
        ).ffill()  # fillna(method="ffill") is deprecated in recent pandas
        for idx in df.index
    ]
).reset_index(drop=True)
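If the strings come from a source you do not fully control, a possible alternative to `eval` is to rewrite each `array([...])` as a plain list literal and parse it with `ast.literal_eval`, then let `DataFrame.explode` produce one row per description. This is only a sketch under the assumption that every string has the same shape as the two samples; `parse_response` and `alt_df` are hypothetical names, not part of the answer above.

```python
import ast
import re

import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2],
        "date": ["1/14/2021", "5/16/2020"],
        "gender": ["M", "F"],
        "response": [
            "{'score':3,'reason':{'description':array(['a','b','c'])}",
            "{'score':4,'reason':{'description':array(['x','y','z'])}",
        ],
    }
)


def parse_response(text):
    """Parse one response string into a dictionary without eval."""
    # Turn "array([...])" into the plain list literal "[...]"
    cleaned = re.sub(r"array\((\[.*?\])\)", r"\1", text)
    # The sample strings lack a final "}"; append only what is missing
    missing = cleaned.count("{") - cleaned.count("}")
    cleaned += "}" * max(missing, 0)
    return ast.literal_eval(cleaned)


parsed = df["response"].apply(parse_response)

alt_df = (
    df.drop(columns="response")
    .assign(
        score=parsed.apply(lambda d: d["score"]),
        description=parsed.apply(lambda d: d["reason"]["description"]),
    )
    .explode("description")  # one row per list element
    .reset_index(drop=True)
)
print(alt_df)
```

Because `explode` keeps the elements in list order, the descriptions within each id appear in the same order as in the source string.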
So that:
print(new_df)
# Output
id date gender score description
0 1 1/14/2021 M 3 a
1 1 1/14/2021 M 3 b
2 1 1/14/2021 M 3 c
3 2 5/16/2020 F 4 y
4 2 5/16/2020 F 4 x
5 2 5/16/2020 F 4 z