Parse a list of dictionaries with apply/lambda
I have a huge dataframe in which one column holds a list of dictionaries (the school history of several people). What I'm trying to do is parse this data into a new dataframe, because the relation is going to be 1 person to many schools.
My first option was to loop over the dataframe with itertuples(), but that was too slow!
Each list looks like this:
import pandas as pd

list_of_dicts = {
    0: '[]',
    1: "[{'name': 'USA Health', 'subject': 'Residency, Internal Medicine, 2006 - 2009'}, {'name': 'Ross University School of Medicine', 'subject': 'Class of 2005'}]",
    2: "[{'name': 'Physicians Medical Center Carraway', 'subject': 'Residency, Surgery, 1957 - 1960'}, {'name': 'Physicians Medical Center Carraway', 'subject': 'Internship, Transitional Year, 1954 - 1955'}, {'name': 'University of Alabama School of Medicine', 'subject': 'Class of 1954'}]"
}
df_dict = pd.DataFrame.from_dict(list_of_dicts, orient='index', columns=['school_history'])
My idea was to write a function and then apply it to the dataframe:
def parse_item(row):
    eval_dict = eval(row)[0]
    school_df = pd.DataFrame.from_dict(eval_dict, orient='index').T
    return school_df

df['column'].apply(lambda x: parse_item(x))
However, I can't figure out how to generate a dataframe bigger than the original (since one person can have multiple schools). Any ideas?
From those 3 rows, the idea is to end up with a dataframe of 5 rows (2 schools from row 1 and 3 from row 2; row 0 is empty).
Iterate over the column to convert each string into a Python list using ast.literal_eval(); the result is a nested list, which can be flattened inside the same comprehension.
from ast import literal_eval
pd.DataFrame([x for row in df_dict['school_history'] for x in literal_eval(row)])
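One thing the comprehension above drops is the link between each school and the person it came from. A minimal sketch of one way to keep it, assuming the row label identifies the person; the column name person_id is made up for illustration and is not from the original post:

```python
from ast import literal_eval

import pandas as pd

# Sample data in the same shape as the question: one string per person.
df_dict = pd.DataFrame(
    {
        "school_history": [
            "[]",
            "[{'name': 'USA Health', 'subject': 'Residency, Internal Medicine, 2006 - 2009'}, "
            "{'name': 'Ross University School of Medicine', 'subject': 'Class of 2005'}]",
        ]
    }
)

# Flatten in one pass, copying the row label into each record so every
# school can still be traced back to its person.
records = [
    {"person_id": idx, **school}  # "person_id" is a hypothetical column name
    for idx, row in df_dict["school_history"].items()
    for school in literal_eval(row)
]
schools = pd.DataFrame(records)
print(schools)
```

Rows with an empty list simply contribute no records, so no separate filtering step is needed.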
This does the trick using your sample data (thanks for the performance tip in the comments):
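import ast

# Parse each string into a real list, drop the empty ones, then explode to one dict per row.
list_df = df_dict.school_history.map(ast.literal_eval)
exploded = list_df[list_df.str.len() > 0].explode()
final = pd.DataFrame(list(exploded), index=exploded.index)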
This produces the following:
In [54]: final
Out[54]:
                                       name                                     subject
1                                USA Health   Residency, Internal Medicine, 2006 - 2009
1        Ross University School of Medicine                               Class of 2005
2        Physicians Medical Center Carraway             Residency, Surgery, 1957 - 1960
2        Physicians Medical Center Carraway  Internship, Transitional Year, 1954 - 1955
2  University of Alabama School of Medicine                               Class of 1954
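Because the exploded frame keeps the original row labels, it can be joined back to whatever table describes the people. A rough sketch, assuming a separate people frame that shares the same index; that frame, its person column, and the shortened sample strings are invented for illustration:

```python
import ast

import pandas as pd

# Hypothetical person table sharing the index of the history column.
people = pd.DataFrame({"person": ["A", "B", "C"]})
df_dict = pd.DataFrame(
    {
        "school_history": [
            "[]",
            "[{'name': 'X', 'subject': 'Class of 2005'}]",
            "[{'name': 'Y', 'subject': 'Class of 1954'}, {'name': 'Z', 'subject': 'Residency'}]",
        ]
    }
)

# Same steps as the answer: parse, drop empty lists, explode.
list_df = df_dict["school_history"].map(ast.literal_eval)
exploded = list_df[list_df.str.len() > 0].explode()
final = pd.DataFrame(list(exploded), index=exploded.index)

# Index-on-index join: each person row is repeated once per school.
joined = people.join(final, how="inner")
print(joined)
```

With how="inner", people without any school history are left out; how="left" would keep them with missing values instead.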
This will probably not be super fast given the amount of data, but parsing strings with nested objects inside will probably be pretty slow no matter what. You're probably better off parsing the file upstream first, then converting to pandas.
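As one possible reading of "parsing upstream", the strings can be converted while the file is loaded rather than afterwards. The sketch below assumes the data arrives as a CSV (the original post does not say) and uses an in-memory buffer as a stand-in for that file; it moves the literal_eval work into the read step but does not make it cheaper:

```python
import ast
import io

import pandas as pd

# Stand-in for the upstream file: a CSV with one stringified list per row.
raw = pd.DataFrame(
    {"school_history": ["[]", "[{'name': 'USA Health', 'subject': 'Class of 2005'}]"]}
)
buffer = io.StringIO()
raw.to_csv(buffer, index=False)
buffer.seek(0)

# converters runs literal_eval on each cell while the file is being read,
# so the column arrives as real Python lists instead of strings.
parsed = pd.read_csv(buffer, converters={"school_history": ast.literal_eval})

# The lists can then be flattened directly, with no further parsing.
schools = pd.DataFrame([s for row in parsed["school_history"] for s in row])
print(schools)
```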