Parse a list of dictionaries with apply/lambda

I have a huge dataframe in which one column holds a list of dictionaries (the school history of several people). I'm trying to parse this data into a new dataframe, because the relation is one person to many schools.

My first attempt was to loop over the dataframe with itertuples(), but that was too slow.

Each list is stored as a string and looks like this:

import pandas as pd

list_of_dicts = {
    0: '[]',
    1: "[{'name': 'USA Health', 'subject': 'Residency, Internal Medicine, 2006 - 2009'}, {'name': 'Ross University School of Medicine', 'subject': 'Class of 2005'}]",
    2: "[{'name': 'Physicians Medical Center Carraway', 'subject': 'Residency, Surgery, 1957 - 1960'}, {'name': 'Physicians Medical Center Carraway', 'subject': 'Internship, Transitional Year, 1954 - 1955'}, {'name': 'University of Alabama School of Medicine', 'subject': 'Class of 1954'}]"
}

df_dict = pd.DataFrame.from_dict(list_of_dicts, orient='index', columns=['school_history'])

My idea was to write a function and then apply it to the dataframe:

def parse_item(row):
    eval_dict = eval(row)[0]
    school_df = pd.DataFrame.from_dict(eval_dict, orient='index').T
    return school_df

df_dict['school_history'].apply(lambda x: parse_item(x))

However, I can't figure out how to generate a dataframe larger than the original (since one person can have multiple schools). Any ideas?

From those 3 rows, the idea is to get a dataframe with 5 rows, one per school (the 5 rows come from the 2 non-empty rows).

Iterate over the column and convert each string into a Python list using ast.literal_eval(); the result is a nested list, which can be flattened inside the same comprehension.

from ast import literal_eval
pd.DataFrame([x for row in df_dict['school_history'] for x in literal_eval(row)])
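For reference, here is a self-contained sketch of the same comprehension, run on the sample strings from the question. The names `history` and `schools` are only illustrative; the point is that the empty list in row 0 contributes nothing, so the result has one row per school.

```python
from ast import literal_eval

import pandas as pd

# Sample strings from the question, keyed by the original row label
history = pd.Series({
    0: "[]",
    1: "[{'name': 'USA Health', 'subject': 'Residency, Internal Medicine, 2006 - 2009'}, "
       "{'name': 'Ross University School of Medicine', 'subject': 'Class of 2005'}]",
    2: "[{'name': 'Physicians Medical Center Carraway', 'subject': 'Residency, Surgery, 1957 - 1960'}, "
       "{'name': 'Physicians Medical Center Carraway', 'subject': 'Internship, Transitional Year, 1954 - 1955'}, "
       "{'name': 'University of Alabama School of Medicine', 'subject': 'Class of 1954'}]",
}, name='school_history')

# literal_eval only accepts Python literals, unlike eval, which runs arbitrary code.
# The outer loop walks the rows, the inner loop walks the dicts of each parsed list.
schools = pd.DataFrame([school for row in history for school in literal_eval(row)])

print(schools.shape)
```

Note that this version gets a fresh 0..n-1 index, so the link back to the original row is lost unless it is carried along explicitly.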

This does the trick using your sample data (thanks for the performance tip in the comments):

import ast

list_df = df_dict.school_history.map(ast.literal_eval)
exploded = list_df[list_df.str.len() > 0].explode()
final = pd.DataFrame(list(exploded), index=exploded.index)

This produces the following:

In [54]: final
Out[54]:
                                       name                                     subject
1                                USA Health   Residency, Internal Medicine, 2006 - 2009
1        Ross University School of Medicine                               Class of 2005
2        Physicians Medical Center Carraway             Residency, Surgery, 1957 - 1960
2        Physicians Medical Center Carraway  Internship, Transitional Year, 1954 - 1955
2  University of Alabama School of Medicine                               Class of 1954
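Because the index of the result repeats the original row label, it can act as the key for the one-person-to-many-schools relation. Below is a rough sketch of making that explicit, assuming the original row label identifies the person; the column name `person_id` is hypothetical and not part of the original answer.

```python
import ast

import pandas as pd

# Sample strings from the question, keyed by the original row label
history = pd.Series({
    0: "[]",
    1: "[{'name': 'USA Health', 'subject': 'Residency, Internal Medicine, 2006 - 2009'}, "
       "{'name': 'Ross University School of Medicine', 'subject': 'Class of 2005'}]",
    2: "[{'name': 'Physicians Medical Center Carraway', 'subject': 'Residency, Surgery, 1957 - 1960'}, "
       "{'name': 'Physicians Medical Center Carraway', 'subject': 'Internship, Transitional Year, 1954 - 1955'}, "
       "{'name': 'University of Alabama School of Medicine', 'subject': 'Class of 1954'}]",
}, name='school_history')

# Same steps as the answer: parse, drop empty lists, explode, rebuild a frame
list_s = history.map(ast.literal_eval)
exploded = list_s[list_s.str.len() > 0].explode()
final = pd.DataFrame(list(exploded), index=exploded.index)

# Move the repeated index into a regular column ('person_id' is a hypothetical name)
schools = final.rename_axis('person_id').reset_index()

# Example use of the key: number of schools per person
schools_per_person = schools.groupby('person_id').size()
print(schools_per_person)
```

With the key in an ordinary column, the school table can be merged back onto the per-person table whenever the other columns are needed.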

This will probably not be very fast given the amount of data, but parsing a dictionary of strings with nested objects inside is likely to be slow no matter what. You're probably better off parsing the file upstream first, then converting to pandas.
