Creating a priority system in Python Pandas

Question

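I have a large dataset that lists every degree a person holds, together with the year each degree was completed. Each person also has a corresponding ID. I am trying to work out each person's year of birth from the year a degree was completed and the average age at completion.

For the average ages, I assume 33 for a PhD, 30 for a master's degree and 22 for a bachelor's degree. The dataset looks like this: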

person_id   degree                       degree_completion   year_of_birth
1           PhD                          2006                1973
1           BSc                          1999                1977
2           Ph.D.                        1995                1962
2           MBA                          2000                1970
2           B.A.                         1987                1965
3           Bachelor of Engineering      2005                1983
4           AB                           1997                1975
4           Doctor of Philosophy (PhD)   2003                1970

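I have already built the part that calculates a year of birth for each row, but I cannot figure out how to create a priority system that picks the correct year of birth, since a different year can be calculated from each of a person's degrees. I want the following priority order: year of birth from the bachelor's degree > year of birth from the PhD > year of birth from the master's degree.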

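I have tried a number of approaches using groupby and the Categorical data type. The dataset also contains hundreds of different spellings of the degrees, so I have been relying on regular expressions both to calculate the year of birth and to build the priority system. This is what I have so far, but I cannot find a way to make the regular expressions work here: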

category1 = "^B[a-z]*|AB|A.B.|A.B|S.B."
category2 = "^P[a-z]*|Doctor of Philosophy[a-z]*"
category3 = "^M[a-z]*|Master[a-z]*"

file['edu_degree'] = pd.Categorical(file['edu_degree'], ordered=True, categories=[category1, category2, category3])

file.groupby('person_id')['edu_degree'].transform('max')

This is the output I want (year of birth replaced according to the priority):

person_id   degree                       degree_completion   year_of_birth
1           PhD                          2006                1977
1           BSc                          1999                1977
2           Ph.D.                        1995                1965
2           MBA                          2000                1965
2           B.A.                         1987                1965
3           Bachelor of Engineering      2005                1983
4           AB                           1997                1975
4           Doctor of Philosophy (PhD)   2003                1975

Answer 1

Here is one idea, possibly not the most elegant one (assuming your frame is named df):

import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         + r"(PhD|Ph\.D\.|Doctor.*)|"
                         + r"(MBA|Master.*))$")
def category(match):
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

# regex=True is required on pandas >= 2.0, where str.replace defaults to regex=False
df['degree_cat'] = df.degree.str.strip().str.replace(re_category, category, regex=True).astype(int)
df_year_of_birth = (df.sort_values('degree_cat').groupby('person_id', as_index=False)
                      .year_of_birth.first())
df = df.drop(columns=['degree_cat','year_of_birth']).merge(df_year_of_birth, on='person_id')

Some explanation:

Step 1: Some restructuring of your regular expressions.

import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         + r"(PhD|Ph\.D\.|Doctor.*)|"
                         + r"(MBA|Master.*))$")

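I adjusted the pattern so that (1) it matches the complete entry in the degree column, (2) it covers more variants, and (3) the . characters are escaped. You may need to tweak it further! I wrapped each category in a capture group and joined the groups with |.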
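
To see which capture group a given degree string falls into, the pattern can be tried on a few values with plain re. This is only a sketch: the helper degree_category and the "Diploma" sample are made up for illustration and are not part of the answer.

```python
import re

# Same pattern as in the answer: one capture group per category.
re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         r"(PhD|Ph\.D\.|Doctor.*)|"
                         r"(MBA|Master.*))$")


def degree_category(degree):
    """Hypothetical helper: return 1, 2 or 3 for a recognised degree, else None."""
    match = re_category.match(degree.strip())
    if match is None:
        return None
    # Only one of the three groups takes part in a match,
    # so lastindex is the number of that group.
    return match.lastindex


samples = ["PhD", "BSc", "Ph.D.", "MBA", "B.A.", "Bachelor of Engineering",
           "AB", "Doctor of Philosophy (PhD)", "Diploma"]
categories = {s: degree_category(s) for s in samples}
print(categories)
```

Strings that come back as None are the spellings the pattern does not cover yet, which is a quick way to find out where it needs extending.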

Step 2: Create the degree_cat column (= the category of the corresponding degree).

def category(match):
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

# regex=True is required on pandas >= 2.0, where str.replace defaults to regex=False
df['degree_cat'] = df.degree.str.strip().str.replace(re_category, category, regex=True).astype(int)

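I use category as the repl function, which replaces each match with its category number. See the Series.str.replace documentation for how a callable repl works. The strip is just a precaution. The resulting column for your sample looks like this: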

0    2
1    1
2    2
3    3
4    1
5    1
6    1
7    2
Name: degree_cat, dtype: int64
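
The callable replacement behaves the same way with re.sub from the standard library, which may make it easier to experiment with. The sketch below uses a shortened, illustrative pattern rather than the full one, and the "Diploma" value is a made-up example of an unrecognised degree.

```python
import re

# Shortened pattern for illustration only; the answer's pattern has more alternatives.
pattern = re.compile(r"^(?:(BSc|AB)|(PhD)|(MBA))$")


def category(match):
    # match.group(1, 2, 3) is a tuple holding None for every group that did not match
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)


replaced = [pattern.sub(category, d) for d in ["PhD", "BSc", "MBA", "Diploma"]]
print(replaced)  # ['2', '1', '3', 'Diploma']
```

A string that does not match is left unchanged, so on real data a following astype(int) would raise for any degree the pattern misses. That failure can be useful as a signal that the pattern needs another alternative.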

Step 3: Select the desired year of birth.

df_year_of_birth = (df.sort_values('degree_cat').groupby('person_id', as_index=False)
                      .year_of_birth.first())

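Here df is sorted by the new column and grouped by person_id, and then the first year_of_birth of each group is picked (thanks to the sorting, this is the desired year). The result for your sample: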

   person_id  year_of_birth
0          1           1977
1          2           1965
2          3           1983
3          4           1975

Step 4: Replace the values of year_of_birth with the values from df_year_of_birth.

df = df.drop(columns=['degree_cat','year_of_birth']).merge(df_year_of_birth, on='person_id')

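The old year_of_birth and degree_cat columns are dropped because they are no longer needed, and df_year_of_birth is merged onto df along person_id to recreate the year_of_birth column with the correct values.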

Final result:

   person_id                      degree  degree_completion  year_of_birth
0          1                         PhD               2006           1977
1          1                         BSc               1999           1977
2          2                       Ph.D.               1995           1965
3          2                         MBA               2000           1965
4          2                        B.A.               1987           1965
5          3     Bachelor of Engineering               2005           1983
6          4                          AB               1997           1975
7          4  Doctor of Philosophy (PhD)               2003           1975
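
As a possible variation on steps 3 and 4, the selected year can be broadcast back to every row with transform instead of a separate merge. This is a sketch, not the answer's method: the frame below is rebuilt by hand from the sample tables above, with degree_cat already filled in.

```python
import pandas as pd

# Sample data copied from the tables above (degree_cat as produced in step 2).
df = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "degree_cat": [2, 1, 2, 3, 1, 1, 1, 2],
    "year_of_birth": [1973, 1977, 1962, 1970, 1965, 1983, 1975, 1970],
})

# Sort by priority, take the first year per person and repeat it for each row
# of that person. The assignment aligns on the index, so the original row
# order of df is kept.
df["year_of_birth"] = (
    df.sort_values("degree_cat", kind="stable")
      .groupby("person_id")["year_of_birth"]
      .transform("first")
)
print(df["year_of_birth"].tolist())
# [1977, 1977, 1965, 1965, 1965, 1983, 1975, 1975]
```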

Answer 2

Here is one possible solution, perhaps not the most elegant one, but it gets the job done:

# define custom function to get the correct years
def find_best_year(df):

    cond1 = df['degree'].str.match(category1)
    cond2 = df['degree'].str.match(category2)
    cond3 = df['degree'].str.match(category3)
    
    if cond1.any():
        return df.loc[cond1, 'year_of_birth']
    elif cond2.any():
        return df.loc[cond2, 'year_of_birth']
    elif cond3.any():
        return df.loc[cond3, 'year_of_birth']
    else:
        raise ValueError("No condition was found.")


# create lookup table with best years
lookup_df = file\
    .groupby('person_id')\
    .apply(find_best_year)\
    .reset_index()\
    .drop(columns=['level_1'])
print(lookup_df)
#    person_id  year_of_birth
# 0          1           1977
# 1          2           1965
# 2          3           1983
# 3          4           1975


# desired output
file\
    .drop(columns=['year_of_birth'])\
    .merge(lookup_df, on='person_id', how='left')
#    person_id                      degree  degree_completion  year_of_birth
# 0          1                         PhD               2006           1977
# 1          1                         BSc               1999           1977
# 2          2                       Ph.D.               1995           1965
# 3          2                         MBA               2000           1965
# 4          2                        B.A.               1987           1965
# 5          3     Bachelor of Engineering               2005           1983
# 6          4                          AB               1997           1975
# 7          4  Doctor of Philosophy (PhD)               2003           1975
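
If a person can hold two degrees of the same category, find_best_year returns two rows for that person and the merge would duplicate rows. One way around this, sketched here under the assumption that the first row of the best category is good enough, is to turn the three conditions into a numeric priority and keep exactly one row per person with idxmin. The priority column and the lookup name are made up for this example.

```python
import numpy as np
import pandas as pd

category1 = r"^B[a-z]*|AB|A.B.|A.B|S.B."
category2 = r"^P[a-z]*|Doctor of Philosophy[a-z]*"
category3 = r"^M[a-z]*|Master[a-z]*"

file = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "degree": ["PhD", "BSc", "Ph.D.", "MBA", "B.A.",
               "Bachelor of Engineering", "AB", "Doctor of Philosophy (PhD)"],
    "degree_completion": [2006, 1999, 1995, 2000, 1987, 2005, 1997, 2003],
    "year_of_birth": [1973, 1977, 1962, 1970, 1965, 1983, 1975, 1970],
})

# 1 = bachelor, 2 = PhD, 3 = master, 4 = no pattern matched (lowest priority).
conditions = [file["degree"].str.match(c) for c in (category1, category2, category3)]
file["priority"] = np.select(conditions, [1, 2, 3], default=4)

# idxmin returns one index label per person: the first row with the best priority.
best_rows = file.groupby("person_id")["priority"].idxmin()
lookup = file.loc[best_rows, ["person_id", "year_of_birth"]]

result = (file.drop(columns=["year_of_birth", "priority"])
              .merge(lookup, on="person_id", how="left"))
print(result)
```

Unlike the function above, this sketch does not raise when nothing matches; a person whose degrees are all unrecognised simply keeps the year from their first row.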

Answer 3

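To apply the regular expressions, you can write a function (get_diploma) that tests them one after another, ideally ordered from most to least likely (bachelor's first).

You can then group by person_id and find the row with the highest priority (the get_expected_age function).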

import re

import pandas as pd

category1 = "^B[a-z]*|AB|A.B.|A.B|S.B."
category2 = "^P[a-z]*|Doctor of Philosophy[a-z]*"
category3 = "^M[a-z]*|Master[a-z]*"

diplomas = {category1: 'Bachelor', category2: 'PhD', category3: 'Master'}
ages = {'PhD': 33, 'Master': 30, 'Bachelor': 22}


def get_diploma(s):
    # for first matching regexp, return diploma (None if nothing matches)
    for k in diplomas:
        if re.match(k, s):
            return diplomas[k]
    return None


df['degree_standardized'] = pd.Categorical(df['degree'].map(get_diploma),
                                           ordered=True,
                                           categories=['Master', 'PhD', 'Bachelor'])
# map the age from the standardized degree. NB. this could be fused with the previous step.
df['expected_age'] = df['degree_standardized'].map(ages)

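def get_expected_age(d):
    # work on a copy so the group handed in by pandas is not modified in place
    d = d.copy()
    # get degree with highest priority (last row after sorting by the ordered categorical)
    s = d.sort_values(by='degree_standardized').iloc[-1]
    d['year_of_birth'] = s['degree_completion'] - s['expected_age']
    return d

# group_keys=False keeps the original flat index (needed on pandas >= 2.0)
df.groupby('person_id', group_keys=False).apply(get_expected_age)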

Output:

   person_id                      degree  degree_completion  year_of_birth degree_standardized expected_age
0          1                         PhD               2006           1977                 PhD           33
1          1                         BSc               1999           1977            Bachelor           22
2          2                       Ph.D.               1995           1965                 PhD           33
3          2                         MBA               2000           1965              Master           30
4          2                        B.A.               1987           1965            Bachelor           22
5          3     Bachelor of Engineering               2005           1983            Bachelor           22
6          4                          AB               1997           1975            Bachelor           22
7          4  Doctor of Philosophy (PhD)               2003           1975                 PhD           33
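
The reason this works, while the attempt in the question did not, is that an ordered Categorical compares the literal values listed in categories, from lowest to highest. The short sketch below illustrates this; the variable names are made up for the example.

```python
import pandas as pd

levels = pd.Series(pd.Categorical(
    ["PhD", "Bachelor", "Master"],
    categories=["Master", "PhD", "Bachelor"],  # lowest to highest priority
    ordered=True,
))
print(levels.sort_values().tolist())  # ['Master', 'PhD', 'Bachelor']
print(levels.max())                   # Bachelor

# Categories are matched literally, not as regular expressions: a value that
# is not in the list becomes NaN.
unmatched = pd.Categorical(["BSc"], categories=["^B[a-z]*"])
print(unmatched.isna())               # [ True]
```

This is why the degrees first have to be mapped to standardized labels before the Categorical is built.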
