Creating Prioritization in Python Pandas

I have a large dataset that shows every degree an individual holds and the year it was obtained. Each individual also has a corresponding ID. I am trying to estimate each individual's year of birth from the year a degree was completed and the average age at which that degree is completed. For the average ages, I am assuming a PhD is completed at 33, a Master's at 30, and a Bachelor's at 22. The data set looks like the following:

person_id   degree                       degree_completion   year_of_birth
1           PhD                          2006                1973
1           BSc                          1999                1977
2           Ph.D.                        1995                1962
2           MBA                          2000                1970
2           B.A.                         1987                1965
3           Bachelor of Engineering      2005                1983
4           AB                           1997                1975
4           Doctor of Philosophy (PhD)   2003                1970                          
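
As a rough sketch of how the year_of_birth column above is derived: each value is degree_completion minus the assumed average completion age. The degree_type column and the AVERAGE_AGE mapping are assumptions introduced for illustration only; the real data would first need its degree strings classified.

```python
import pandas as pd

# Assumed average completion ages, taken from the question.
AVERAGE_AGE = {"PhD": 33, "Master": 30, "Bachelor": 22}

# Hypothetical, already-classified sample (degree_type is not in the original data).
sample = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2],
    "degree_type": ["PhD", "Bachelor", "PhD", "Master", "Bachelor"],
    "degree_completion": [2006, 1999, 1995, 2000, 1987],
})

# year_of_birth = completion year - average age for that degree type
sample["year_of_birth"] = sample["degree_completion"] - sample["degree_type"].map(AVERAGE_AGE)
print(sample)
```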

I have already built the step that calculates a year of birth for each degree, but I cannot figure out how to create a priority system that picks the correct year of birth, since a different one can be calculated for each degree an individual has. I want the following prioritization: Bachelor's year of birth > PhD year of birth > Master's year of birth.

I have tried numerous things with the groupby function and the Categorical datatype. The degrees are also written in hundreds of different forms within the dataset, so I have been relying on regular expressions both to calculate the year of birth and to create the prioritization system. This is what I have currently, but I cannot find a way to work the regexes into it:

category1 = "^B[a-z]*|AB|A.B.|A.B|S.B."
category2 = "^P[a-z]*|Doctor of Philosophy[a-z]*"
category3 = "^M[a-z]*|Master[a-z]*"

file['edu_degree'] = pd.Categorical(file['edu_degree'], ordered=True, categories=[category1, category2, category3])

file.groupby('person_id')['edu_degree'].transform('max')

This would be my desired output (years of birth are replaced according to priority):

person_id   degree                       degree_completion   year_of_birth
1           PhD                          2006                1977
1           BSc                          1999                1977
2           Ph.D.                        1995                1965
2           MBA                          2000                1965
2           B.A.                         1987                1965
3           Bachelor of Engineering      2005                1983
4           AB                           1997                1975
4           Doctor of Philosophy (PhD)   2003                1975                          

Here's an idea, probably not the most elegant one (assuming your frame is named df):

import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         + r"(PhD|Ph\.D\.|Doctor.*)|"
                         + r"(MBA|Master.*))$")
def category(match):
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

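# regex=True is needed for a compiled pattern with a callable repl (pandas 2.x treats the pattern as literal by default)
df['degree_cat'] = df.degree.str.strip().str.replace(re_category, category, regex=True).astype(int)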
df_year_of_birth = (df.sort_values('degree_cat').groupby('person_id', as_index=False)
                      .year_of_birth.first())
df = df.drop(columns=['degree_cat','year_of_birth']).merge(df_year_of_birth, on='person_id')

Some explanations:

Step 1: A bit of reorganizing of your regex.

import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         + r"(PhD|Ph\.D\.|Doctor.*)|"
                         + r"(MBA|Master.*))$")

I've adjusted the patterns such that (1) the complete entry of column degree is matched, (2) more possibilities are included, and (3) the . characters are escaped. You will likely have to adjust it further! I also packed the categories into capture groups and joined them with |.
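
To see which capture group fires for a given degree string, a small check like the following may help. The which_group helper is a hypothetical name introduced here; the pattern is the one from Step 1.

```python
import re

re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         r"(PhD|Ph\.D\.|Doctor.*)|"
                         r"(MBA|Master.*))$")

def which_group(degree):
    """Return 1, 2 or 3 for the capture group that matched, or None if nothing matched."""
    match = re_category.match(degree)
    if match is None:
        return None
    # Only one alternative can participate, so lastindex is its group number.
    return match.lastindex

for degree in ["BSc", "Ph.D.", "MBA", "Doctor of Philosophy (PhD)", "Diploma"]:
    print(degree, "->", which_group(degree))
```

Running such a check over the distinct values of the degree column is one way to find spellings the pattern does not cover yet.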

Step 2: Creation of a degree_cat column (= category of the respective degree).

def category(match):
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

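# regex=True is needed for a compiled pattern with a callable repl (pandas 2.x treats the pattern as literal by default)
df['degree_cat'] = df.degree.str.strip().str.replace(re_category, category, regex=True).astype(int)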

I've used category as a repl function, which essentially replaces each match with its category number. The pandas documentation for Series.str.replace describes how a callable repl works. The strip is just a precaution. The resulting column for your sample looks like:

0    2
1    1
2    2
3    3
4    1
5    1
6    1
7    2
Name: degree_cat, dtype: int64
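
A minimal, self-contained sketch of the callable-repl mechanism used in Step 2: pandas calls the function with each regex match object and substitutes the string it returns. The simplified pattern, the to_rank name and the sample Series are made up for illustration; regex=True is passed explicitly because recent pandas versions treat the pattern as a literal string by default.

```python
import re

import pandas as pd

# Simplified stand-in pattern with one capture group per category.
pattern = re.compile(r"^(?:(B.*)|(P.*)|(M.*))$")

def to_rank(match):
    # match.groups() holds one entry per capture group; exactly one is not None.
    for rank, text in enumerate(match.groups(), start=1):
        if text is not None:
            return str(rank)

degrees = pd.Series(["BSc", "PhD", "MBA"])
ranks = degrees.str.replace(pattern, to_rank, regex=True).astype(int)
print(ranks.tolist())
```

Note that a string matching none of the groups is left unchanged by str.replace, so the final astype(int) would fail on it; that failure can serve as a signal that the pattern needs extending.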

Step 3: Select the required years of birth.

df_year_of_birth = (df.sort_values('degree_cat').groupby('person_id', as_index=False)
                      .year_of_birth.first())

Here df gets sorted by the new column and grouped by person_id, and then the first item in year_of_birth is selected (which is the required year due to the sorting). Result for your sample:

   person_id  year_of_birth
0          1           1977
1          2           1965
2          3           1983
3          4           1975

Step 4: Replace the values in year_of_birth with the required values.

df = df.drop(columns=['degree_cat','year_of_birth']).merge(df_year_of_birth, on='person_id')

Drop the old year_of_birth and degree_cat columns because they aren't needed any more, and then merge the df_year_of_birth dataframe into df on person_id to recreate the correct year_of_birth column.

End result:

   person_id                      degree  degree_completion  year_of_birth
0          1                         PhD               2006           1977
1          1                         BSc               1999           1977
2          2                       Ph.D.               1995           1965
3          2                         MBA               2000           1965
4          2                        B.A.               1987           1965
5          3     Bachelor of Engineering               2005           1983
6          4                          AB               1997           1975
7          4  Doctor of Philosophy (PhD)               2003           1975
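
As a possible shortcut, Steps 3 and 4 might be folded into a single expression with groupby(...).transform('first'), which broadcasts the selected year back to every row of the group and avoids the merge. This is a sketch on a hand-built frame that already contains a degree_cat column; it is not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "degree_cat": [2, 1, 2, 3, 1, 1, 1, 2],
    "year_of_birth": [1973, 1977, 1962, 1970, 1965, 1983, 1975, 1970],
})

# Sort by priority, then take the first year per person for every row of that person.
best = (df.sort_values("degree_cat", kind="stable")
          .groupby("person_id")["year_of_birth"]
          .transform("first"))

# Assignment aligns on the index, so the original row order is preserved.
df["year_of_birth"] = best
print(df["year_of_birth"].tolist())
```

The stable sort keeps the original row order among degrees of the same category, which makes the choice deterministic when a person has, say, two bachelor's degrees.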

This is one possible solution, maybe not the most elegant one, but it does the job:

# define custom function to get the correct years
def find_best_year(df):

    cond1 = df['degree'].str.match(category1)
    cond2 = df['degree'].str.match(category2)
    cond3 = df['degree'].str.match(category3)
    
    if cond1.any():
        return df.loc[cond1, 'year_of_birth']
    elif cond2.any():
        return df.loc[cond2, 'year_of_birth']
    elif cond3.any():
        return df.loc[cond3, 'year_of_birth']
    else:
        raise ValueError("No condition was found.")


# create lookup table with best years
lookup_df = file\
    .groupby('person_id')\
    .apply(find_best_year)\
    .reset_index()\
    .drop(columns=['level_1'])
print(lookup_df)
#    person_id  year_of_birth
# 0          1           1977
# 1          2           1965
# 2          3           1983
# 3          4           1975


# desired output
file\
    .drop(columns=['year_of_birth'])\
    .merge(lookup_df, on='person_id', how='left')
#    person_id                      degree  degree_completion  year_of_birth
# 0          1                         PhD               2006           1977
# 1          1                         BSc               1999           1977
# 2          2                       Ph.D.               1995           1965
# 3          2                         MBA               2000           1965
# 4          2                        B.A.               1987           1965
# 5          3     Bachelor of Engineering               2005           1983
# 6          4                          AB               1997           1975
# 7          4  Doctor of Philosophy (PhD)               2003           1975
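
One caveat with find_best_year: if a person holds two degrees in the same category (say, two bachelor's degrees), the function returns two rows, and the later merge would duplicate that person's records. A hedged variant that returns a single value per person might look like this. The pick_year name, the patterns list and the sample frame are illustrative, and the patterns are simplified, escaped versions of the ones in the question.

```python
import pandas as pd

# Priority-ordered patterns (Bachelor > PhD > Master); simplified for this sketch.
patterns = [r"B|AB|A\.B\.?|S\.B\.", r"P|Doctor of Philosophy", r"M"]

def pick_year(group):
    """Return one year of birth: the first row of the highest-priority category present."""
    for pattern in patterns:
        hits = group["degree"].str.match(pattern)
        if hits.any():
            return group.loc[hits, "year_of_birth"].iloc[0]
    raise ValueError("No known degree category found.")

# Made-up sample: person 1 has two bachelor's degrees.
file = pd.DataFrame({
    "person_id": [1, 1, 1, 2],
    "degree": ["PhD", "BSc", "B.A.", "MBA"],
    "year_of_birth": [1973, 1977, 1979, 1970],
})

# Build a person_id -> year lookup, then map it back onto every row.
lookup = {pid: pick_year(group) for pid, group in file.groupby("person_id")}
file["year_of_birth"] = file["person_id"].map(lookup)
print(file)
```

Taking the first hit is an arbitrary tie-break; another rule, such as the earliest completion year, could be substituted if it suits the data better.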

To apply the regexes, you can write a function (get_diploma) that tests them one after the other, ideally in order from most to least probable (Bachelor first).

Then you can group by person_id and find the row with the highest priority (the get_expected_age function).

import re

import pandas as pd

category1 = "^B[a-z]*|AB|A.B.|A.B|S.B."
category2 = "^P[a-z]*|Doctor of Philosophy[a-z]*"
category3 = "^M[a-z]*|Master[a-z]*"

diplomas = {category1: 'Bachelor', category2: 'PhD', category3: 'Master'}
ages = {'PhD': 33, 'Master': 30, 'Bachelor': 22}


def get_diploma(s):
    # for first matching regexp, return diploma
    for k in diplomas:
        if re.match(k, s):
            return diplomas[k]
    

        
df['degree_standardized'] = pd.Categorical(df['degree'].map(get_diploma),
                                           ordered=True,
                                           categories=['Master', 'PhD', 'Bachelor'])
# map the age from the standardized degree. NB. this could be fused with the previous step.
df['expected_age'] = df['degree_standardized'].map(ages)

def get_expected_age(d):
    # get degree with highest priority
    s = d.sort_values(by='degree_standardized').iloc[-1]
    d['year_of_birth'] = s['degree_completion']-s['expected_age']
    return d

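# group_keys=False keeps the original flat index; assign the result to keep it
df = df.groupby('person_id', group_keys=False).apply(get_expected_age)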

Output:

   person_id                      degree  degree_completion  year_of_birth degree_standardized expected_age
0          1                         PhD               2006           1977                 PhD           33
1          1                         BSc               1999           1977            Bachelor           22
2          2                       Ph.D.               1995           1965                 PhD           33
3          2                         MBA               2000           1965              Master           30
4          2                        B.A.               1987           1965            Bachelor           22
5          3     Bachelor of Engineering               2005           1983            Bachelor           22
6          4                          AB               1997           1975            Bachelor           22
7          4  Doctor of Philosophy (PhD)               2003           1975                 PhD           33
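
The ordered Categorical works here, while putting regex strings directly into categories (as in the question) does not, because categories are literal labels: a value that is not exactly equal to one of them becomes NaN. Once the degrees are standardized, max() and comparisons respect the declared order. A small sketch with made-up values:

```python
import pandas as pd

raw = pd.Series(["PhD", "BSc", "MBA"])

# Regex strings are treated as plain labels, so no value matches a category.
as_regex = pd.Categorical(raw, ordered=True,
                          categories=["^B[a-z]*", "^P[a-z]*", "^M[a-z]*"])
print(pd.isna(as_regex).all())

# With standardized labels the declared order Master < PhD < Bachelor is honoured.
standardized = pd.Series(pd.Categorical(["PhD", "Bachelor", "Master"], ordered=True,
                                        categories=["Master", "PhD", "Bachelor"]))
print(standardized.max())
```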
