Creating Prioritization in Python Pandas
I have a large dataset that shows every degree that an individual has and the year that it was obtained. Also, each individual has a corresponding ID. I am trying to find the year of birth of each individual using the year the degree was completed and the average age at which that degree is completed.
For the average ages, I am assuming a PhD is completed at 33, a Master's at 30, and a Bachelor's at 22. The data set looks like the following:
person_id  degree                      degree_completion  year_of_birth
1          PhD                         2006               1973
1          BSc                         1999               1977
2          Ph.D.                       1995               1962
2          MBA                         2000               1970
2          B.A.                        1987               1965
3          Bachelor of Engineering     2005               1983
4          AB                          1997               1975
4          Doctor of Philosophy (PhD)  2003               1970
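Each year_of_birth value in the table is the completion year minus the assumed average age, for example 2006 - 33 = 1973 for the first row. A minimal sketch of that arithmetic follows, assuming the degrees have already been normalized to three labels. The degree_level column and the AVERAGE_AGE name are hypothetical and only serve the illustration.

```python
import pandas as pd

# Assumed average completion ages, as stated in the question.
AVERAGE_AGE = {"PhD": 33, "Master": 30, "Bachelor": 22}

# Hypothetical frame: degree_level holds an already normalized degree label.
df = pd.DataFrame({
    "person_id": [1, 1],
    "degree_level": ["PhD", "Bachelor"],
    "degree_completion": [2006, 1999],
})

# Estimated birth year = completion year - average age for that degree level.
df["year_of_birth"] = df["degree_completion"] - df["degree_level"].map(AVERAGE_AGE)
years = df["year_of_birth"].tolist()
print(years)
```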
I have already created the system that calculates the year of birth of each individual, but I cannot figure out how to create a priority system that picks the correct year of birth, since a different one can be calculated for each degree the individual has. I want the following prioritization: Bachelor's year of birth > PhD year of birth > Master's year of birth.
I have tried numerous things with the groupby function and the Categorical datatype. Also, the degrees are written in hundreds of different forms within the dataset, so I have been relying on regular expressions both to calculate the year of birth and to create the prioritization system. This is what I have currently, but I cannot find a way to implement regex into this:
category1 = "^B[a-z]*|AB|A.B.|A.B|S.B."
category2 = "^P[a-z]*|Doctor of Philosophy[a-z]*"
category3 = "^M[a-z]*|Master[a-z]*"
file['edu_degree'] = pd.Categorical(file['edu_degree'], ordered=True, categories=[category1, category2, category3])
file.groupby('person_id')['edu_degree'].transform('max')
Also, this would be my desired output (years of birth are replaced according to priority):
person_id  degree                      degree_completion  year_of_birth
1          PhD                         2006               1977
1          BSc                         1999               1977
2          Ph.D.                       1995               1965
2          MBA                         2000               1965
2          B.A.                        1987               1965
3          Bachelor of Engineering     2005               1983
4          AB                          1997               1975
4          Doctor of Philosophy (PhD)  2003               1975
Here's an idea, probably not the most elegant one (the assumption is that your frame is named df):
import re

# One capturing group per category: 1 = Bachelor's, 2 = PhD, 3 = Master's
re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
                         + r"(PhD|Ph\.D\.|Doctor.*)|"
                         + r"(MBA|Master.*))$")

def category(match):
    # Return the number of the group that matched, as a string
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

# regex=True is required for a compiled pattern / callable repl in pandas >= 2.0
df['degree_cat'] = (df.degree.str.strip()
                    .str.replace(re_category, category, regex=True).astype(int))
df_year_of_birth = (df.sort_values('degree_cat').groupby('person_id', as_index=False)
                    .year_of_birth.first())
df = df.drop(columns=['degree_cat','year_of_birth']).merge(df_year_of_birth, on='person_id')
Some explanations:
Step 1: A bit of reorganizing of your regex.
import re
re_category = re.compile(r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
+ r"(PhD|Ph\.D\.|Doctor.*)|"
+ r"(MBA|Master.*))$")
I've adjusted the patterns such that (1) the complete entry of column degree is matched, (2) more possibilities are included, and (3) the .s are escaped. It's likely that you have to adjust it further! And I packed the categories in groups and concatenated them with |.
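To see which group a given degree string falls into, the pattern can be tried on a few values outside of pandas. This is a minimal sketch, not part of the solution itself. The helper degree_priority is a hypothetical name, and it reads match.lastindex (the index of the last capturing group that matched) instead of looping over the groups.

```python
import re

re_category = re.compile(
    r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
    r"(PhD|Ph\.D\.|Doctor.*)|"
    r"(MBA|Master.*))$"
)

def degree_priority(degree):
    """Return 1 (Bachelor's), 2 (PhD), 3 (Master's), or None if nothing matches."""
    match = re_category.match(degree.strip())
    # Only one of the three alternative groups can take part in a match,
    # so lastindex is the number of that group.
    return match.lastindex if match else None

samples = ["PhD", "BSc", "Ph.D.", "MBA", "B.A.", "Bachelor of Engineering",
           "AB", "Doctor of Philosophy (PhD)", "Diploma"]
priorities = [degree_priority(s) for s in samples]
print(priorities)  # [2, 1, 2, 3, 1, 1, 1, 2, None]
```

The last sample ("Diploma") is made up to show that spellings not covered by the pattern return None and therefore need a pattern extension.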
Step 2: Creation of a degree_cat column (= category of the respective degree).
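def category(match):
    # Return the number of the group that matched, as a string
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

# regex=True is required for a compiled pattern / callable repl in pandas >= 2.0
df['degree_cat'] = (df.degree.str.strip()
                    .str.replace(re_category, category, regex=True).astype(int))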
I've used category as a repl function, which essentially replaces matches with their category: str.replace calls it with each match object and substitutes the string it returns. The strip is just a precaution. The resulting column for your sample looks like:
0 2
1 1
2 2
3 3
4 1
5 1
6 1
7 2
Name: degree_cat, dtype: int64
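One caveat with hundreds of spellings: str.replace leaves entries that do not match the pattern unchanged, so the astype(int) call raises on them. A possible way to find those entries first is sketched below, assuming pd.to_numeric with errors="coerce" is acceptable for flagging them. The sample frame, including the "Diploma" value, is made up for the illustration.

```python
import re
import pandas as pd

re_category = re.compile(
    r"^(?:(Bachelor.*|BSc|B\.A\.|AB|A\.B\.|A\.B|S\.B\.)|"
    r"(PhD|Ph\.D\.|Doctor.*)|"
    r"(MBA|Master.*))$"
)

def category(match):
    # Return the number of the group that matched, as a string.
    for i, group in enumerate(match.group(1, 2, 3), start=1):
        if group:
            return str(i)

# Made-up sample; "Diploma" is not covered by the pattern.
df = pd.DataFrame({"degree": ["PhD", " BSc ", "Diploma"]})

# regex=True is passed explicitly because recent pandas versions default to
# literal matching, which does not accept a compiled pattern or a callable.
replaced = df["degree"].str.strip().str.replace(re_category, category, regex=True)

# Unmatched strings come back unchanged, so coerce them to NaN instead of
# letting an integer conversion fail.
df["degree_cat"] = pd.to_numeric(replaced, errors="coerce")
unmatched = df.loc[df["degree_cat"].isna(), "degree"].tolist()
print(unmatched)
```

The unmatched list shows which spellings still have to be added to the pattern before converting the column to int.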
Step 3: Select the required years of birth.
df_year_of_birth = (df.sort_values('degree_cat').groupby('person_id', as_index=False)
.year_of_birth.first())
Here df gets sorted by the new column, grouped by person_id, and then the first item in year_of_birth is selected (which is the required year due to the sorting). Result for your sample:
person_id year_of_birth
0 1 1977
1 2 1965
2 3 1983
3 4 1975
Step 4: Replace the values in year_of_birth with the required values.
df = df.drop(columns=['degree_cat','year_of_birth']).merge(df_year_of_birth, on='person_id')
Drop the old year_of_birth and the degree_cat columns because they aren't needed any more, and then merge the df_year_of_birth dataframe onto df along person_id to recreate the right year_of_birth column.
End result:
person_id degree degree_completion year_of_birth
0 1 PhD 2006 1977
1 1 BSc 1999 1977
2 2 Ph.D. 1995 1965
3 2 MBA 2000 1965
4 2 B.A. 1987 1965
5 3 Bachelor of Engineering 2005 1983
6 4 AB 1997 1975
7 4 Doctor of Philosophy (PhD) 2003 1975
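As a possible variation on steps 3 and 4, the sort can be combined with groupby(...).transform('first'), which broadcasts the selected year back to every row of the person and so avoids the separate merge. This is only a sketch of an alternative, not the approach used above. The degree_cat values are typed in by hand here to keep the example self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    "person_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "degree": ["PhD", "BSc", "Ph.D.", "MBA", "B.A.",
               "Bachelor of Engineering", "AB", "Doctor of Philosophy (PhD)"],
    "degree_completion": [2006, 1999, 1995, 2000, 1987, 2005, 1997, 2003],
    "year_of_birth": [1973, 1977, 1962, 1970, 1965, 1983, 1975, 1970],
    # Category per row as produced in step 2 (1 = Bachelor's, 2 = PhD, 3 = Master's).
    "degree_cat": [2, 1, 2, 3, 1, 1, 1, 2],
})

# Sort by priority, take the first year per person, and broadcast it to all
# rows of that person; the assignment aligns on the original index.
df["year_of_birth"] = (
    df.sort_values("degree_cat", kind="stable")
      .groupby("person_id")["year_of_birth"]
      .transform("first")
)
result = df["year_of_birth"].tolist()
print(result)  # [1977, 1977, 1965, 1965, 1965, 1983, 1975, 1975]
```

The row order of df is untouched, because only the intermediate copy is sorted.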
This is one possible solution, maybe not the most elegant one, but it still does its job:
# define custom function to get the correct years
# (category1, category2 and category3 are the regex strings from the question)
def find_best_year(df):
    cond1 = df['degree'].str.match(category1)
    cond2 = df['degree'].str.match(category2)
    cond3 = df['degree'].str.match(category3)

    if cond1.any():
        return df.loc[cond1, 'year_of_birth']
    elif cond2.any():
        return df.loc[cond2, 'year_of_birth']
    elif cond3.any():
        return df.loc[cond3, 'year_of_birth']
    else:
        raise ValueError("No condition was found.")
# create lookup table with best years
lookup_df = file\
.groupby('person_id')\
.apply(find_best_year)\
.reset_index()\
.drop(columns=['level_1'])
print(lookup_df)
# person_id year_of_birth
# 0 1 1977
# 1 2 1965
# 2 3 1983
# 3 4 1975
# desired output
file\
.drop(columns=['year_of_birth'])\
.merge(lookup_df, on='person_id', how='left')
# person_id degree degree_completion year_of_birth
# 0 1 PhD 2006 1977
# 1 1 BSc 1999 1977
# 2 2 Ph.D. 1995 1965
# 3 2 MBA 2000 1965
# 4 2 B.A. 1987 1965
# 5 3 Bachelor of Engineering 2005 1983
# 6 4 AB 1997 1975
# 7 4 Doctor of Philosophy (PhD) 2003 1975
To apply the regex, you can make a function (get_diploma) to test the patterns one after the other, ideally in the order of the most probable ones (Bachelor first).
Then you can group by person_id and find the line with the highest priority (get_expected_age function).
import re
import pandas as pd

category1 = "^B[a-z]*|AB|A.B.|A.B|S.B."
category2 = "^P[a-z]*|Doctor of Philosophy[a-z]*"
category3 = "^M[a-z]*|Master[a-z]*"

diplomas = {category1: 'Bachelor', category2: 'PhD', category3: 'Master'}
ages = {'PhD': 33, 'Master': 30, 'Bachelor': 22}

def get_diploma(s):
    # for first matching regexp, return diploma
    for k in diplomas:
        if re.match(k, s):
            return diplomas[k]

# categories are listed from lowest to highest priority
df['degree_standardized'] = pd.Categorical(df['degree'].map(get_diploma),
                                           ordered=True,
                                           categories=['Master', 'PhD', 'Bachelor'])

# map the age from the standardized degree. NB. this could be fused with the previous step.
df['expected_age'] = df['degree_standardized'].map(ages)

def get_expected_age(d):
    # get degree with highest priority (last row after sorting)
    s = d.sort_values(by='degree_standardized').iloc[-1]
    d['year_of_birth'] = s['degree_completion']-s['expected_age']
    return d

# group_keys=False keeps the original flat index in the result
df.groupby('person_id', group_keys=False).apply(get_expected_age)
Output:
person_id degree degree_completion year_of_birth degree_standardized expected_age
0 1 PhD 2006 1977 PhD 33
1 1 BSc 1999 1977 Bachelor 22
2 2 Ph.D. 1995 1965 PhD 33
3 2 MBA 2000 1965 Master 30
4 2 B.A. 1987 1965 Bachelor 22
5 3 Bachelor of Engineering 2005 1983 Bachelor 22
6 4 AB 1997 1975 Bachelor 22
7 4 Doctor of Philosophy (PhD) 2003 1975 PhD 33
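The reason the last row after sorting is the one with the highest priority is the order of the categories list: an ordered Categorical compares values by their position in that list, not alphabetically. A small check of that behaviour, using the same three standardized labels:

```python
import pandas as pd

# Categories are listed from lowest to highest priority.
degrees = pd.Series(pd.Categorical(
    ["PhD", "Master", "Bachelor"],
    ordered=True,
    categories=["Master", "PhD", "Bachelor"],
))

# Sorting and max() follow the category order given above.
ordered = degrees.sort_values().tolist()
top = degrees.max()
print(ordered)  # ['Master', 'PhD', 'Bachelor']
print(top)      # Bachelor
```

Changing the prioritization therefore only requires reordering the categories list.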