Create new column based on multiple groupby conditions
I want a new column in this df with the following condition. The column education is a categorical value that goes from 1 to 5 (1 is the lowest level of education and 5 is the highest). I want to create a function with the following logic (so as to create a new column in the df):
First, for each id, check whether there is at least one graduated education level; if so, the new column must hold the highest graduated level of education.
Second, if there is no graduated education level for a particular id (all of its education levels are "In course"), take the maximum level of education and subtract one.
df
id education stage
1 2 Graduated
1 3 Graduated
1 4 In course
2 3 In course
3 2 Graduated
3 3 In course
4 2 In course
Expected output:
id education stage new_column
1 2 Graduated 3
1 3 Graduated 3
1 4 In course 3
2 3 In course 2
3 2 Graduated 2
3 3 In course 2
4 2 In course 1
You can do it like this:
import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4], 'education': [2, 3, 4, 3, 2, 3, 2],
'stage': ['Graduated', 'Graduated', 'In course', 'In course', 'Graduated', 'In course', 'In course']})
max_gr = df[df.stage == 'Graduated'].groupby('id').education.max()
max_ic = df[df.stage == 'In course'].groupby('id').education.max()
# set all cells to the value from max_gr (NaN where an id has no 'Graduated' row)
df['new_col'] = df.id.map(max_gr)
# set cells that have not been filled to the value from max_ic - 1
df.loc[df.new_col.isna(), 'new_col'] = df.id.map(max_ic - 1)
series.map(other_series) returns a new series in which each value of series has been replaced by the value stored under that index label in other_series; values with no matching label become NaN.
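To make the lookup behaviour concrete, here is a minimal sketch. The names ids and lookup and their values are made up for illustration; only Series.map itself comes from the answer above.

```python
import pandas as pd

ids = pd.Series([1, 1, 2, 4], name="id")
# Lookup table: the index holds the id, the values hold the level to assign.
lookup = pd.Series({1: 3, 2: 2})

mapped = ids.map(lookup)
# id 4 has no entry in lookup, so its row comes back as NaN
# and the result is upcast to float.
print(mapped.tolist())
```

The NaN rows are exactly the ones the second step of the answer fills from max_ic - 1.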
This is one way.
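# Note: this answer's frame spells the stage 'InCourse' (see Result), not 'In course'.
# Mask rather than filter, so each id's maximum reaches all of its rows.
grad = df['education'].where(df['stage'] == 'Graduated')
in_course = df['education'].where(df['stage'] == 'InCourse')
df['new'] = grad.groupby(df['id']).transform('max')\
              .fillna(in_course.groupby(df['id']).transform('max').sub(1))\
              .astype(int)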
Result
id education stage new
0 1 2 Graduated 3
1 1 3 Graduated 3
2 1 4 InCourse 3
3 2 3 InCourse 2
4 3 2 Graduated 2
5 3 3 InCourse 2
6 4 2 InCourse 1
Explanation
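A grouped transform only returns values for the rows it was given. The sketch below, on a hypothetical three-row frame (demo and its values are made up for illustration), compares filtering the Graduated rows before grouping with masking the other rows to NaN. It is an illustration of index alignment under those assumptions, not a statement of what the answer's author intended.

```python
import pandas as pd

# Hypothetical data: id 1 graduated at level 3 and is in course at level 5.
demo = pd.DataFrame({
    "id": [1, 1, 2],
    "education": [3, 5, 3],
    "stage": ["Graduated", "InCourse", "InCourse"],
})
is_grad = demo["stage"] == "Graduated"

# Filtering first: the result only carries the index of the Graduated rows.
filtered = demo.loc[is_grad].groupby("id")["education"].transform("max")

# Masking first: non-graduated levels become NaN but the rows stay,
# so each id's graduated maximum is broadcast to all of its rows.
masked = demo["education"].where(is_grad).groupby(demo["id"]).transform("max")

print(filtered.index.tolist())
print(masked.tolist())
```

With filtering, the in-course row of id 1 would be left empty and then filled with 5 - 1 = 4 by a later fillna; with masking it receives the graduated maximum 3, which is what the first rule in the question asks for.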
Alternative solution based on Markus Löffler's answer.
max_ic = df[df.stage.eq('In course')].groupby('id').education.max() - 1
max_gr = df[df.stage.eq('Graduated')].groupby('id').education.max()
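# Prefer the graduated maximum; fall back to (in-course maximum - 1).
# combine_first also keeps ids that only have Graduated rows, which
# max_ic.update(max_gr) would leave out of the lookup.
new_values = max_gr.combine_first(max_ic)
df['new_col'] = df.id.map(new_values)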
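For anyone who wants to check the result against the expected output in the question, here is a minimal sketch that wraps the same idea in a function. The name highest_completed is hypothetical, and the sketch uses Series.combine_first in place of update so that ids with only Graduated rows are kept; treat it as one possible arrangement rather than the definitive one.

```python
import pandas as pd

def highest_completed(frame):
    """Hypothetical helper: per-row level following the two rules in the question."""
    grad = frame.loc[frame["stage"] == "Graduated"].groupby("id")["education"].max()
    in_course = frame.loc[frame["stage"] == "In course"].groupby("id")["education"].max() - 1
    # Graduated maximum where it exists, otherwise in-course maximum minus one.
    per_id = grad.combine_first(in_course)
    return frame["id"].map(per_id).astype(int)

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 3, 3, 4],
    "education": [2, 3, 4, 3, 2, 3, 2],
    "stage": ["Graduated", "Graduated", "In course", "In course",
              "Graduated", "In course", "In course"],
})
df["new_column"] = highest_completed(df)

# Compare with the expected output given in the question.
assert df["new_column"].tolist() == [3, 3, 3, 2, 2, 2, 1]
```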