
Create new column based on multiple groupby conditions

I want to add a new column to this df under the following conditions. The column education is a categorical value that goes from 1 to 5 (1 is the lowest level of education and 5 is the highest). I want to write a function with the following logic, so as to create the new column in the df:

First, for each id, check whether at least one education level has been graduated; if so, the new column must hold the highest graduated level of education.

Second, if a particular id has no graduated education level (all of its education levels are "In course"), take the maximum level of education and subtract one.

df
id  education stage
1   2         Graduated
1   3         Graduated
1   4         In course
2   3         In course
3   2         Graduated
3   3         In course
4   2         In course

Expected output:

id  education stage       new_column
1   2         Graduated   3
1   3         Graduated   3
1   4         In course   3
2   3         In course   2
3   2         Graduated   2
3   3         In course   2
4   2         In course   1
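
The two-step rule described above can be sketched group by group in plain pandas. This is only an illustrative reading of the rule, not one of the answers; the helper name level_for_id is hypothetical, and the column is named new_column to match the expected output.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 2, 3, 3, 4],
    "education": [2, 3, 4, 3, 2, 3, 2],
    "stage": ["Graduated", "Graduated", "In course", "In course",
              "Graduated", "In course", "In course"],
})

def level_for_id(group):
    """Hypothetical helper: apply the two-step rule to the rows of one id."""
    graduated = group.loc[group["stage"] == "Graduated", "education"]
    if not graduated.empty:
        # Step 1: at least one graduated level, so take the highest one.
        return int(graduated.max())
    # Step 2: everything is "In course", so take the highest level minus one.
    return int(group["education"].max()) - 1

# One value per id, then broadcast back to the rows through the id column.
levels = {key: level_for_id(group) for key, group in df.groupby("id")}
df["new_column"] = df["id"].map(levels)
print(df)
```

The explicit loop is slower than a vectorised solution on large frames, but it keeps the rule readable and easy to check against the expected output.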

You can do it like this:

import pandas as pd
df = pd.DataFrame({'id': [1, 1, 1, 2, 3, 3, 4], 'education': [2, 3, 4, 3, 2, 3, 2],
                   'stage': ['Graduated', 'Graduated', 'In course', 'In course', 'Graduated', 'In course', 'In course']})


max_gr = df[df.stage == 'Graduated'].groupby('id').education.max()
max_ic = df[df.stage == 'In course'].groupby('id').education.max()

# set all cells to the highest graduated level for their id (from max_gr)
df['new_col'] = df.id.map(max_gr)
# fill cells that are still empty (ids with no graduated level) with max_ic - 1
df['new_col'] = df.new_col.fillna(df.id.map(max_ic - 1)).astype(int)

series.map(other_series) returns a new Series in which each value of series is replaced by the value stored under that index label in other_series. Values that have no matching label in other_series become NaN.
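
As a small illustration of that lookup behaviour, here is a sketch with made-up values (ids and lookup are hypothetical names, not part of the answer):

```python
import pandas as pd

ids = pd.Series([1, 1, 2, 4])

# The index of the lookup Series plays the role of the key,
# the values are what gets substituted.
lookup = pd.Series({1: 3, 2: 7})

mapped = ids.map(lookup)

# id 4 has no entry in lookup, so its row comes back as NaN
# (which also makes the result a float Series).
print(mapped)
```

This is why the answer needs a second step: ids with no 'Graduated' row are missing from max_gr and are left as NaN after the first map.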

This is one way.

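# highest "Graduated" level per id, broadcast to every row of that id
# (NaN for ids with no graduated level)
df['new'] = df['education'].where(df['stage'] == 'Graduated')\
              .groupby(df['id'])\
              .transform('max')

# for ids with no graduated level, use their highest level minus 1
df['new'] = df['new'].fillna(df.groupby('id')['education']\
                               .transform('max').sub(1)).astype(int)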

Result

   id  education      stage  new
0   1          2  Graduated    3
1   1          3  Graduated    3
2   1          4   InCourse    3
3   2          3   InCourse    2
4   3          2  Graduated    2
5   3          3   InCourse    2
6   4          2   InCourse    1

Explanation

  • First, take the highest "Graduated" education level of each id.
  • Second, fill the remaining gaps with the highest "InCourse" education level of the id, minus 1.
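
One detail worth seeing in isolation is how pandas aligns on the index when a result computed from a filtered subset is assigned back to the full frame. The sketch below uses a made-up three-row frame, not the question's data:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 2],
    "education": [2, 4, 3],
    "stage": ["Graduated", "InCourse", "InCourse"],
})

# transform returns one value per input row, so the result only carries
# the index labels of the rows that survived the filter.
grad = (df.loc[df["stage"] == "Graduated"]
          .groupby("id")["education"]
          .transform("max"))
print(grad.index.tolist())

# Assignment aligns on the index: rows that were filtered out get NaN,
# even when they belong to an id that does have a graduated level.
df["new"] = grad
print(df)
```

Here the "InCourse" row of id 1 is left as NaN, so whatever fills the gaps has to produce the graduated maximum for that row, not an "InCourse" value.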

Alternative solution based on Markus Löffler's answer.

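max_ic = df[df.stage.eq('In course')].groupby('id').education.max() - 1
max_gr = df[df.stage.eq('Graduated')].groupby('id').education.max()

# Prefer max_gr; fall back to max_ic for ids with no graduated level.
# (Series.update would skip ids that are missing from max_ic.)
new_col = max_gr.combine_first(max_ic)

df['new_col'] = df.id.map(new_col).astype(int)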
