
Comparing rows of string inside groupby and assigning a value to a new column pandas

I have a dataset of employees (their IDs) and the names of their bosses for several years.

df:

(image not included)

What I need to do is find out whether an employee had a change of boss. The desired output is:

(image not included)

For employees who appear in the df only once, I just assign 0 (no change of boss). However, I cannot figure out how to do it for employees who are in the df for several years.

My idea is to first assign 0 for the first year an employee appears in the df (we do not know who the boss was before, so we cannot say there was a change). Then I need to compare the name of the boss in each row with the name in the next row for the same employee and assign 1 or 0 to the ManagerChange column accordingly.

So far I have split the df into two (unique IDs and duplicated IDs) and assigned 0 to ManagerChange for the unique IDs.

Then I group the duplicated IDs by ID and sort them by year. However, I am new to Python and cannot figure out how to compare strings and assign the result to a new column inside the groupby. Please help.

Code I have so far:

# splitting database in two
bool_series = df["ID"].duplicated(keep=False)

# .copy() avoids a SettingWithCopyWarning when a column is added below
df_duplicated = df[bool_series].copy()

df_unique = df[~bool_series].copy()

# assigning 0 for ManagerChange for the unique IDs
df_unique['ManagerChange'] = 0

# groupby by ID and sorting by year for the duplicated IDs
df_duplicated.groupby('ID').apply(lambda x: x.sort_values('Year'))

You can group by ID, shift() the Boss column within each group, and compare the shifted column with the original Boss column.

# Sort by ID and Year first, so that the previous row within an ID is the previous year
df.sort_values(['ID', 'Year'], inplace=True)

# Compare Boss column with shifted Boss column
df['ManagerChange'] = df.groupby('ID').apply(lambda group: group['Boss'] != group['Boss'].shift(1)).tolist()

# Change True to 1, False to 0
df['ManagerChange'] = df['ManagerChange'].map({True: 1, False: 0})

# Change the first (earliest) row in each group to 0.
# Do this while df is still sorted by Year; after sort_index(), head(1)
# would pick the first row in the original order, not the earliest year.
df.loc[df.groupby('ID').head(1).index, 'ManagerChange'] = 0

# Restore the original row order
df = df.sort_index()
print(df)

     ID  Year     Boss  ManagerChange
0  1234  2018     Anna              0
1   567  2019    Sarah              0
2  1234  2020  Michael              0
3  8976  2019     John              0
4  1234  2019  Michael              1
5  8976  2020     John              0
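As a rough sketch of the same idea without `apply`, the previous boss can be taken with `groupby('ID')['Boss'].shift()` and compared directly. The sample data here is reconstructed from the printed result above, and `add_manager_change` is a hypothetical helper name, not something from the original post. The sketch assumes the DataFrame index is unique, because the result is aligned back to the original rows by index.

```python
import pandas as pd

# Sample data reconstructed from the printed result above.
df = pd.DataFrame({
    "ID": [1234, 567, 1234, 8976, 1234, 8976],
    "Year": [2018, 2019, 2020, 2019, 2019, 2020],
    "Boss": ["Anna", "Sarah", "Michael", "John", "Michael", "John"],
})


def add_manager_change(frame):
    """Hypothetical helper: return a copy of frame with a ManagerChange column."""
    # Order rows so the previous row within an ID is the previous year.
    ordered = frame.sort_values(["ID", "Year"])
    # Boss in the previous year for the same ID; NaN for the first year.
    prev_boss = ordered.groupby("ID")["Boss"].shift()
    # A change needs a known previous boss that differs from the current one.
    changed = prev_boss.notna() & (ordered["Boss"] != prev_boss)
    out = frame.copy()
    # Assignment aligns on the index, so the original row order is kept.
    out["ManagerChange"] = changed.astype(int)
    return out


result = add_manager_change(df)
print(result)
```

Because the first year of each ID has no previous boss, `prev_boss.notna()` already yields 0 for it, so no separate `df.loc[]` step is needed in this variant.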

You could also use the fill_value argument of shift(). Filling the first row of each group with that group's own first Boss value makes the comparison False there, so the df.loc[] step is no longer needed.

# Sort by ID and Year first
df.sort_values(['ID', 'Year'], inplace=True)

df['ManagerChange'] = df.groupby('ID').apply(lambda group: group['Boss'] != group['Boss'].shift(1, fill_value=group['Boss'].iloc[0])).tolist()

# Change True to 1, False to 0
df['ManagerChange'] = df['ManagerChange'].map({True: 1, False: 0})

# Restore the original row order
df = df.sort_index()
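A similar effect can probably be had without `apply` by filling the missing previous boss with the current boss, so that the first year of each ID compares as unchanged. This is only a sketch on sample data reconstructed from the output shown earlier. It assumes the Boss column has no missing values of its own and that the index is unique.

```python
import pandas as pd

# Sample data reconstructed from the output shown earlier.
df = pd.DataFrame({
    "ID": [1234, 567, 1234, 8976, 1234, 8976],
    "Year": [2018, 2019, 2020, 2019, 2019, 2020],
    "Boss": ["Anna", "Sarah", "Michael", "John", "Michael", "John"],
})

# Order rows so the previous row within an ID is the previous year.
ordered = df.sort_values(["ID", "Year"])

# Previous boss per ID; the first year has none, so fall back to the
# current boss there, which makes the comparison below False.
prev_boss = ordered.groupby("ID")["Boss"].shift().fillna(ordered["Boss"])

# Index alignment writes each value back to its original row.
df["ManagerChange"] = (ordered["Boss"] != prev_boss).astype(int)
print(df)
```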
