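Python dataframe assign values based on another column with condition
I have a df with date, name, and number columns. If a name occurs more than 3 times, I want to flag that name's records: the status of the records with the oldest dates should be set to old_employee.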
date name number
2021-05-06T07:35:03.000Z mark 123
2021-04-06T07:35:03.000Z mark 123
2021-03-03T07:35:03.000Z mark 123
2021-02-03T07:35:03.000Z mark 123
2021-05-06T07:35:03.000Z tom 4123
2021-04-06T07:35:03.000Z tom 4123
2021-03-03T07:35:03.000Z tom 4123
2021-02-06T07:35:03.000Z john 512
2021-02-06T07:35:03.000Z wood 512
2021-02-06T07:35:03.000Z wood 512
2020-05-06T07:35:03.000Z paul 723
2020-04-06T07:35:03.000Z paul 723
2020-03-03T07:35:03.000Z paul 723
2020-02-03T07:35:03.000Z paul 723
2020-02-03T05:35:03.000Z paul 723
2020-02-02T07:35:03.000Z paul 723
2020-02-01T07:35:03.000Z paul 723
2020-05-06T07:35:03.000Z tomy 623
2020-04-06T07:35:03.000Z tomy 623
2020-03-03T07:35:03.000Z tomy 623
2020-02-03T07:35:03.000Z tomy 623
2020-02-03T05:35:03.000Z tomy 623
2020-02-02T07:35:03.000Z tomy 623
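If the same name has more than 3 records, the records with the oldest dates (everything beyond the three most recent for that name) must be marked as old_employee.
Expected output: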
date name number status
2021-05-06T07:35:03.000Z mark 123
2021-04-06T07:35:03.000Z mark 123
2021-03-03T07:35:03.000Z mark 123
2021-02-03T07:35:03.000Z mark 123 old_employee
2021-05-06T07:35:03.000Z tom 4123
2021-04-06T07:35:03.000Z tom 4123
2021-03-03T07:35:03.000Z tom 4123
2021-02-06T07:35:03.000Z john 512
2021-02-06T07:35:03.000Z wood 512
2021-02-06T07:35:03.000Z wood 512
2020-05-06T07:35:03.000Z paul 723
2020-04-06T07:35:03.000Z paul 723
2020-03-03T07:35:03.000Z paul 723
2020-02-03T07:35:03.000Z paul 723 old_employee
2020-02-03T05:35:03.000Z paul 723 old_employee
2020-02-02T07:35:03.000Z paul 723 old_employee
2020-02-01T07:35:03.000Z paul 723 old_employee
2020-05-06T07:35:03.000Z tomy 623
2020-04-06T07:35:03.000Z tomy 623
2020-03-03T07:35:03.000Z tomy 623
2020-02-03T07:35:03.000Z tomy 623 old_employee
2020-02-03T05:35:03.000Z tomy 623 old_employee
2020-02-02T07:35:03.000Z tomy 623 old_employee
I tried this:
(df.groupby('name', as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-3]])
.reset_index(level=0, drop=True))
You can try:
df.loc[df.groupby('name').cumcount() >= 3, 'status'] = 'old_employee'
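To see why the one-liner above works, here is a minimal sketch on a hypothetical toy frame (the `toy` and `position` names are illustrative and not from the question). It assumes, as the question's data does, that rows are already ordered newest-first within each name.

```python
import pandas as pd

# Hypothetical toy frame: "a" appears four times, "b" twice.
toy = pd.DataFrame({"name": ["a", "a", "a", "a", "b", "b"]})

# cumcount numbers the rows of each group 0, 1, 2, ... in their current order.
position = toy.groupby("name").cumcount()
print(position.tolist())

# Rows at position 3 or later are the 4th, 5th, ... occurrence of a name.
toy.loc[position >= 3, "status"] = "old_employee"
print(toy)
```

Note that assigning through `.loc` like this leaves the unflagged rows as NaN rather than an empty string.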
Use GroupBy.cumcount together with numpy.where, and compare with Series.ge (greater than or equal):
import numpy as np
import pandas as pd
df['date'] = pd.to_datetime(df['date'])
# if rows are not already sorted by name, newest date first within each name
#df = df.sort_values(['name','date'], ascending=[True, False])
df['status'] = np.where(df.groupby('name').cumcount().ge(3), 'old_employee', '')
print(df)
date name number status
0 2021-05-06 07:35:03+00:00 mark 123
1 2021-04-06 07:35:03+00:00 mark 123
2 2021-03-03 07:35:03+00:00 mark 123
3 2021-02-03 07:35:03+00:00 mark 123 old_employee
4 2021-05-06 07:35:03+00:00 tom 4123
5 2021-04-06 07:35:03+00:00 tom 4123
6 2021-03-03 07:35:03+00:00 tom 4123
7 2021-02-06 07:35:03+00:00 john 512
8 2021-02-06 07:35:03+00:00 wood 512
9 2021-02-06 07:35:03+00:00 wood 512
10 2020-05-06 07:35:03+00:00 paul 723
11 2020-04-06 07:35:03+00:00 paul 723
12 2020-03-03 07:35:03+00:00 paul 723
13 2020-02-03 07:35:03+00:00 paul 723 old_employee
14 2020-02-03 05:35:03+00:00 paul 723 old_employee
15 2020-02-02 07:35:03+00:00 paul 723 old_employee
16 2020-02-01 07:35:03+00:00 paul 723 old_employee
17 2020-05-06 07:35:03+00:00 tomy 623
18 2020-04-06 07:35:03+00:00 tomy 623
19 2020-03-03 07:35:03+00:00 tomy 623
20 2020-02-03 07:35:03+00:00 tomy 623 old_employee
21 2020-02-03 05:35:03+00:00 tomy 623 old_employee
22 2020-02-02 07:35:03+00:00 tomy 623 old_employee
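The cumcount approach depends on the rows already being ordered newest-first within each name. If that ordering is not guaranteed, one possible variation is to compute the position on a date-sorted copy and let pandas align the result back by index, so the original row order is kept. This is only a sketch: the sample data below is hypothetical and deliberately shuffled, and `newest_first_pos` is an illustrative name.

```python
import pandas as pd

# Hypothetical sample data, shuffled so that row order carries no meaning.
df = pd.DataFrame({
    "date": pd.to_datetime([
        "2021-03-03T07:35:03.000Z",
        "2021-05-06T07:35:03.000Z",
        "2021-02-03T07:35:03.000Z",
        "2021-04-06T07:35:03.000Z",
        "2021-05-06T07:35:03.000Z",
        "2021-04-06T07:35:03.000Z",
    ]),
    "name": ["mark", "mark", "mark", "mark", "tom", "tom"],
    "number": [123, 123, 123, 123, 4123, 4123],
})

# Position of each row within its name group when dates are ordered newest-first.
# The result keeps the original index labels, so it can be aligned back to df.
newest_first_pos = (
    df.sort_values("date", ascending=False)
      .groupby("name")
      .cumcount()
)

# Assigning a Series aligns on the index, so df's row order is unchanged.
df["status"] = newest_first_pos.ge(3).map({True: "old_employee", False: ""})
print(df)
```

Here only the mark row dated 2021-02-03 is flagged, because it is the fourth-newest record for that name; tom has fewer than four records and stays blank.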