pandas: find the first occurrence of an event in a df based on a column value and mark it as a new column value

Question

I have a dataframe which looks like this:

customer_id event_date data 
1           2012-10-18    0      
1           2012-10-12    0      
1           2015-10-12    0      
2           2012-09-02    0      
2           2013-09-12    1      
3           2010-10-21    0      
3           2013-11-08    0      
3           2013-12-07    1     
3           2015-09-12    1

I wish to add additional columns, such as 'flag_1' and 'flag_2' below, which would allow me (and others, when I pass on the amended data) to filter easily.

Flag_1 indicates the first appearance of that customer in the data set. I have implemented this successfully by sorting: dta.sort_values(['customer_id','event_date']) and then using: (~dta.duplicated(['customer_id'])).astype(int) (negated, because duplicated marks the repeats rather than the first rows).
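A minimal sketch of this flag_1 step, assuming the sample data above is loaded into a DataFrame named dta (the construction below is illustrative only). Because duplicated returns True for repeats and False for the first row of each customer_id, the result is negated so that the first row gets 1.

```python
import pandas as pd

# Sample data from the question, rebuilt here for illustration.
dta = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "event_date": pd.to_datetime([
        "2012-10-18", "2012-10-12", "2015-10-12",
        "2012-09-02", "2013-09-12",
        "2010-10-21", "2013-11-08", "2013-12-07", "2015-09-12",
    ]),
    "data": [0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# sort_values returns a new DataFrame, so the result is assigned back.
dta = dta.sort_values(["customer_id", "event_date"])

# duplicated() is False on the first row per customer_id and True on
# repeats; negating it puts 1 on each customer's first (earliest) row.
dta["flag_1"] = (~dta.duplicated(["customer_id"])).astype(int)
```

Once the rows are sorted by date, the flagged row for customer 1 is the 2012-10-12 event, since that is the earliest date for that customer.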

Flag_2 would indicate, for each customer, the first row where the column 'data' equals 1.

An example of what the additional columns would look like is shown below:

customer_id event_date data flag_1 flag_2
1           2012-10-18    0      1      0
1           2012-10-12    0      0      0
1           2015-10-12    0      0      0
2           2012-09-02    0      1      0
2           2013-09-12    1      0      1
3           2010-10-21    0      1      0
3           2013-11-08    0      0      0
3           2013-12-07    1      0      1
3           2015-09-12    1      0      0

I am new to pandas and unsure how to implement the 'flag_2' column without iterating over the entire dataframe. I presume there is a quicker way using a built-in function, but I haven't found any posts on it.

Thanks

Answer 1

First initialize empty flags. Use groupby to get the groups based on customer_id. For the first flag, use loc to set flag1 on the first row of each group (first in the DataFrame's current row order). Use the same strategy for flag2, but first filter for rows where data equals 1.

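# Initialize both flags to 0
df['flag1'] = 0
df['flag2'] = 0

# Set flag1: .groups maps each customer_id to the index labels of its rows,
# so values[0] is the label of that customer's first row
groups = df.groupby('customer_id').groups
df.loc[[values[0] for values in groups.values()], 'flag1'] = 1

# Set flag2: same idea, restricted to rows where data == 1
groups2 = df.loc[df['data'] == 1].groupby('customer_id').groups
df.loc[[values[0] for values in groups2.values()], 'flag2'] = 1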

>>> df
   customer_id  event_date  data  flag1  flag2
0            1  2012-10-18     0      1      0
1            1  2012-10-12     0      0      0
2            1  2015-10-12     0      0      0
3            2  2012-09-02     0      1      0
4            2  2013-09-12     1      0      1
5            3  2010-10-21     0      1      0
6            3  2013-11-08     0      0      0
7            3  2013-12-07     1      0      1
8            3  2015-09-12     1      0      0
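As an alternative sketch (not part of the answer above), the same two flags can be computed without building the groups dictionary, using duplicated and drop_duplicates. It assumes the question's sample data with a default integer index; the DataFrame construction is illustrative. Like the answer, it treats "first" as first in the current row order, so sort by event_date beforehand if the earliest date is what matters.

```python
import pandas as pd

# Sample data from the question, rebuilt here for illustration.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "event_date": [
        "2012-10-18", "2012-10-12", "2015-10-12",
        "2012-09-02", "2013-09-12",
        "2010-10-21", "2013-11-08", "2013-12-07", "2015-09-12",
    ],
    "data": [0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# flag1: the first row per customer is the one that is not a repeat.
df["flag1"] = (~df.duplicated("customer_id")).astype(int)

# flag2: keep only rows with data == 1, then take the first such row
# per customer and mark its index label in the full DataFrame.
ones = df[df["data"] == 1]
first_one_idx = ones.drop_duplicates("customer_id").index
df["flag2"] = 0
df.loc[first_one_idx, "flag2"] = 1
```

On this sample the result matches the output shown above. Customers with no data == 1 rows (customer 1 here) simply keep flag2 at 0.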

Answer 1: 3, accepted, 2016-02-18 15:09:42
