pandas: finding first incidences of events in df based on column values and marking as new column values
I have a dataframe which looks like this:
customer_id event_date data
1 2012-10-18 0
1 2012-10-12 0
1 2015-10-12 0
2 2012-09-02 0
2 2013-09-12 1
3 2010-10-21 0
3 2013-11-08 0
3 2013-12-07 1
3 2015-09-12 1
I wish to add additional columns, such as 'flag_1' and 'flag_2' below, which allow me (and others, when I pass on the amended data) to filter easily.
Flag_1 is an indication of the first appearance of that customer in the data set. I have implemented this successfully by sorting:
dta = dta.sort_values(['customer_id','event_date'])
and then using: (~dta.duplicated(['customer_id'])).astype(int)
Flag_2 would be an indication of the first incidence for each customer where the column 'data' = 1.
An example of what the additional columns would look like is below:
customer_id event_date data flag_1 flag_2
1 2012-10-18 0 1 0
1 2012-10-12 0 0 0
1 2015-10-12 0 0 0
2 2012-09-02 0 1 0
2 2013-09-12 1 0 1
3 2010-10-21 0 1 0
3 2013-11-08 0 0 0
3 2013-12-07 1 0 1
3 2015-09-12 1 0 0
I am new to pandas and unsure how to implement the 'flag_2' column without iterating over the entire dataframe. I presume there is a quicker way to do this using a built-in function, but I haven't found any posts that cover it.
Thanks
First initialize empty flags. Use groupby to get the groups based on customer_id. For the first flag, use loc to set the value of flag1 for the first row in each group. Use the same strategy for flag2, but first filter for rows where data is equal to one.
# Initialize empty flags
df['flag1'] = 0
df['flag2'] = 0
# Set flag1
groups = df.groupby('customer_id').groups
df.loc[[values[0] for values in groups.values()], 'flag1'] = 1
# Set flag2
groups2 = df.loc[df.data == 1, :].groupby('customer_id').groups
df.loc[[values[0] for values in groups2.values()], 'flag2'] = 1
>>> df
customer_id event_date data flag1 flag2
0 1 2012-10-18 0 1 0
1 1 2012-10-12 0 0 0
2 1 2015-10-12 0 0 0
3 2 2012-09-02 0 1 0
4 2 2013-09-12 1 0 1
5 3 2010-10-21 0 1 0
6 3 2013-11-08 0 0 0
7 3 2013-12-07 1 0 1
8 3 2015-09-12 1 0 0
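As an alternative to collecting index labels from the groups, the same flags can probably be built with boolean masks only. The following is a minimal sketch, not the answer's method: it assumes the sample data from the question, swaps in duplicated and a per-customer cumulative sum, and the variable name is_one is purely illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "event_date": pd.to_datetime([
        "2012-10-18", "2012-10-12", "2015-10-12",
        "2012-09-02", "2013-09-12",
        "2010-10-21", "2013-11-08", "2013-12-07", "2015-09-12",
    ]),
    "data": [0, 0, 0, 0, 1, 0, 0, 1, 1],
})

# flag1: 1 on the first row seen for each customer, in the current row order.
df["flag1"] = (~df.duplicated("customer_id")).astype(int)

# flag2: 1 on the first row per customer where data == 1.
# is_one is an illustrative name; the running count of matches per customer
# equals 1 exactly from the first match until the next one.
is_one = df["data"].eq(1)
running = is_one.astype(int).groupby(df["customer_id"]).cumsum()
df["flag2"] = (is_one & running.eq(1)).astype(int)

print(df)
```

Both flags here depend on row order, as in the loc-based version. In the sample data, customer 1's row for 2012-10-18 comes before the row for 2012-10-12, so if "first" should mean the earliest event_date, sort with df.sort_values(['customer_id', 'event_date']) before computing the flags.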