Find total number of unique values in a column based on condition in another column

I need to count the number of unique sender_id values that have a sale in the ad_type column. The ad_type column has three values: rental, sharing and sale.

The count is subject to two conditions:

  1. The sender_id must have recorded other values before sale appears in the ad_type column in order to be counted, e.g. rental, rental, sale.
  2. If the sender_id has only sale recorded, with no other values before it (e.g. just sale), it should not be counted.

To achieve this, I was thinking that I could tag the rows that meet the conditions in a new column and then simply sum over that column.

This is what I have tried in order to tag the rows.

Example df:

sender_id     reply_date    ad_type     
1234          2016-05-16    sharing
1234          2017-06-20    sale
3333          2016-05-16    rental
3333          2016-06-20    sale
3333          2016-06-21    sale
6767          2016-05-16    sale
0101          2016-04-16    sale
0101          2016-04-17    sale
9999          2016-01-01    rental
9999          2017-01-19    sharing
9999          2018-04-17    sale

I've tried where:

df['count'] = df['ad_type'].where(df['ad_type'] == 'sale')

And:

df['count'] = df.groupby(level=0)['ad_type'].transform(lambda x: x == 'sale')

The idea is that, if I can get the tagging right in this count column, I can count the unique sender_id values by counting how many yes entries the count column contains.

The resulting df should look like this:

sender_id     reply_date    ad_type    count    
1234          2016-05-16    sharing
1234          2017-06-20    sale       yes
3333          2016-05-16    rental
3333          2016-06-20    sale
3333          2016-06-21    sale       yes
6767          2016-05-16    sale
0101          2016-04-16    sale
0101          2016-04-17    sale
9999          2016-01-01    rental
9999          2017-01-19    sharing
9999          2018-04-17    sale       yes

I would appreciate some guidance on what seems to be a complicated task for me.

Use numpy.where, chaining three boolean masks with & (bitwise AND):

import numpy as np

# rows where the ad type is 'sale'
m = df['ad_type'] == 'sale'
# sender_ids that have at least one row before their first sale
vals = df.loc[m.groupby(df['sender_id']).cumsum() == 0, 'sender_id'].unique()
m1 = df['sender_id'].isin(vals)
# among the sale rows, keep only the last one per sender_id
m2 = ~df.loc[m, 'sender_id'].duplicated(keep='last').reindex(df.index, fill_value=False)

# tag the last sale of each qualifying sender_id
df['count'] = np.where(m & m1 & m2, 'yes', '')
print(df)
    sender_id  reply_date  ad_type count
0        1234  2016-05-16  sharing      
1        1234  2017-06-20     sale   yes
2        3333  2016-05-16   rental      
3        3333  2016-06-20     sale      
4        3333  2016-06-21     sale   yes
5        6767  2016-05-16     sale      
6         101  2016-04-16     sale      
7         101  2016-04-17     sale      
8        9999  2016-01-01   rental      
9        9999  2017-01-19  sharing      
10       9999  2018-04-17     sale   yes
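
If only the final number is needed, the tagging column can be skipped. The following is a minimal sketch, not part of the original answer, that counts the qualifying sender_id values directly. It assumes that "before" means an earlier reply_date for the same sender_id, and it stores sender_id as strings so that the leading zero in 0101 is preserved. The function name count_senders_with_prior_activity is made up for illustration.

```python
import pandas as pd

# Sample data from the question; sender_id kept as strings (assumption)
df = pd.DataFrame({
    "sender_id": ["1234", "1234", "3333", "3333", "3333", "6767",
                  "0101", "0101", "9999", "9999", "9999"],
    "reply_date": pd.to_datetime([
        "2016-05-16", "2017-06-20", "2016-05-16", "2016-06-20",
        "2016-06-21", "2016-05-16", "2016-04-16", "2016-04-17",
        "2016-01-01", "2017-01-19", "2018-04-17",
    ]),
    "ad_type": ["sharing", "sale", "rental", "sale", "sale", "sale",
                "sale", "sale", "rental", "sharing", "sale"],
})


def count_senders_with_prior_activity(frame: pd.DataFrame) -> int:
    """Count sender_ids that have a sale preceded by at least one non-sale row."""
    # Order rows chronologically within each sender (assumed meaning of "before")
    frame = frame.sort_values(["sender_id", "reply_date"])
    is_sale = frame["ad_type"].eq("sale")
    by_sender = is_sale.groupby(frame["sender_id"])
    # Rows that come before the first sale: running sale count is still zero
    before_first_sale = by_sender.cumsum().eq(0)
    # Senders that have at least one sale at all
    has_sale = by_sender.transform("any")
    return frame.loc[before_first_sale & has_sale, "sender_id"].nunique()


print(count_senders_with_prior_activity(df))
```

With the tagged column from the answer above, the same number should also be obtainable as df.loc[df['count'] == 'yes', 'sender_id'].nunique(), since at most one row per sender_id is tagged.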
