Find the total number of unique values in one column based on conditions in another column

Question

I need to count the number of unique sender_id values that have sale in the ad_type column. The ad_type column has three values: rental, sharing and sale.

The count is subject to a few conditions:

A sender_id is included in the count only if it has other values recorded before sale appears in the ad_type column, e.g. rental, rental, sale.
A sender_id that has only sale recorded, with no other values before it, e.g. sale, should not be included in the count.

To achieve this, I was thinking that I could tag the rows that meet the conditions in a new column and then simply sum that column.

This is what I have tried in order to tag the rows.

Example df:

sender_id     reply_date    ad_type     
1234          2016-05-16    sharing
1234          2017-06-20    sale
3333          2016-05-16    rental
3333          2016-06-20    sale
3333          2016-06-21    sale
6767          2016-05-16    sale
0101          2016-04-16    sale
0101          2016-04-17    sale
9999          2016-01-01    rental
9999          2017-01-19    sharing
9999          2018-04-17    sale

I've tried where:

df['count'] = df['ad_type'].where(df['ad_type'] == 'sale')

And:

df['count'] = df.groupby(level=0)['ad_type'].transform(lambda x: x == 'sale')

The idea is that, if I can get the tagging right in this count column, I can count the unique sender_id values by counting how many yes values the count column contains.

The resulting df should look like this:

sender_id     reply_date    ad_type    count    
1234          2016-05-16    sharing
1234          2017-06-20    sale       yes
3333          2016-05-16    rental
3333          2016-06-20    sale
3333          2016-06-21    sale       yes
6767          2016-05-16    sale
0101          2016-04-16    sale
0101          2016-04-17    sale
9999          2016-01-01    rental
9999          2017-01-19    sharing
9999          2018-04-17    sale       yes

I would appreciate some guidance on what seems to me to be a complicated task.

Answer 1

Use numpy.where with three boolean masks chained by & (bitwise AND):

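import numpy as np

# assumes rows are already ordered by reply_date within each sender_id
# True on rows whose ad_type is 'sale'
m = df['ad_type'] == 'sale'
# sender_ids with at least one row before their first sale
# (the per-sender running count of sales is still 0 on those rows)
vals = df.loc[m.groupby(df['sender_id']).cumsum() == 0, 'sender_id'].unique()
m1 = df['sender_id'].isin(vals)
# True on the last sale row of each sender; non-sale rows are also True here,
# but they are filtered out by m below
m2 = ~df.loc[m, 'sender_id'].duplicated(keep='last').reindex(df.index, fill_value=False)

# tag only the last sale of each qualifying sender
df['count'] = np.where(m & m1 & m2, 'yes', '')
print(df)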
    sender_id  reply_date  ad_type count
0        1234  2016-05-16  sharing      
1        1234  2017-06-20     sale   yes
2        3333  2016-05-16   rental      
3        3333  2016-06-20     sale      
4        3333  2016-06-21     sale   yes
5        6767  2016-05-16     sale      
6         101  2016-04-16     sale      
7         101  2016-04-17     sale      
8        9999  2016-01-01   rental      
9        9999  2017-01-19  sharing      
10       9999  2018-04-17     sale   yes
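
The tagged column above is still one step short of the number the question asks for. With the tags in place, (df['count'] == 'yes').sum() should give it, since each qualifying sender_id is tagged exactly once. As an alternative, here is a minimal sketch that computes the count directly without the helper column. It rebuilds the sample data from the question, stores sender_id as a string so that 0101 keeps its leading zero (an assumption about the dtype), and sorts by reply_date so that "before" means an earlier date. The names sales_so_far, has_earlier_other, has_sale and qualifying are introduced only for illustration.

```python
import pandas as pd

# Sample data copied from the question; sender_id as string is an assumption.
df = pd.DataFrame({
    "sender_id": ["1234", "1234", "3333", "3333", "3333", "6767",
                  "0101", "0101", "9999", "9999", "9999"],
    "reply_date": pd.to_datetime([
        "2016-05-16", "2017-06-20", "2016-05-16", "2016-06-20",
        "2016-06-21", "2016-05-16", "2016-04-16", "2016-04-17",
        "2016-01-01", "2017-01-19", "2018-04-17"]),
    "ad_type": ["sharing", "sale", "rental", "sale", "sale", "sale",
                "sale", "sale", "rental", "sharing", "sale"],
})

# Order rows chronologically within each sender.
df = df.sort_values(["sender_id", "reply_date"])

is_sale = df["ad_type"] == "sale"
# Running number of sales seen so far per sender (0 means no sale yet).
sales_so_far = is_sale.astype(int).groupby(df["sender_id"]).cumsum()

# Senders with at least one non-sale row before their first sale.
has_earlier_other = set(df.loc[sales_so_far == 0, "sender_id"])
# Senders with at least one sale.
has_sale = set(df.loc[is_sale, "sender_id"])

# A sender qualifies only if both conditions hold.
qualifying = has_earlier_other & has_sale
n_unique = len(qualifying)
print(sorted(qualifying), n_unique)
```

For the sample data this should select 1234, 3333 and 9999, matching the three rows tagged yes above. Intersecting with has_sale matters for data where a sender has only rental or sharing rows and never a sale; such a sender has rows with a running sale count of 0 but should not be counted.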

Answer 1 (accepted, 2019-01-18 11:23:15)
