Find total number of unique values in a column based on condition in another column
I need to count the number of unique sender_id values that have sale in the ad_type column. The ad_type column has three values: rental, sharing and sale.
This counting is associated with a few conditions:
To achieve this, I was thinking that I could tag the rows that meet the conditions in a new column and then just use sum on that column.
This is what I have tried in order to tag the rows.
Example df:
sender_id reply_date ad_type
1234 2016-05-16 sharing
1234 2017-06-20 sale
3333 2016-05-16 rental
3333 2016-06-20 sale
3333 2016-06-21 sale
6767 2016-05-16 sale
0101 2016-04-16 sale
0101 2016-04-17 sale
9999 2016-01-01 rental
9999 2017-01-19 sharing
9999 2018-04-17 sale
I've tried where:
df['count'] = df['ad_type'].where(df['ad_type'] == 'sale')
And:
df['count'] = df.groupby(level=0)['ad_type'].transform(lambda x: x == 'sale')
The idea is that, if I can get this tagging right in the count column, then I can count the unique sender_id values by counting how many yes entries the count column contains.
Based on this attempt, the resulting df should look like this:
sender_id reply_date ad_type count
1234 2016-05-16 sharing
1234 2017-06-20 sale yes
3333 2016-05-16 rental
3333 2016-06-20 sale
3333 2016-06-21 sale yes
6767 2016-05-16 sale
0101 2016-04-16 sale
0101 2016-04-17 sale
9999 2016-01-01 rental
9999 2017-01-19 sharing
9999 2018-04-17 sale yes
I would appreciate some guidance on what seems to be a complicated task for me.
Use numpy.where with three boolean masks chained by & (bitwise AND):
import numpy as np

# rows where the ad is a sale
m = df['ad_type'] == 'sale'
# sender_ids that have at least one row before their first sale
vals = df.loc[m.groupby(df['sender_id']).cumsum() == 0, 'sender_id'].unique()
m1 = df['sender_id'].isin(vals)
# keep only the last sale row per sender_id (earlier duplicate sales are dropped)
m2 = ~df.loc[m, 'sender_id'].duplicated(keep='last').reindex(df.index, fill_value=False)
# tag rows that satisfy all three masks
df['count'] = np.where(m & m1 & m2, 'yes', '')
print(df)
sender_id reply_date ad_type count
0 1234 2016-05-16 sharing
1 1234 2017-06-20 sale yes
2 3333 2016-05-16 rental
3 3333 2016-06-20 sale
4 3333 2016-06-21 sale yes
5 6767 2016-05-16 sale
6 101 2016-04-16 sale
7 101 2016-04-17 sale
8 9999 2016-01-01 rental
9 9999 2017-01-19 sharing
10 9999 2018-04-17 sale yes
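With the rows tagged as above, the count the question asks for is the number of distinct sender_id values among the yes rows, for example df.loc[df['count'] == 'yes', 'sender_id'].nunique(), which should give 3 for the sample data. If the tag column is not needed, the same number can probably be computed per sender_id directly. The following is only a sketch: the question's list of conditions is missing, so it assumes (going by the comments in the code above) that a sender_id qualifies when it has at least one sale and its earliest row is not a sale. The names first_is_not_sale, has_sale and qualifying are made up for illustration.

```python
import pandas as pd

# Sample data from the question; sender_id is kept as a string so "0101" keeps its leading zero.
df = pd.DataFrame({
    "sender_id": ["1234", "1234", "3333", "3333", "3333", "6767",
                  "0101", "0101", "9999", "9999", "9999"],
    "reply_date": pd.to_datetime([
        "2016-05-16", "2017-06-20", "2016-05-16", "2016-06-20", "2016-06-21",
        "2016-05-16", "2016-04-16", "2016-04-17", "2016-01-01", "2017-01-19",
        "2018-04-17",
    ]),
    "ad_type": ["sharing", "sale", "rental", "sale", "sale", "sale",
                "sale", "sale", "rental", "sharing", "sale"],
})

# Order each sender's rows chronologically, then group by sender_id.
grouped = df.sort_values(["sender_id", "reply_date"]).groupby("sender_id")["ad_type"]

# Assumed condition 1: the sender's earliest row is not a sale.
first_is_not_sale = grouped.first() != "sale"
# Assumed condition 2: the sender has at least one sale.
has_sale = grouped.agg(lambda s: s.eq("sale").any())

# Both Series are indexed by sender_id, so they align for the bitwise AND.
qualifying = first_is_not_sale & has_sale
n_unique = int(qualifying.sum())
print(n_unique)
```

For the sample data this selects 1234, 3333 and 9999, the same sender_id values that carry a yes tag in the output above.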