简体   繁体   中英

How to count Pandas df elements with dynamic condition per row (=countif)

I am tyring to do some equivalent of COUNTIF in Pandas. I am trying to get my head around doing it with groupby , but I am struggling because my logical grouping condition is dynamic.

Say I have a list of customers, and the day on which they visited. I want to identify new customers based on 2 logical conditions

  • They must be the same customer (same Guest ID )
  • They must have been there on the previous day

If both conditions are met, they are a returning customer. If not, they are new (Hence newby = 1-... to identify new customers.

I managed to do this with a for loop, but obviously performance is terrible and this goes pretty much against the logic of Pandas.

How can I wrap the following code into something smarter than a loop?

for i in range (0, len(df)):
    newby = 1-np.sum((df["Day"] == df.iloc[i]["Day"]-1) & (df["Guest ID"] == df.iloc[i]["Guest ID"]))

This post does not help, as the condition is static. I would like to avoid introducting "dummy columns", such as transposing the df , because I will have many categories (many customer names) and would like to build more complex logical statements. I do not want to run the risk of ending up with many auxiliary columns

I have the following input

df
   Day     Guest ID
0  3230    Tom
1  3230    Peter
2  3231    Tom
3  3232    Peter
4  3232    Peter

and expect this output

df
   Day     Guest ID      newby
0  3230    Tom           1
1  3230    Peter         1
2  3231    Tom           0
3  3232    Peter         1
4  3232    Peter         1

Note that elements 3 and 4 are not necessarily duplicates - given there might be additional, varying columns (such as their order).

Do:

# ensure the df is sorted by date 
df = df.sort_values('Day')

# group by customer and find the diff within each group
df['newby'] = (df.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)
print(df)

Output

    Day Guest ID  newby
0  3230      Tom      1
1  3230    Peter      1
2  3231      Tom      0
3  3232    Peter      1

UPDATE

If multiple visits are allowed per day, you could do:

# only keep unique visits per day
uniques = df.drop_duplicates()

# ensure the df is sorted by date
uniques = uniques.sort_values('Day')

# group by customer and find the diff within each group
uniques['newby'] = (uniques.groupby('Guest ID')['Day'].transform('diff').fillna(2) > 1).astype(int)

# merge the uniques visits back into the original df
res = df.merge(uniques, on=['Day', 'Guest ID'])

print(res)

Output

    Day Guest ID  newby
0  3230      Tom      1
1  3230    Peter      1
2  3231      Tom      0
3  3232    Peter      1
4  3232    Peter      1

As an alternative, without sorting or merging, you could do:

lookup = {(day + 1, guest) for day, guest in df[['Day', 'Guest ID']].value_counts().to_dict()}
df['newby'] = (~pd.MultiIndex.from_arrays([df['Day'], df['Guest ID']]).isin(lookup)).astype(int)
print(df)

Output

    Day Guest ID  newby
0  3230      Tom      1
1  3230    Peter      1
2  3231      Tom      0
3  3232    Peter      1
4  3232    Peter      1

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM