Find max number of consecutive months a customer has been active in pandas

I am trying to find the max number of consecutive months a customer is active with the store. Here is the data I have.

+-----+---------+--------------+--------+-------------+
| Id  | t_year  | t_month_prx  | Store  | diff_months |
+-----+---------+--------------+--------+-------------+
|  1  |   2021  |        10.0  | A001   |         1.0 |
|  1  |   2022  |         1.0  | A001   |         1.0 |
|  1  |   2022  |         2.0  | A001   |         1.0 |
|  2  |   2021  |         1.0  | A001   |         1.0 |
|  2  |   2021  |         2.0  | A001   |         1.0 |
|  2  |   2021  |         3.0  | A001   |         1.0 |
|  2  |   2021  |         6.0  | A001   |         3.0 |
|  2  |   2021  |         7.0  | A001   |         1.0 |
|  2  |   2021  |         8.0  | A001   |         1.0 |
|  2  |   2021  |         9.0  | A001   |         1.0 |
|  2  |   2021  |        10.0  | A001   |         1.0 |
|  2  |   2022  |         1.0  | A001   |         1.0 |
|  2  |   2022  |         2.0  | A001   |         1.0 |
|  2  |   2021  |         1.0  | A002   |         1.0 |
|  2  |   2021  |         2.0  | A002   |         1.0 |
|  2  |   2021  |         3.0  | A002   |         1.0 |
|  2  |   2021  |         6.0  | A002   |         3.0 |
|  2  |   2021  |         7.0  | A002   |         1.0 |
|  2  |   2021  |         8.0  | A002   |         1.0 |
|  2  |   2021  |         9.0  | A002   |         1.0 |
|  2  |   2021  |        10.0  | A002   |         1.0 |
|  3  |   2021  |        10.0  | A002   |         1.0 |
|  3  |   2022  |         1.0  | A002   |         1.0 |
|  3  |   2022  |         2.0  | A002   |         1.0 |
+-----+---------+--------------+--------+-------------+

The original problem involved skipping two months of each year, so I created a month proxy: instead of 12 months, a year has 10 months.
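For anyone who wants to reproduce the problem, the table above can be rebuilt roughly as follows. This is only a sketch: the row values are copied from the table, the `rows` helper name is my own, and `diff_months` is left out because it is derived later.

```python
import pandas as pd

# Rows copied from the table above as (Id, t_year, t_month_prx, Store).
# diff_months is omitted on purpose: it is computed from these columns.
rows = (
    [(1, 2021, 10, 'A001'), (1, 2022, 1, 'A001'), (1, 2022, 2, 'A001')]
    + [(2, 2021, m, 'A001') for m in (1, 2, 3, 6, 7, 8, 9, 10)]
    + [(2, 2022, 1, 'A001'), (2, 2022, 2, 'A001')]
    + [(2, 2021, m, 'A002') for m in (1, 2, 3, 6, 7, 8, 9, 10)]
    + [(3, 2021, 10, 'A002'), (3, 2022, 1, 'A002'), (3, 2022, 2, 'A002')]
)
df = pd.DataFrame(rows, columns=['Id', 't_year', 't_month_prx', 'Store'])
# The table shows the month proxy as floats, so match that dtype.
df['t_month_prx'] = df['t_month_prx'].astype(float)
```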

What I have tried so far:

df = df.sort_values(by=['Id','t_year','t_month_prx'], ascending = True).reset_index(drop=True)
df['diff_months'] = df.groupby(['Id', 't_year'])['t_month_prx'].diff()
df['diff_months'] = df['diff_months'].bfill()

and I get this result:

df_result = pd.DataFrame({
    'Id': {0: 1, 1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2, 10: 2, 11: 2, 12: 2},
    't_year': {0: 2021, 1: 2022, 2: 2022, 3: 2021, 4: 2021, 5: 2021, 6: 2021, 7: 2021, 8: 2021, 9: 2021, 10: 2021, 11: 2022, 12: 2022},
    't_month_prx': {0: 10.0, 1: 1.0, 2: 2.0, 3: 1.0, 4: 2.0, 5: 3.0, 6: 6.0, 7: 7.0, 8: 8.0, 9: 9.0, 10: 10.0, 11: 1.0, 12: 2.0},
    'diff_months': {0: 1.0, 1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0, 5: 1.0, 6: 3.0, 7: 1.0, 8: 1.0, 9: 1.0, 10: 1.0, 11: 1.0, 12: 1.0}
})

Then, finally, I tried to count all consecutive 1s:

df.groupby([df['Id'], df['diff_months'].ne(df.groupby('Id')['diff_months'].shift(1)).cumsum()])['diff_months'].sum().groupby(level=0).max().reset_index(name='counts')

It gives me the following result:

pd.DataFrame({
'Id': {0: 1,1: 2},
'counts': {0: 3.0,1: 6.0}
})

but desired output is:

pd.DataFrame({'Id': [1,2, 2, 3], 'Store': ['A001','A001', 'A002', 'A002'], 'counts': [3, 7, 5, 3]})

which looks like:

   Id Store  counts
0   1  A001       3
1   2  A001       7
2   2  A002       5
3   3  A002       3

So for the second customer at store A001 the result should be 7 months, but because I am counting only the 1s, the row with a difference of 3 is skipped. Similarly, there can be several shorter sequences of 1s, in which case I need the longest one. Is my approach sound? Any idea how I can count consecutive months that span different years?

You could first create groups using groupby + diff. Within a given year, two rows are "consecutive" if the month difference is 1; across a year boundary, the month difference has to be -9, because the proxy year has 10 months. Then use those groups in another groupby + size to find the length of each consecutive run, and finally apply yet another groupby + max to find the longest run per Id and Store.

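cols = ['Id', 'Store']
# diff() compares each row with the previous row of the same group,
# so the rows are assumed to be in chronological order within each (Id, Store)
g = df.groupby(cols)
month_diff = g['t_month_prx'].diff()
year_diff = g['t_year'].diff()
# True where a row does not directly follow the previous month, i.e. where a new run starts
nonconsecutive = ~((year_diff.eq(0) & month_diff.eq(1)) | (year_diff.eq(1) & month_diff.eq(-9)))
out = df.groupby([*cols, nonconsecutive.cumsum()]).size().droplevel(-1).groupby(cols).max().reset_index(name='counts')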

Output:

   Id Store  counts
0   1  A001       3
1   2  A001       7
2   2  A002       5
3   3  A002       3
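The two conditions above (a difference of 1 within a year, -9 across a year boundary) can also be collapsed into one by turning year and proxy month into a single running month number. The following is a sketch of that variation, not the code from the answer: `max_consecutive` and `MONTHS_PER_YEAR` are names I made up, and it assumes at most one row per customer, store and month.

```python
import pandas as pd

MONTHS_PER_YEAR = 10  # assumption: the 10-month proxy calendar from the question


def max_consecutive(df, months_per_year=MONTHS_PER_YEAR):
    """Longest run of consecutive proxy months per (Id, Store)."""
    cols = ['Id', 'Store']
    df = df.sort_values([*cols, 't_year', 't_month_prx'])
    # One running month number: consecutive months always differ by exactly 1,
    # including the step from month 10 of one year to month 1 of the next.
    idx = df['t_year'] * months_per_year + df['t_month_prx']
    # A row starts a new run when it does not directly follow the previous row
    # of the same (Id, Store); the first row of each group has a NaN diff.
    new_run = idx.groupby([df[c] for c in cols]).diff().ne(1)
    run_id = new_run.cumsum()
    return (df.groupby([*cols, run_id]).size()
              .groupby(level=cols).max()
              .reset_index(name='counts'))


# Customer 2 at store A001, copied from the question's table.
sample = pd.DataFrame({
    'Id': [2] * 10,
    'Store': ['A001'] * 10,
    't_year': [2021] * 8 + [2022] * 2,
    't_month_prx': [1, 2, 3, 6, 7, 8, 9, 10, 1, 2],
})
result = max_consecutive(sample)
print(result)
```

For this sample the runs are months 1 to 3 of 2021 and then month 6 of 2021 through month 2 of 2022, so `counts` comes out as 7. With a real 12-month calendar the same idea should work with `months_per_year=12`.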

This code works with the given example dataset:

df = df.sort_values(['Id','t_year','t_month_prx'])

gr = (df.groupby(['Store','Id']).
      apply(lambda x: (~x['t_month_prx'].diff().isin([1,-9])).cumsum()).
      reset_index(name='num'))
res = (gr.groupby(['Store','Id']).
       apply(lambda x: x.num.value_counts().max()).
       reset_index(name='counts'))

print(res)
'''
  Store  Id  counts
0  A001   1       3
1  A001   2       7
2  A002   2       5
3  A002   3       3
'''
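One thing to keep in mind with a month-only check such as `isin([1, -9])`: it never looks at `t_year`, so it relies on the data having no gaps that happen to produce a difference of 1 or -9 across years. A small sketch with a made-up customer (not from the question's data) shows the difference between the month-only test and a test that also compares years:

```python
import pandas as pd

# Hypothetical customer: active in proxy month 10 of 2021,
# then not again until proxy month 1 of 2023.
gap = pd.DataFrame({
    't_year': [2021, 2023],
    't_month_prx': [10, 1],
})

month_diff = gap['t_month_prx'].diff()
year_diff = gap['t_year'].diff()

# Month-only test: -9 looks like a normal year rollover,
# so the second row is NOT flagged as the start of a new run.
month_only_break = ~month_diff.isin([1, -9])

# Month and year test: the year jumps by 2, so the second row IS flagged.
with_year_break = ~((year_diff.eq(0) & month_diff.eq(1))
                    | (year_diff.eq(1) & month_diff.eq(-9)))

print(month_only_break.tolist())  # [True, False]
print(with_year_break.tolist())   # [True, True]
```

The same applies to, say, month 1 of 2021 followed by month 2 of 2022. If such gaps can occur in the real data, comparing the year as well is the safer choice.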
