简体   繁体   中英

Pandas - count since last transaction

I have a dataframe (call it txn_df ) that contains monetary transaction records, here are the significant columns in this problem:

txn_year    txn_month   custid  withdraw    deposit
2011        4           123     0.0         100.0
2011        5           123     0.0         0.0
2011        6           123     0.0         0.0
2011        7           123     50.1        0.0
2011        8           123     0.0         0.0

Assume also that we have multiple customers here. withdraw and deposit 0.0 value for both means no transaction has taken place. What I want to do is to produce a new column that indicates how many months has occurred since there was a transaction. Something similar to this:

txn_year    txn_month   custid  withdraw    deposit     num_months_since_last_txn
2011        4           123     0.0         100.0       0
2011        5           123     0.0         0.0         1
2011        6           123     0.0         0.0         2           
2011        7           123     50.1        0.0         3
2011        8           123     0.0         0.0         1

The only solution so far that I can think of is to produce a new column has_txn (which is either 1/0 or True/False) when either one of withdraw and deposit has value > 0.0 but I can't continue from there.

one way to solve this problem,

df['series'] =  df[['withdraw','deposit']].ne(0).sum(axis=1)
m = df['series']>=1

As @Chris A commented,

m = df[['withdraw','deposit']].gt(0).any(axis=1) #replacement for above snippet,

df['num_months_since_last_txn'] = df.groupby(m.cumsum()).cumcount()
df.loc[df['num_months_since_last_txn']==0,'num_months_since_last_txn']=(df['num_months_since_last_txn']+1).shift(1).fillna(0)
print df

Output:

   txn_year  txn_month  custid  withdraw  deposit
0      2011          4     123       0.0    100.0
1      2011          5     123       0.0      0.0
2      2011          6     123       0.0      0.0
3      2011          7     123      50.1      0.0
4      2011          8     123       0.0      0.0
   txn_year  txn_month  custid  withdraw  deposit  num_months_since_last_txn
0      2011          4     123       0.0    100.0                        0.0
1      2011          5     123       0.0      0.0                        1.0
2      2011          6     123       0.0      0.0                        2.0
3      2011          7     123      50.1      0.0                        3.0
4      2011          8     123       0.0      0.0                        1.0

Explanation:

  1. To get transaction happened or not use ne and sum to get values in binary.
  2. when transaction is 1 create the series from 0,1,2...n using groupby , cumsum , cumcount .
  3. rearrange the value for 0 using .loc

Note: May be I have added more complex to solving this problem. But It will give you an idea and approach to solve this problem.

Solution for considering customer Id,

df=df.sort_values(by=['custid','txn_month'])
mask=~df.duplicated(subset=['custid'],keep='first')
m = df[['withdraw','deposit']].gt(0).any(axis=1)
df['num_months_since_last_txn'] = df.groupby(m.cumsum()).cumcount()
df.loc[df['num_months_since_last_txn']==0,'num_months_since_last_txn']=(df['num_months_since_last_txn']+1).shift(1)
df.loc[mask,'num_months_since_last_txn']=0

Sample Input:

   txn_year  txn_month  custid  withdraw  deposit
0      2011          4     123       0.0    100.0
1      2011          5     123       0.0      0.0
2      2011          4    1245       0.0    100.0
3      2011          5    1245       0.0      0.0
4      2011          6     123       0.0      0.0
5      2011          7    1245      50.1      0.0
6      2011          7     123      50.1      0.0
7      2011          8     123       0.0      0.0
8      2011          6    1245       0.0      0.0
9      2011          8    1245       0.0      0.0

Sample Output:

   txn_year  txn_month  custid  withdraw  deposit  num_months_since_last_txn
0      2011          4     123       0.0    100.0                        0.0
1      2011          5     123       0.0      0.0                        1.0
4      2011          6     123       0.0      0.0                        2.0
6      2011          7     123      50.1      0.0                        3.0
7      2011          8     123       0.0      0.0                        1.0
2      2011          4    1245       0.0    100.0                        0.0
3      2011          5    1245       0.0      0.0                        1.0
8      2011          6    1245       0.0      0.0                        2.0
5      2011          7    1245      50.1      0.0                        3.0
9      2011          8    1245       0.0      0.0                        1.0

Explanation for considering Customer ID,

  1. The above code works based on the interval between [1,1]. So to make the same format, sort df by cust_id and txn_month, For future you could add txn_year.
  2. fillna(0), won't be work here, Because shift will not create NaN for next customer. To reset into 0 Find duplication of customer Id and take first value replace that into 0.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM