I have a dataframe (call it txn_df) that contains monetary transaction records. Here are the significant columns for this problem:
txn_year txn_month custid withdraw deposit
2011 4 123 0.0 100.0
2011 5 123 0.0 0.0
2011 6 123 0.0 0.0
2011 7 123 50.1 0.0
2011 8 123 0.0 0.0
Assume also that we have multiple customers here. A 0.0 value for both withdraw and deposit means no transaction has taken place. What I want to do is produce a new column that indicates how many months have passed since the last transaction. Something similar to this:
txn_year txn_month custid withdraw deposit num_months_since_last_txn
2011 4 123 0.0 100.0 0
2011 5 123 0.0 0.0 1
2011 6 123 0.0 0.0 2
2011 7 123 50.1 0.0 3
2011 8 123 0.0 0.0 1
The only solution I can think of so far is to produce a new column has_txn (either 1/0 or True/False) that is set when either withdraw or deposit has a value > 0.0, but I can't continue from there.
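For reference, the has_txn step described above could look roughly like this. It is a minimal sketch on the sample rows from the question, and it only builds the flag; it does not yet compute the month counter.

```python
import pandas as pd

# Sample rows copied from the question; txn_df is the name used there.
txn_df = pd.DataFrame({
    "txn_year": [2011] * 5,
    "txn_month": [4, 5, 6, 7, 8],
    "custid": [123] * 5,
    "withdraw": [0.0, 0.0, 0.0, 50.1, 0.0],
    "deposit": [100.0, 0.0, 0.0, 0.0, 0.0],
})

# True when either amount is greater than 0.0, i.e. a transaction took place.
txn_df["has_txn"] = (txn_df["withdraw"] > 0.0) | (txn_df["deposit"] > 0.0)
print(txn_df)
```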
One way to solve this problem:
df['series'] = df[['withdraw','deposit']].ne(0).sum(axis=1)
m = df['series']>=1
As @Chris A commented, the two lines above can be replaced with a single mask:
m = df[['withdraw','deposit']].gt(0).any(axis=1)  # replacement for the snippet above
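# number the rows inside each block that starts at a transaction row
df['num_months_since_last_txn'] = df.groupby(m.cumsum()).cumcount()
# a transaction row has count 0; replace it with the previous row's count + 1
df.loc[df['num_months_since_last_txn']==0,'num_months_since_last_txn']=(df['num_months_since_last_txn']+1).shift(1).fillna(0)
print(df)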
Output:
txn_year txn_month custid withdraw deposit
0 2011 4 123 0.0 100.0
1 2011 5 123 0.0 0.0
2 2011 6 123 0.0 0.0
3 2011 7 123 50.1 0.0
4 2011 8 123 0.0 0.0
txn_year txn_month custid withdraw deposit num_months_since_last_txn
0 2011 4 123 0.0 100.0 0.0
1 2011 5 123 0.0 0.0 1.0
2 2011 6 123 0.0 0.0 2.0
3 2011 7 123 50.1 0.0 3.0
4 2011 8 123 0.0 0.0 1.0
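Explanation:
Use ne and sum to get a binary flag per row (transaction or not). Then groupby on the cumsum of that flag, followed by cumcount, numbers the rows since each transaction. Finally, the rows where the count is 0 (the transaction rows themselves) are overwritten using .loc with the previous row's count plus one.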
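To make the intermediate values concrete, here is a small sketch on the sample rows. The steps frame and the block name are only there for illustration and are not part of the solution itself.

```python
import pandas as pd

df = pd.DataFrame({
    "withdraw": [0.0, 0.0, 0.0, 50.1, 0.0],
    "deposit": [100.0, 0.0, 0.0, 0.0, 0.0],
})

# Flag rows that contain a transaction: [True, False, False, True, False]
m = df[["withdraw", "deposit"]].gt(0).any(axis=1)
# The running sum of the flag gives a block id per row: [1, 1, 1, 2, 2]
block = m.cumsum()
# Position of each row inside its block: [0, 1, 2, 0, 1]
count = df.groupby(block).cumcount()

steps = pd.DataFrame({"has_txn": m, "block": block, "cumcount": count})
print(steps)
```

The cumcount restarts at 0 on every transaction row, which is why the last step replaces those zeros with the previous row's count plus one.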
Note: I may have made this more complex than it needs to be, but it should give you an idea of how to approach the problem.
Solution that takes the customer ID into account:
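# sort so each customer's rows are contiguous and chronological
df=df.sort_values(by=['custid','txn_year','txn_month'])
# first row of each customer
mask=~df.duplicated(subset=['custid'],keep='first')
m = df[['withdraw','deposit']].gt(0).any(axis=1)
# group by customer as well, so a count never continues from the previous customer
df['num_months_since_last_txn'] = df.groupby(['custid', m.cumsum()]).cumcount()
df.loc[df['num_months_since_last_txn']==0,'num_months_since_last_txn']=(df['num_months_since_last_txn']+1).shift(1).fillna(0)
df.loc[mask,'num_months_since_last_txn']=0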
Sample Input:
txn_year txn_month custid withdraw deposit
0 2011 4 123 0.0 100.0
1 2011 5 123 0.0 0.0
2 2011 4 1245 0.0 100.0
3 2011 5 1245 0.0 0.0
4 2011 6 123 0.0 0.0
5 2011 7 1245 50.1 0.0
6 2011 7 123 50.1 0.0
7 2011 8 123 0.0 0.0
8 2011 6 1245 0.0 0.0
9 2011 8 1245 0.0 0.0
Sample Output:
txn_year txn_month custid withdraw deposit num_months_since_last_txn
0 2011 4 123 0.0 100.0 0.0
1 2011 5 123 0.0 0.0 1.0
4 2011 6 123 0.0 0.0 2.0
6 2011 7 123 50.1 0.0 3.0
7 2011 8 123 0.0 0.0 1.0
2 2011 4 1245 0.0 100.0 0.0
3 2011 5 1245 0.0 0.0 1.0
8 2011 6 1245 0.0 0.0 2.0
5 2011 7 1245 50.1 0.0 3.0
9 2011 8 1245 0.0 0.0 1.0
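Explanation for considering customer ID:
Sorting puts each customer's rows together in chronological order. mask flags the first row of each customer, and after the counter is computed as before, that first row is reset to 0 so a count never carries over from the previous customer.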
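As an alternative that keeps every step inside a single customer, here is a hedged sketch. months_since_last_txn is a hypothetical helper name, not a pandas function. It assumes one row per customer per month, so it counts rows rather than calendar gaps, and it sets rows with no earlier transaction to 0, which the question does not specify.

```python
import pandas as pd

def months_since_last_txn(df):
    """Hypothetical helper: rows since the previous transaction, per customer."""
    out = df.sort_values(["custid", "txn_year", "txn_month"]).copy()
    has_txn = out[["withdraw", "deposit"]].gt(0).any(axis=1)
    # Row position inside each customer: 0, 1, 2, ...
    pos = out.groupby("custid").cumcount()
    # Position of the latest transaction strictly before each row (NaN if none).
    last_txn_pos = pos.where(has_txn).groupby(out["custid"]).shift(1)
    last_txn_pos = last_txn_pos.groupby(out["custid"]).ffill()
    # Assumption: rows with no earlier transaction get 0.
    out["num_months_since_last_txn"] = (pos - last_txn_pos).fillna(0).astype(int)
    return out

# Sample input from the answer above.
sample = pd.DataFrame({
    "txn_year": [2011] * 10,
    "txn_month": [4, 5, 4, 5, 6, 7, 7, 8, 6, 8],
    "custid": [123, 123, 1245, 1245, 123, 1245, 123, 123, 1245, 1245],
    "withdraw": [0.0, 0.0, 0.0, 0.0, 0.0, 50.1, 50.1, 0.0, 0.0, 0.0],
    "deposit": [100.0, 0.0, 100.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
})
result = months_since_last_txn(sample)
print(result)
```

Because the shift and forward fill are grouped by custid, no value leaks from one customer to the next, so the separate first-row mask is not needed in this variant.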