I work in a financial organization. In our transactions table, we keep track of customers' balances only when they transact. For example, if a customer opened an account on the 1st of October with $200 and then withdrew $50 on the 8th of October, he will have just two entries in the transactions table, one for 2020/10/01 and the other for 2020/10/08. The focus of this question is on the closing balances. Using today as the cut-off date, the customer would have had a closing balance of $200 for 7 days (2020/10/08 - 2020/10/01) and $150 for the remaining 29 days.
I am not sure how to write this function. I have been running into errors, and I would appreciate it if anyone could help me out with the Python code and corresponding comments so that this becomes a valid learning experience for me.
This is a sample of the dataset that I have:
import pandas as pd

sample_df = pd.DataFrame({'ID': [15, 16, 15, 15, 16, 17, 17, 16],
                          'Calendar_Date': ['2020-10-10', '2020-10-12', '2020-10-12', '2020-10-22', '2020-10-28', '2020-10-30', '2020-11-03', '2020-11-04'],
                          'Closing_Balance': [10000, 3000, 6000, 5100, 14500, 25000, 13000, 9000]})
and this is the result that I expect:
result_df = pd.DataFrame({'ID':[15, 16, 17],
'Total_Days': [26, 24, 6],
'Average_Account_Balance': [5823.08, 6375.00, 19000]})
For clarity: This is how I arrived at the result_df:
When ID = 15, Total_Days = (2+10+15) = 27; Average_Account_Balance = ((10000 * 2) + (6000 * 10) + (5100 * 15))/27 = 156500/27 = 5796.30
When ID = 16, Total_Days = (16+7+2) = 25; Average_Account_Balance = ((3000 * 16) + (14500 * 7) + (9000 * 2))/25 = 167500/25 = 6700.00
When ID = 17, Total_Days = (4+3) = 7; Average_Account_Balance = ((25000 * 4) + (13000 * 3))/7 = 139000/7 = 19857.14
I need the solution to be computationally efficient because you can guess how many transactions we have in our DB. Please feel free to ask further questions if you are not clear on anything stated or implied here. Thank you!
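You can break this problem up into a few steps. First, we'll need to make some new columns in the dataframe:
1. Calculate the number of days between each transaction date and today's date ("days_from_today").
2. Within each "ID", take the difference of the previously calculated column between consecutive rows to get the number of days between transactions. Then use the fillna method to fill in the remaining date differences: diff gives us the difference between rows, but it misses the difference between the most recent date within an "ID" and today's date. This builds us a proper "days_between_transactions" column.
3. Multiply "Closing_Balance" by the newly created "days_between_transactions" column.
# Calendar_Date holds strings in the sample, so convert it to datetime before subtracting
sample_df["Calendar_Date"] = pd.to_datetime(sample_df["Calendar_Date"])
# Days from each transaction to the cut-off date (today, 2020-11-06)
sample_df["days_from_today"] = (pd.to_datetime("11/06/2020").normalize() - sample_df["Calendar_Date"]).dt.days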
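# Days until the next transaction within each ID (assumes rows are ordered by date within each ID).
# The last row of each ID has no next row, so fill it with its distance to today.
sample_df["days_between_transactions"] = (sample_df.groupby("ID")["days_from_today"]
                                          .diff(-1)
                                          .fillna(sample_df["days_from_today"])
                                          .astype(int))
# Weight each closing balance by the number of days it was held
sample_df["weighted_balance"] = sample_df["Closing_Balance"] * sample_df["days_between_transactions"]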
print(sample_df)
   ID  Calendar_Date  Closing_Balance  days_from_today  days_between_transactions  weighted_balance
0  15     2020-10-10            10000               27                          2             20000
1  16     2020-10-12             3000               25                         16             48000
2  15     2020-10-12             6000               25                         10             60000
3  15     2020-10-22             5100               15                         15             76500
4  16     2020-10-28            14500                9                          7            101500
5  17     2020-10-30            25000                7                          4            100000
6  17     2020-11-03            13000                3                          3             39000
7  16     2020-11-04             9000                2                          2             18000
Now that we've created our additional columns, we can perform a groupby -> aggregation operation to obtain the sum of our "weighted_balance" column and divide it by the max of "days_from_today" for each unique "ID":
aggregated_df = sample_df.groupby("ID").agg(
weighted_total_account_balance=("weighted_balance", "sum"),
total_days=("days_from_today", "max")
)
aggregated_df["average_account_balance"] = aggregated_df["weighted_total_account_balance"] / aggregated_df["total_days"]
print(aggregated_df)
    weighted_total_account_balance  total_days  average_account_balance
ID
15                          156500          27              5796.296296
16                          167500          25              6700.000000
17                          139000           7             19857.142857
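Dividing by the max of "days_from_today" works because the gaps within each "ID" telescope: added together, they equal the number of days from that ID's first transaction to the cut-off. Here is a small self-contained check of that, assuming the same sample data, a cut-off of 2020-11-06, and rows ordered by date within each ID (the names df, cutoff and check are only for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [15, 16, 15, 15, 16, 17, 17, 16],
    "Calendar_Date": pd.to_datetime([
        "2020-10-10", "2020-10-12", "2020-10-12", "2020-10-22",
        "2020-10-28", "2020-10-30", "2020-11-03", "2020-11-04",
    ]),
})

cutoff = pd.Timestamp("2020-11-06")  # assumed cut-off date
df["days_from_today"] = (cutoff - df["Calendar_Date"]).dt.days
df["days_between_transactions"] = (
    df.groupby("ID")["days_from_today"]
    .diff(-1)                        # current row minus the next row in the same ID
    .fillna(df["days_from_today"])   # last row per ID: gap runs to the cut-off
    .astype(int)
)

# Compare the summed gaps with the distance from the first transaction to the cut-off
check = df.groupby("ID").agg(
    summed_gaps=("days_between_transactions", "sum"),
    first_to_cutoff=("days_from_today", "max"),
)
print(check)
```

Both columns should agree for every ID, which is why the max can stand in for the total number of days.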
I have noticed that there are slight discrepancies between our results. I believe they may be due to differences in our timezones (today is 11/6/2020 for me; I'm not sure what day it is for you), so our "total_days" may differ.
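Since the result depends on which day counts as "today", it may help to wrap the steps in a function that takes the cut-off date explicitly. The following is only a sketch: average_balance and its as_of parameter are names I made up, and it assumes timezone-naive dates that all fall before the cut-off. It also sorts by ID and date first, because diff relies on row order.

```python
import pandas as pd

def average_balance(df, as_of):
    """Time-weighted average closing balance per ID up to `as_of` (hypothetical helper)."""
    as_of = pd.Timestamp(as_of).normalize()
    out = df.copy()  # leave the caller's dataframe untouched
    out["Calendar_Date"] = pd.to_datetime(out["Calendar_Date"])
    out = out.sort_values(["ID", "Calendar_Date"])
    out["days_from_cutoff"] = (as_of - out["Calendar_Date"]).dt.days
    # Gap to the next transaction within the ID; the last gap runs to the cut-off
    out["gap_days"] = (
        out.groupby("ID")["days_from_cutoff"].diff(-1).fillna(out["days_from_cutoff"])
    )
    out["weighted_balance"] = out["Closing_Balance"] * out["gap_days"]
    result = out.groupby("ID").agg(
        Total_Days=("days_from_cutoff", "max"),
        weighted_sum=("weighted_balance", "sum"),
    )
    result["Average_Account_Balance"] = result["weighted_sum"] / result["Total_Days"]
    return result.drop(columns="weighted_sum").reset_index()

sample_df = pd.DataFrame({
    "ID": [15, 16, 15, 15, 16, 17, 17, 16],
    "Calendar_Date": ["2020-10-10", "2020-10-12", "2020-10-12", "2020-10-22",
                      "2020-10-28", "2020-10-30", "2020-11-03", "2020-11-04"],
    "Closing_Balance": [10000, 3000, 6000, 5100, 14500, 25000, 13000, 9000],
})

latest = average_balance(sample_df, "2020-11-06")
day_earlier = average_balance(sample_df, "2020-11-05")
print(latest)
print(day_earlier)
```

Moving the cut-off back by one day lowers every Total_Days by one (26, 24, 6 instead of 27, 25, 7) and changes the averages with it, so it is worth agreeing on the cut-off before comparing numbers.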
Also, if your data is very large, I would recommend using DataFrame.eval to perform the arithmetic operations.
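As a rough illustration, the multiplication step could be written with DataFrame.eval like this. The three rows are the ID 15 values from the table above. eval takes the expression as a string and can use the numexpr engine when it is installed, but whether it is actually faster depends on the size of the data, so it is worth timing on your own tables.

```python
import pandas as pd

df = pd.DataFrame({
    "Closing_Balance": [10000, 6000, 5100],
    "days_between_transactions": [2, 10, 15],
})

# With the default inplace=False, eval returns a new dataframe with the assigned column
df = df.eval("weighted_balance = Closing_Balance * days_between_transactions")
print(df)
```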