
How do I write an efficient function to calculate the average closing balance for different accounts given a time period

I work in a financial organization. In our transactions table, we record a customer's balance only when they transact. For example, if a customer opens an account on the 1st of October with $200 and then withdraws $50 on the 8th of October, they will have just two entries in the transactions table, one for 2020/10/01 and the other for 2020/10/08. Now, the focus of this question is on the closing balances. Following this logic, if we use today as the cut-off date, you would agree that the customer had a closing balance of $200 for 7 days (2020/10/08 - 2020/10/01) and $150 for the remaining 29 days.
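
To make the weighting concrete, the day-weighted average for that example works out like this (assuming today is 2020/11/06, which is consistent with the 29 remaining days):

# $200 held for 7 days, then $150 held for the remaining 29 days
average_balance = (200 * 7 + 150 * 29) / (7 + 29)
print(round(average_balance, 2))  # 159.72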

Now, I am not sure how to write this function. I have been running into errors, and I would appreciate it if anyone could help me out with the Python code and corresponding comments so that this becomes a valid learning experience for me.

This is a sample of the dataset that I have:

import pandas as pd

sample_df = pd.DataFrame({'ID': [15, 16, 15, 15, 16, 17, 17, 16],
                          'Calendar_Date': ['2020-10-10', '2020-10-12', '2020-10-12', '2020-10-22', '2020-10-28', '2020-10-30', '2020-11-03', '2020-11-04'],
                          'Closing_Balance': [10000, 3000, 6000, 5100, 14500, 25000, 13000, 9000]})

and this is the result that I expect:

result_df = pd.DataFrame({'ID':[15, 16, 17],
                         'Total_Days': [26, 24, 6],
                         'Average_Account_Balance': [5823.08, 6375.00, 19000]})

For clarity: This is how I arrived at the result_df:

When ID = 15, Total_Days = (2 + 10 + 15) = 27; Average_Account_Balance = ((10000 * 2) + (6000 * 10) + (5100 * 15)) / 27 = 156500 / 27 = 5796.30

When ID = 16, Total_Days = (16 + 7 + 2) = 25; Average_Account_Balance = ((3000 * 16) + (14500 * 7) + (9000 * 2)) / 25 = 167500 / 25 = 6700.00

When ID = 17, Total_Days = (4 + 3) = 7; Average_Account_Balance = ((25000 * 4) + (13000 * 3)) / 7 = 139000 / 7 = 19857.14
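
As a quick sanity check of the ID = 15 figures above (a minimal sketch, assuming the same cut-off date):

# days each closing balance was in effect, and the balances themselves (ID = 15)
days = [2, 10, 15]
balances = [10000, 6000, 5100]
weighted_average = sum(d * b for d, b in zip(days, balances)) / sum(days)
print(sum(days), round(weighted_average, 2))  # 27 5796.3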

I need the solution to be computationally efficient because you can guess how many transactions we have in our DB. Please feel free to ask further questions if you are not clear on anything stated or implied here. Thank you!

You can break this problem up into a few steps. First, we'll need to make some new columns in the dataframe:

  1. Find the number of days from each date to the end-date (today in your example).
  2. Within each group of "ID", take the difference of the previously calculated column to get the number of days between transactions. Then use the fillna method to fill in the remaining date differences (e.g. diff only gives us the difference between rows, so we miss the gap between the most recent date within an "ID" and today's date). This gives us a proper "days between transactions" column.
  3. Calculate a weighted balance column: simply multiply "Closing_Balance" by the newly created "days between transactions" column.
sample_df["days_from_today"] = (pd.to_datetime("11/06/2020").normalize() - sample_df["Calendar_Date"]).dt.days

sample_df["days_between_transactions"] = (sample_df.groupby("ID")["days_from_today"]
                                          .diff(-1)
                                          .fillna(sample_df["days_from_today"])
                                          .astype(int))

sample_df["weighted_balance"] = sample_df["Closing_Balance"] * sample_df["days_between_transactions"]

print(sample_df)
   ID Calendar_Date  Closing_Balance  days_from_today  days_between_transactions  weighted_balance
0  15    2020-10-10            10000               27                          2             20000
1  16    2020-10-12             3000               25                         16             48000
2  15    2020-10-12             6000               25                         10             60000
3  15    2020-10-22             5100               15                         15             76500
4  16    2020-10-28            14500                9                          7            101500
5  17    2020-10-30            25000                7                          4            100000
6  17    2020-11-03            13000                3                          3             39000
7  16    2020-11-04             9000                2                          2             18000

Now that we've created our additional columns, we can perform a groupby -> aggregation operation: take the sum of the "weighted_balance" column and divide it by the max of the "days_from_today" column for each unique "ID".

aggregated_df = sample_df.groupby("ID").agg(
    weighted_total_account_balance=("weighted_balance", "sum"), 
    total_days=("days_from_today", "max")
)

aggregated_df["average_account_balance"] = aggregated_df["weighted_total_account_balance"] / aggregated_df["total_days"]

print(aggregated_df)
    weighted_total_account_balance  total_days  average_account_balance
ID                                                                     
15                          156500          27              5796.296296
16                          167500          25              6700.000000
17                          139000           7             19857.142857
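
If you want the output in the same shape as your result_df, a reset_index plus a rename would get you most of the way there (a minimal sketch, reusing the column names from your expected output):

result = (aggregated_df
          .reset_index()
          .rename(columns={"total_days": "Total_Days",
                           "average_account_balance": "Average_Account_Balance"})
          [["ID", "Total_Days", "Average_Account_Balance"]])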

I have noticed that there are slight discrepancies between our results; I believe this is due to a difference in cut-off dates or timezones (today is 11/6/2020 for me, and I'm not sure what time/day it is for you), so our "total_days" may be different.

Also, if your data is very large, I would recommend using DataFrame.eval to perform the arithmetic operations.
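
For example, the weighted-balance step could be expressed with eval like this (a rough sketch; the column names are the ones created above):

# DataFrame.eval delegates to numexpr when it is installed, which can be faster
# and more memory-efficient than plain column arithmetic on large frames
sample_df.eval("weighted_balance = Closing_Balance * days_between_transactions", inplace=True)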
