I want to replace the existing values in a column with the average of that same column, preferably using Python. Specifically, I want to spread the payments equally over every month from the first month of payment to the last. The average monthly payment should be calculated per cust_id and sub_id.
Payments may skip months, and the amounts vary.
I hope you can help me with this, as I am only beginning to learn Python.
The data looks like this:
cust_id | sub_id | date | payment |
---|---|---|---|
1 | A | 12/1/20 | 200 |
1 | A | 2/2/21 | 200 |
1 | A | 2/3/21 | 100 |
1 | A | 5/1/21 | 200 |
1 | B | 1/2/21 | 50 |
1 | B | 1/9/21 | 20 |
1 | B | 3/1/21 | 80 |
1 | B | 4/23/21 | 90 |
2 | C | 1/4/21 | 200 |
2 | C | 1/9/21 | 300 |
The result I want is this:
cust_id | sub_id | date | payment |
---|---|---|---|
1 | A | 12/1/20 | 116.67 |
1 | A | 1/1/21 | 116.67 |
1 | A | 2/1/21 | 116.67 |
1 | A | 3/1/21 | 116.67 |
1 | A | 4/1/21 | 116.67 |
1 | A | 5/1/21 | 116.67 |
1 | B | 1/1/21 | 60 |
1 | B | 2/1/21 | 60 |
1 | B | 3/1/21 | 60 |
1 | B | 4/1/21 | 60 |
2 | C | 1/1/21 | 500 |
Thank you very much!
As noted in the comments, your expected output for cust_id=2 and sub_id='C' appears to be inconsistent with your stated requirements, so I go by the requirements.
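For anyone who wants to follow along, here is one way the sample data from the question could be loaded. This is only a sketch: the variable name df is what the code in the answers expects, and the m/d/yy date format is an assumption based on values such as 4/23/21.

```python
import pandas as pd

# Sample data copied from the question's table.
df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
    'sub_id':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'date':    ['12/1/20', '2/2/21', '2/3/21', '5/1/21', '1/2/21',
                '1/9/21', '3/1/21', '4/23/21', '1/4/21', '1/9/21'],
    'payment': [200, 200, 100, 200, 50, 20, 80, 90, 200, 300],
})

# Parse the strings as month/day/two-digit-year (assumed format) so that
# min/max and resampling operate on real dates rather than on text.
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')
```

Leaving the dates as strings would make min and max compare them alphabetically, which is why the parsing step comes first.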
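First, we aggregate the dates into a min and a max and the payments into a sum for each group (this assumes the date column has already been parsed into datetimes, e.g. with pd.to_datetime):
df2 = df.groupby(['cust_id','sub_id']).agg({'date':['min','max'], 'payment':'sum'})
df2.columns = df2.columns.get_level_values(1)
df2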
and we get
                      min        max  sum
cust_id sub_id
1       A      2020-12-01 2021-05-01  700
        B      2021-01-02 2021-04-23  240
2       C      2021-01-04 2021-01-09  500
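As a side note, the same kind of summary can be built with pandas named aggregation, which yields flat column names directly and so avoids the get_level_values step. A small sketch on made-up rows; the output names first, last and total are my own choice, not something from the question:

```python
import pandas as pd

# Tiny illustrative frame (hypothetical values, not the question's data).
df = pd.DataFrame({
    'cust_id': [1, 1, 2],
    'sub_id': ['A', 'A', 'C'],
    'date': pd.to_datetime(['2020-12-01', '2021-05-01', '2021-01-04']),
    'payment': [200, 500, 500],
})

# Named aggregation: output_column=(input_column, function_name).
summary = df.groupby(['cust_id', 'sub_id']).agg(
    first=('date', 'min'),
    last=('date', 'max'),
    total=('payment', 'sum'),
)
```

The rest of this answer keeps the min, max and sum column names shown above.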
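Then we create a monthly schedule for each row from min to max. I just did the basics to show the idea: padding max by 31 days makes sure the month-end of the last payment month is included, but it can add one month too many when the last payment falls near the end of a month, so you may have to fiddle with the dates a bit to have them nicely lined up:
from datetime import timedelta
# freq='1M' gives month-end dates; pandas 2.2+ deprecates 'M' in favour of 'ME'
df2['schedule'] = df2.apply(lambda row: pd.date_range(row['min'], row['max'] + timedelta(days=31), freq='1M'), axis=1)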
Now df2 looks like this:
          min                  max                  sum  schedule
--------  -------------------  -------------------  ---  --------
(1, 'A')  2020-12-01 00:00:00  2021-05-01 00:00:00  700  DatetimeIndex(['2020-12-31', '2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30', '2021-05-31'], dtype='datetime64[ns]', freq='M')
(1, 'B')  2021-01-02 00:00:00  2021-04-23 00:00:00  240  DatetimeIndex(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30'], dtype='datetime64[ns]', freq='M')
(2, 'C')  2021-01-04 00:00:00  2021-01-09 00:00:00  500  DatetimeIndex(['2021-01-31'], dtype='datetime64[ns]', freq='M')
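If you would rather have the schedule on the first of each month, as in the question's expected output, one possible way is to go through monthly periods instead of padding the end date. This is a sketch only; month_starts is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

def month_starts(first, last):
    """First day of every calendar month from `first` to `last`, inclusive."""
    # Monthly periods cover whole calendar months, so the day of the month
    # of either endpoint does not matter and no padding is needed.
    periods = pd.period_range(start=first, end=last, freq='M')
    # to_timestamp() defaults to the start of each period (the 1st of the month).
    return periods.to_timestamp()

schedule_b = month_starts(pd.Timestamp('2021-01-02'), pd.Timestamp('2021-04-23'))
```

It could be dropped into the apply call in place of pd.date_range, e.g. lambda row: month_starts(row['min'], row['max']), and it should not overshoot when the last payment is on the 31st.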
Now we allocate each group's total equally across its 'schedule', explode the schedule into one row per month, and do some cleanup on column names etc:
# equal share per month = group total / number of months in the schedule
df3 = df2.assign(payment=df2['sum'] / df2['schedule'].map(len)).explode('schedule')
df3.reset_index()[['cust_id', 'sub_id', 'payment', 'schedule']]
to get
    cust_id  sub_id    payment  schedule
--  -------  ------  ---------  -------------------
 0        1  A         116.667  2020-12-31 00:00:00
 1        1  A         116.667  2021-01-31 00:00:00
 2        1  A         116.667  2021-02-28 00:00:00
 3        1  A         116.667  2021-03-31 00:00:00
 4        1  A         116.667  2021-04-30 00:00:00
 5        1  A         116.667  2021-05-31 00:00:00
 6        1  B          60      2021-01-31 00:00:00
 7        1  B          60      2021-02-28 00:00:00
 8        1  B          60      2021-03-31 00:00:00
 9        1  B          60      2021-04-30 00:00:00
10        2  C         500      2021-01-31 00:00:00
This can be done in just a couple of steps using the resample() and transform() functions:
First, we add the missing months to the original table. Resampling with 'MS' moves every date to the first of its month, sums the payments that fall in the same month, and puts 0 in the payment column for months that had no payment (the date column must be a datetime type for this to work):
resampled_df = (df
    .set_index('date')
    .groupby(['cust_id', 'sub_id'])
    .resample('MS')
    .agg({'payment': 'sum'})
    .reset_index()
)
Then, we calculate the average across all months for each group and assign that average to every row in the group, assigning the result to a new column:
resampled_df['avg_monthly_payment'] = (resampled_df
    .groupby(['cust_id', 'sub_id'])['payment']
    .transform('mean')
)
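To see both steps end to end, here is a self-contained sketch run on the question's sample data. It assumes the dates are m/d/yy strings, selects the payment column before resampling (a small variation on the agg call above), and overwrites payment with the rounded average so the result has the same columns as the requested output; monthly is just an illustrative name.

```python
import pandas as pd

df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
    'sub_id':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'date':    ['12/1/20', '2/2/21', '2/3/21', '5/1/21', '1/2/21',
                '1/9/21', '3/1/21', '4/23/21', '1/4/21', '1/9/21'],
    'payment': [200, 200, 100, 200, 50, 20, 80, 90, 200, 300],
})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')  # assumed format

# Step 1: one row per group and calendar month; months without payments get 0.
monthly = (
    df.set_index('date')
      .groupby(['cust_id', 'sub_id'])['payment']
      .resample('MS')
      .sum()
      .reset_index()
)

# Step 2: replace each monthly amount with the group's average over all months.
monthly['payment'] = (
    monthly.groupby(['cust_id', 'sub_id'])['payment']
           .transform('mean')
           .round(2)
)

print(monthly)
```

Because the zero-payment months are counted in the mean, the total paid by each group is spread over its full span of months rather than only over the months in which a payment happened.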