
Find the average of values in a column and create a new dataframe that distributes the average

I want to replace the existing values in a column with the average of that column, preferably using Python. The payments should be distributed equally over all the months from the first month of payment to the last. The average monthly payment should be calculated per cust_id and sub_id.

Payments may skip months, and the amounts vary.

I hope you can help me with this, as I am only beginning to learn Python.

The data looks like this:

cust_id sub_id date payment
1 A 12/1/20 200
1 A 2/2/21 200
1 A 2/3/21 100
1 A 5/1/21 200
1 B 1/2/21 50
1 B 1/9/21 20
1 B 3/1/21 80
1 B 4/23/21 90
2 C 1/4/21 200
2 C 1/9/21 300

The result I want is this:

cust_id sub_id date payment
1 A 12/1/20 116.67
1 A 1/1/21 116.67
1 A 2/1/21 116.67
1 A 3/1/21 116.67
1 A 4/1/21 116.67
1 A 5/1/21 116.67
1 B 1/1/21 60
1 B 2/1/21 60
1 B 3/1/21 60
1 B 4/1/21 60
2 C 1/1/21 500

Thank you very much!

As noted in the comments, your expected result for cust_id=2 and sub_id='C' appears to be inconsistent with your requirements, so I go by the requirements.

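First, we aggregate the dates into a min and a max, and the payments into a sum:

import pandas as pd
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')  # parse strings such as 12/1/20
df2 = df.groupby(['cust_id','sub_id']).agg({'date':['min','max'], 'payment':'sum'})
df2.columns = df2.columns.get_level_values(1)  # flatten the columns to min, max, sum
df2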

and we get

                       min         max  sum
cust_id sub_id
1       A       2020-12-01  2021-05-01  700
        B       2021-01-02  2021-04-23  240
2       C       2021-01-04  2021-01-09  500
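
The averages the question asks for can already be checked by hand from these three numbers: count the calendar months from min to max, both ends included, and divide sum by that count. A small sketch of that arithmetic, where `months_spanned` is a helper name made up for illustration and not part of pandas:

```python
import pandas as pd

def months_spanned(first, last):
    """Number of calendar months from first to last, both inclusive (hypothetical helper)."""
    first, last = pd.Timestamp(first), pd.Timestamp(last)
    return (last.year - first.year) * 12 + (last.month - first.month) + 1

# min, max and sum per (cust_id, sub_id), copied from the table above
groups = {
    (1, 'A'): ('2020-12-01', '2021-05-01', 700),
    (1, 'B'): ('2021-01-02', '2021-04-23', 240),
    (2, 'C'): ('2021-01-04', '2021-01-09', 500),
}

# total divided by the number of months it has to cover
averages = {
    key: round(total / months_spanned(first, last), 2)
    for key, (first, last, total) in groups.items()
}
print(averages)
```

Group (1, 'A') spans six months (December to May), so 700 is split six ways; group (2, 'C') falls within a single month and keeps its full 500.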

Then we create a monthly schedule for each row, running from min to max. You may have to fiddle with the dates a bit to line them up nicely; I only did the basics to show the idea:

from datetime import timedelta
# pad max by 31 days so that the month-end of the last payment's month is included;
# pandas 2.2+ spells this frequency 'ME' ('M' is deprecated there)
df2['schedule'] = df2.apply(lambda row: pd.date_range(row['min'], row['max'] + timedelta(days=31), freq='1M'), axis=1)
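
One possible way to line the dates up is to build the schedule from monthly periods instead of padding max by 31 days. The sketch below is an alternative, not what the rest of this answer uses, and `month_starts` is a hypothetical helper name. It returns first-of-month dates, as in the desired output of the question, and it avoids an edge case of the padding approach: a last payment on 31 January plus 31 days lands in March, which would add a February row.

```python
import pandas as pd

def month_starts(first, last):
    """First-of-month dates covering first..last inclusive (hypothetical helper)."""
    periods = pd.period_range(first, last, freq='M')  # one period per calendar month
    return periods.to_timestamp()                     # start date of each period

print(month_starts('2020-12-01', '2021-05-01'))  # six month starts, December to May
print(month_starts('2021-01-04', '2021-01-31'))  # a single month, no spill into February
```

To try it in the line above, the lambda would become `lambda row: month_starts(row['min'], row['max'])`; the schedule then holds 2020-12-01, 2021-01-01 and so on, rather than month-end dates.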

Now df2 looks like this:


          min                  max                    sum  schedule
--------  -------------------  -------------------  -----  ---------------------------------------------------------------------------------------------------------
(1, 'A')  2020-12-01 00:00:00  2021-05-01 00:00:00    700  DatetimeIndex(['2020-12-31', '2021-01-31', '2021-02-28', '2021-03-31',
                                                                          '2021-04-30', '2021-05-31'],
                                                                         dtype='datetime64[ns]', freq='M')
(1, 'B')  2021-01-02 00:00:00  2021-04-23 00:00:00    240  DatetimeIndex(['2021-01-31', '2021-02-28', '2021-03-31', '2021-04-30'], dtype='datetime64[ns]', freq='M')
(2, 'C')  2021-01-04 00:00:00  2021-01-09 00:00:00    500  DatetimeIndex(['2021-01-31'], dtype='datetime64[ns]', freq='M')

Now we allocate each total equally across its schedule, explode 'schedule' into one row per month, and tidy up the columns:

df3 = (df2
    .assign(payment = df2['sum'] / df2['schedule'].map(len))  # equal share per scheduled month
    .explode('schedule')                                      # one row per scheduled month
    .reset_index()                                            # cust_id and sub_id back to columns
    [['cust_id','sub_id','payment','schedule']]
)
df3

to get

      cust_id  sub_id      payment  schedule
--  ---------  --------  ---------  -------------------
 0          1  A           116.667  2020-12-31 00:00:00
 1          1  A           116.667  2021-01-31 00:00:00
 2          1  A           116.667  2021-02-28 00:00:00
 3          1  A           116.667  2021-03-31 00:00:00
 4          1  A           116.667  2021-04-30 00:00:00
 5          1  A           116.667  2021-05-31 00:00:00
 6          1  B            60      2021-01-31 00:00:00
 7          1  B            60      2021-02-28 00:00:00
 8          1  B            60      2021-03-31 00:00:00
 9          1  B            60      2021-04-30 00:00:00
10          2  C           500      2021-01-31 00:00:00

This can be done in just a couple of steps using the resample() and transform() functions:

First, we resample each group to monthly frequency. This moves every date to the first of its month, sums the payments that fall in the same month, and adds a row with payment 0 for every month that had no payment:

# assumes df['date'] already has a datetime dtype
resampled_df = (df
   .set_index('date')
   .groupby(['cust_id', 'sub_id'])
   .resample('MS')
   .agg({'payment': 'sum'})
   .reset_index()
)

Then we calculate the average across all months of each group and write it to every row of that group, in a new column:

resampled_df['avg_monthly_payment'] = (resampled_df
   .groupby(['cust_id', 'sub_id'])['payment']
   .transform('mean')
)
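
For reference, here is a self-contained sketch that runs this approach on the sample data from the question. The date format string and the rounding to two decimals are assumptions added for the example, and the average overwrites the payment column (rather than going into a new one) so that the result has the layout the question asks for.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'cust_id': [1, 1, 1, 1, 1, 1, 1, 1, 2, 2],
    'sub_id':  ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'C', 'C'],
    'date':    ['12/1/20', '2/2/21', '2/3/21', '5/1/21',
                '1/2/21', '1/9/21', '3/1/21', '4/23/21',
                '1/4/21', '1/9/21'],
    'payment': [200, 200, 100, 200, 50, 20, 80, 90, 200, 300],
})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%y')  # assumed month/day/year

# Monthly totals per group; months without a payment become 0
monthly = (df
    .set_index('date')
    .groupby(['cust_id', 'sub_id'])['payment']
    .resample('MS')
    .sum()
    .reset_index()
)

# Replace every monthly total with the group's average over all months
monthly['payment'] = (monthly
    .groupby(['cust_id', 'sub_id'])['payment']
    .transform('mean')
    .round(2)
)
print(monthly)
```

The zero rows created by the resampling are what make the mean a per-month average over the whole span, not an average over the months that happen to contain payments.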


 