Pandas data frame: calculate the average of the same months over the last 3 years and append it to the same data frame

I have a pandas data frame like this:

Account Id  Gross Sum   Invoice Type Name      Net Sum Company     Security         Supplier Date Completed YearMonth   Category
710830      282.81      Invoice                282.81              asd5a            Abc      1/1/2018       2018-1      Postal
445800      4868.71     Invoice                3926.4              adc6ac           Def      1/1/2018       2018-1      R&D
710350      282.81      Invoice                282.81              fgn6             Ghi      2/9/2018       2018-2      Other
710510      282.81      Invoice                282.81              dg               jkl      2/9/2018       2018-2      Electricity
710630      841.59      Invoice                707.07              dfvbfbf          mno      3/2/2018       2018-3      Repairs
710610      841.59      Invoice                707.07              rrcv             pqr      3/2/2018       2018-3      Leasing
710810      12.14       Invoice                10.12               btbfd            stu      1/1/2019       2019-1      Telephone
704300      81517.6     Invoice                65740               dfbtt            vwx      1/1/2019       2019-1      Statutory
710510      2105.64     Invoice                1776.53             dfdftb5          dfb      2/9/2019       2019-2      Electricity
710510      2105.64     Invoice                1776.53             ebdfb5b          bcd      2/9/2019       2019-2      Electricity
710920      66.96       Invoice                54                  dfrrt65          efg      3/2/2019       2019-3      Data
700330      239.47      Invoice                239.47              aae3a11          hij      3/2/2019       2019-3      Coffee

What I want is to add rows at the bottom of the data frame that hold the average for the same month over the last 3 years.

For example, for year month 2020-1 the calculation should be: (sum(Net Sum Company) in 2019-1 + sum(Net Sum Company) in 2018-1 + sum(Net Sum Company) in 2017-1) divided by the number of months considered, i.e. 3, so only the last three years are taken into account. That way I'll get the average and append it as a new row at the bottom that contains nothing but the YearMonth and the average of the Net Sum Company column. The end goal is a data frame like this:

Account Id  Gross Sum   Invoice Type Name      Net Sum Company     Security         Supplier Date Completed YearMonth   Category
710830      282.81      Invoice                282.81              asd5a            Abc      1/1/2018       2018-1      Postal
445800      4868.71     Invoice                3926.4              adc6ac           Def      1/1/2018       2018-1      R&D
710350      282.81      Invoice                282.81              fgn6             Ghi      2/9/2018       2018-2      Other
710510      282.81      Invoice                282.81              dg               jkl      2/9/2018       2018-2      Electricity
710630      841.59      Invoice                707.07              dfvbfbf          mno      3/2/2018       2018-3      Repairs
710610      841.59      Invoice                707.07              rrcv             pqr      3/2/2018       2018-3      Leasing
710810      12.14       Invoice                10.12               btbfd            stu      1/1/2019       2019-1      Telephone
704300      81517.6     Invoice                65740               dfbtt            vwx      1/1/2019       2019-1      Statutory
710510      2105.64     Invoice                1776.53             dfdftb5          dfb      2/9/2019       2019-2      Electricity
710510      2105.64     Invoice                1776.53             ebdfb5b          bcd      2/9/2019       2019-2      Electricity
710920      66.96       Invoice                54                  dfrrt65          efg      3/2/2019       2019-3      Data
700330      239.47      Invoice                239.47              aae3a11          hij      3/2/2019       2019-3      Coffee
-              -           -                   34979.66            -                -        -              2020-1      -
-              -           -                   2059.34             -                -        -              2020-2      -
-              -           -                   853.805             -                -        -              2020-3      -

I am new to pandas, so any guidance is appreciated. This has to be done strictly with pandas.

For a simple 3-year rolling average, do something like this:

import pandas as pd
df1['Date Completed'] = pd.to_datetime(df1['Date Completed'])
df1 = df1.sort_values('Date Completed')  # time-based windows need sorted dates
df1['roll_3y_avg'] = df1.rolling(window='1096D', on='Date Completed', closed='right')['Net Sum Company'].mean()
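
To see what this rolling window actually computes, here is a minimal self-contained sketch. The three-row frame and its values are made up for illustration, and the `%m/%d/%Y` format is an assumption inferred from the sample (where `2/9/2018` belongs to `2018-2`). Note that the result is a trailing mean over every row in the previous 1096 days, not an average restricted to the same calendar month.

```python
import pandas as pd

# Illustrative data only: one invoice per year, deliberately out of order.
df1 = pd.DataFrame({
    'Date Completed': ['1/1/2019', '1/1/2017', '1/1/2018'],
    'Net Sum Company': [30.0, 10.0, 20.0],
})

# Assumed month/day/year format, as in the question's sample.
df1['Date Completed'] = pd.to_datetime(df1['Date Completed'], format='%m/%d/%Y')

# Time-based windows require the `on` column to be monotonic, so sort first.
df1 = df1.sort_values('Date Completed').reset_index(drop=True)

# Trailing mean over the window (t - 1096 days, t] for each row.
df1['roll_3y_avg'] = (
    df1.rolling(window='1096D', on='Date Completed', closed='right')['Net Sum Company']
       .mean()
)
print(df1)
```

Each row's value depends on everything that falls in its window, so with several months per year this mixes months together. That is why it only suits the "simple" case.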

IIUC, you want to:

  • find the next year for each month in the dataframe (the latest year present for that month, plus one)
  • sum per month the Net Sum Company column over the 3 previous years
  • divide each sum by the number of distinct year-months that contributed to it (2 in the sample) to get a monthly average
  • add those averages to the dataframe with the new year and the month in the YearMonth column

Code could be:

import pandas as pd

# extract Year and Month Series from the dataframe
year = df['YearMonth'].str.slice(stop=4).astype(int)
month = df['YearMonth'].str.slice(start=5)

# compute the new year per month as max(year) + 1
newyear_month = year.groupby(month).max() + 1

# build a Series aligned with the dataframe from that new year
newyear = month.map(newyear_month)

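# select the rows that fall in the 3 years before the new year
relevant = (newyear-3 <= year) & (year <= newyear-1)

# compute the sum of relevant years per month
tmp = df.loc[relevant, 'Net Sum Company'].groupby(month).sum()

# divide by the number of distinct YearMonth values that contributed to each sum
tmp /= df.loc[relevant].groupby(month)['YearMonth'].nunique()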

# compute a YearMonth column for that new dataframe
tmp = pd.concat([newyear_month.astype(str), tmp], axis=1)
tmp['YearMonth'] = tmp['YearMonth'] + '-' + tmp.index  # tmp is indexed by month

# force the type of Account Id to object to allow it to contain null values
df['Account Id'] = df['Account Id'].astype(object)

# concat the new rows to the dataframe and reset the index
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
new_df = pd.concat([df, tmp], sort=False).reset_index(drop=True)

With your sample, new_df gives:

   Account Id  Gross Sum Invoice Type Name  Net Sum Company Security Supplier Date Completed YearMonth     Category
0      710830     282.81           Invoice          282.810    asd5a      Abc       1/1/2018    2018-1       Postal
1      445800    4868.71           Invoice         3926.400   adc6ac      Def       1/1/2018    2018-1          R&D
2      710350     282.81           Invoice          282.810     fgn6      Ghi       2/9/2018    2018-2        Other
3      710510     282.81           Invoice          282.810       dg      jkl       2/9/2018    2018-2  Electricity
4      710630     841.59           Invoice          707.070  dfvbfbf      mno       3/2/2018    2018-3      Repairs
5      710610     841.59           Invoice          707.070     rrcv      pqr       3/2/2018    2018-3      Leasing
6      710810      12.14           Invoice           10.120    btbfd      stu       1/1/2019    2019-1    Telephone
7      704300   81517.60           Invoice        65740.000    dfbtt      vwx       1/1/2019    2019-1    Statutory
8      710510    2105.64           Invoice         1776.530  dfdftb5      dfb       2/9/2019    2019-2  Electricity
9      710510    2105.64           Invoice         1776.530  ebdfb5b      bcd       2/9/2019    2019-2  Electricity
10     710920      66.96           Invoice           54.000  dfrrt65      efg       3/2/2019    2019-3         Data
11     700330     239.47           Invoice          239.470  aae3a11      hij       3/2/2019    2019-3       Coffee
12        NaN        NaN               NaN        34979.665      NaN      NaN            NaN    2020-1          NaN
13        NaN        NaN               NaN         2059.340      NaN      NaN            NaN    2020-2          NaN
14        NaN        NaN               NaN          853.805      NaN      NaN            NaN    2020-3          NaN
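
For reference, the same steps can be condensed into one function. This is only a sketch under stated assumptions: `append_same_month_average` and its parameters are hypothetical names, not part of pandas, and the frame below keeps just the two columns of the sample that the calculation needs.

```python
import pandas as pd

def append_same_month_average(df, value_col='Net Sum Company', n_years=3):
    """Hypothetical helper: append one row per month holding the average of
    the monthly totals over the n_years before the next (missing) year."""
    parts = df['YearMonth'].str.split('-', expand=True)
    year = parts[0].astype(int)
    month = parts[1]

    # Next year per month, then broadcast back onto the rows.
    next_year = year.groupby(month).max() + 1
    row_next_year = month.map(next_year)

    # Keep only rows from the n_years preceding the new year.
    recent = df[(year >= row_next_year - n_years) & (year < row_next_year)]

    # Total per YearMonth, then the mean of those totals per calendar month.
    monthly_total = recent.groupby('YearMonth')[value_col].sum()
    total_month = monthly_total.index.str.split('-').str[1]
    avg = monthly_total.groupby(total_month).mean()

    # One new row per month: the average plus the new YearMonth label.
    new_rows = avg.to_frame()
    new_rows['YearMonth'] = [f'{next_year[m]}-{m}' for m in new_rows.index]
    return pd.concat([df, new_rows], ignore_index=True, sort=False)

# The YearMonth and Net Sum Company columns of the question's sample.
sample = pd.DataFrame({
    'YearMonth': ['2018-1', '2018-1', '2018-2', '2018-2', '2018-3', '2018-3',
                  '2019-1', '2019-1', '2019-2', '2019-2', '2019-3', '2019-3'],
    'Net Sum Company': [282.81, 3926.4, 282.81, 282.81, 707.07, 707.07,
                        10.12, 65740, 1776.53, 1776.53, 54, 239.47],
})
result = append_same_month_average(sample)
print(result.tail(3))
```

Taking the mean of the monthly totals divides by the number of year-months actually present (2 in the sample), which matches the output above. If a fixed divisor of 3 is wanted, as the question literally states, summing the totals and dividing by `n_years` would be the alternative.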

Remarks:

  • finding the new year per month makes it possible to use the code on a rolling year (from July 2017 to June 2019, for example)
  • you can replace NaN with empty strings (or whatever) with new_df = new_df.fillna('')
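
The first remark can be illustrated with a small sketch. The YearMonth values here are generated purely for illustration (July 2017 to June 2019) and are not the asker's data; only the `newyear_month` step from the code above is reproduced.

```python
import pandas as pd

# Illustrative YearMonth labels covering July 2017 to June 2019.
periods = pd.period_range('2017-07', '2019-06', freq='M')
year_month = pd.Series([f'{p.year}-{p.month}' for p in periods], name='YearMonth')

year = year_month.str.slice(stop=4).astype(int)
month = year_month.str.slice(start=5)

# The next year is computed separately for each month.
newyear_month = year.groupby(month).max() + 1
print(newyear_month)
```

Months 7 to 12 were last seen in 2018, so their new rows are labelled 2019, while months 1 to 6 were last seen in 2019 and get 2020.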
