简体   繁体   中英

Efficient way subtract a row with previous row seperated by group with Pandas

The objective is to subtract a row ( N ) with previous row ( N-1 ) separated by groups.

Given a df

   years nchar  nval
0   2019     a     1
1   2019     b     1
2   2019     c     1
3   2020     a     1
4   2020     s     4

Lets,separate into group of year 2019 , and we denote it as df_2019

For df_2019, there we assign constant 10 .

Then,only for index 0 , we do the following operation and assign to a new column 'B`

df_2019.loc[df_2019.index[0], 'B']= 10 - df_2019['nval'].values[0]

Whereas, the other index

df_2019.loc[df_2019.index[N], 'B'] = df_2019['B'].values[N-1] - df_2019['nval'].values[N]

This, will produced the following table

   years      nchar nval C   D    B
 
1   2019          a    1          9
2   2019          b    1          8
3   2019          c    1          7

                          

For the group 2020 , the same computation apply. However, the only difference is, the constant value is the 7 , which is taken from the last index of column B.

To answer this requirement, the following code is produced with extra possible groups.

import pandas as pd
year=[2019,2019,2019,2020,2020,2020,2020,2022,2022,2022]
nval=[1,1,1,1,4,1,4,5,6,7]
nchar=['a','b','c','a','s','c','a','b','c','g']

df=pd.DataFrame(zip(year,nchar,nval),columns=['years','nchar','nval'])
print(df)
year_ls=[2019,2020,2022]
nspacing_total=2

nspacing_between_df=4
all_df=[]
default_val=10
for idx,dyear in enumerate(year_ls):
  df_=df[df['years']==dyear].reset_index(drop=True)
  t=pd.DataFrame([[''] * 3]*len(df_), columns=["C", "D", "B"])
  df_=pd.concat([df_,t],axis=1)
  Total = df_['nval'].sum()
  df_=pd.DataFrame([[''] * len(df.columns)]*1, columns=df.columns).append(df_).reset_index(drop=True)

  if idx ==0:
    df_.loc[df_.index[0], 'B']=default_val
    

  if idx !=0:
    
    pre_df=all_df[idx-1]
    pre_val=pre_df['B'].values[-1]

    nposi=1

    pre_years=pre_df['years'].values[nposi]
    df_.loc[df_.index[0], 'nchar']=f'From {pre_years}'
    df_.loc[df_.index[0], 'B']=pre_val

  for ndexd in range(df_.shape[0]-1):
    df_.loc[df_.index[ndexd+1], 'B']=df_['B'].values[ndexd]-df_['nval'].values[ndexd+1]
    
  df_=df_.append(pd.DataFrame([[''] * len(df.columns)]*nspacing_total, columns=df.columns)).reset_index(drop=True)
  df_.loc[df_.index[-1], 'nval']=Total
  df_.loc[df_.index[-1], 'nchar']='Total'
  df_.loc[df_.index[-1], 'B']=df_['B'].values[0]-df_['nval'].values[-1]

  all_df.append(df_)

However, I wonder whether this proposal can be further simplified further using pandas groupby or other. I really appreciate for any tips.

Ultimately, I would like to express the table as below, which will be exported to excel

   years      nchar nval C   D    B
0                           10     
1   2019          a    1          9
2   2019          b    1          8
3   2019          c    1          7
4                                  
5             Total    3          7
6                                  
7                                  
8                                  
9                                  
10        From 2019               7
11  2020          a    1          6
12  2020          s    4          2
13  2020          c    1          1
14  2020          a    4         -3
15                                 
16            Total   10         -3
17                                 
18                                 
19                                 
20                                 
21        From 2020              -3
22  2022          b    5         -8
23  2022          c    6        -14
24  2022          g    7        -21
25                                 
26            Total   18        -21
27                                 
28                                 
29                                 
30                                 

The code to produced the above table

# Optional to represent the table above
all_ap_df=[]
for a_df in all_df:
  df=a_df.append(pd.DataFrame([[''] * len(df.columns)]*nspacing_between_df, columns=df.columns)).reset_index(drop=True)
  all_ap_df.append(df)
df=pd.concat(all_ap_df,axis=0).reset_index(drop=True)


df.loc[df_.index[0], 'D']=df['B'].values[0]
df.loc[df_.index[0], 'B']=''
df = df.fillna('')

I think this is actually quite simple. Use groupby + cumsum :

df['B'] = 10 - df['nval'].cumsum()

Output:

>>> df
   years nchar  nval  B
0   2019     a     1  9
1   2019     b     1  8
2   2019     c     1  7
3   2020     a     1  6
4   2020     s     4  2

In your case chain with groupby

df['new'] = df.groupby('years')['nval'].cumsum().rsub(10)
Out[8]: 
0    9
1    8
2    7
3    9
4    5
Name: nval, dtype: int64

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM