
DataFrame: difference from the previous row's value when other columns match

This is my dataframe analytics, with columns glnumber, nom, Year, YearMonth, amount:

4020 Honoraires de consultation,,2018,201809,234294.31000
4020 Honoraires de consultation,,2018,201810,166337.95000
4020 Honoraires de consultation,,2018,201811,250590.67000
4020 Honoraires de consultation,,2018,201812,92206.82000
4020 Honoraires de consultation,,2019,201901,196868.71000
4020 Honoraires de consultation,,2019,201902,148145.20000
4020 Honoraires de consultation,,2019,201903,110973.24000
4020 Honoraires de consultation,,2019,201904,184858.18000
4020 Honoraires de consultation,,2019,201905,119166.87000
4020 Honoraires de consultation,,2019,201906,10428.10000
4020 Honoraires de consultation,,2019,201907,19927.05000
4020 Honoraires de consultation,,2019,201908,-22677.79000
4020 Honoraires de consultation,,2019,201909,-8560.00000
4020 Honoraires de consultation,,2020,202004,-26.25000
4020 Honoraires de consultation,,2020,202007,-0.02000
4020 Honoraires de consultation,,2021,202101,-105.00000
4020 Honoraires de consultation,,2021,202103,104.99000
4020 Honoraires de consultation,Aclient1,2020,202007,9000.00000
4020 Honoraires de consultation,Aclient1,2020,202008,14040.00000
4020 Honoraires de consultation,Aclient1,2020,202010,31185.00000
4020 Honoraires de consultation,Aclient1,2020,202011,14310.00000
4020 Honoraires de consultation,Aclient1,2020,202012,11160.00000
4020 Honoraires de consultation,Aclient1,2021,202101,14490.00000
4020 Honoraires de consultation,Aclient1,2021,202102,14670.00000
4020 Honoraires de consultation,Aclient2,2020,202003,21045.00000
4020 Honoraires de consultation,Aclient2,2020,202004,13340.00000
4020 Honoraires de consultation,Aclient2C,2020,202006,15640.00000
4020 Honoraires de consultation,Aclient2,2020,202008,54165.00000
4020 Honoraires de consultation,Aclient2,2020,202010,51750.00000
4020 Honoraires de consultation,Aclient2,2020,202011,23000.00000
4020 Honoraires de consultation,Aclient2,2020,202012,19550.00000
4020 Honoraires de consultation,Aclient2,2021,202101,21850.00000
4020 Honoraires de consultation,Aclient2,2021,202102,23000.00000
4020 Honoraires de consultation,Aclient3,2020,202001,937.50000
4020 Honoraires de consultation,Aclient2,2020,202003,437.50000

I want the difference in amount from the previous month, for rows with the same gl and the same client.

I tried this, but it does not work:

# check frequency by month by gl
analytics = q1.groupby(['glnumber','nom','Year','YearMonth'])[['amount']].sum().reset_index()
# order
        
#add previous sales to the next row
if analytics['glnumber'] == analytics['glnumber'].shift(1) and analytics['nom'] == analytics['nom'].shift(1):
            analytics['prev_$'] = 0
else:
            analytics['prev_$'] = analytics['amount'].shift(1)
    
#drop the null values and calculate the difference
analytics = analytics.dropna()
analytics['diff'] = (analytics['amount'] - analytics['prev_$'])
analytics = analytics.drop(['prev_$'],
      axis='columns')
    
analytics['Perc_diff'] = np.where(analytics['amount']==0,0,analytics['diff']/analytics['amount'])

My if condition fails with this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

You need to check for NaN first and then compare. You can do both in a single np.where condition, as follows.

import pandas as pd
import numpy as np
from io import StringIO

# d is assumed to be a string holding the comma-separated rows from the question
c = ['glnumber', 'nom', 'Year', 'YearMonth', 'nom_amount']
df = pd.read_csv(StringIO(d), sep=',', header=None, names=c)
df = df.sort_values(by=['glnumber', 'nom', 'YearMonth'])
print(df.iloc[:, 1:])

# A row qualifies when glnumber and nom equal the previous row's values
# (or either side is NaN) and YearMonth is exactly 1 more than the previous row
df['diff'] = np.where((((df.glnumber.isnull()) | (df.glnumber.shift(1).isnull()) | (df.glnumber == df.glnumber.shift(1))) &
                       ((df.nom.isnull()) | (df.nom.shift(1).isnull()) | (df.nom == df.nom.shift(1))) &
                       (df.YearMonth.diff() == 1)), df.nom_amount.diff(), 0)
print(df.iloc[:, 1:])

I first check whether glnumber or glnumber.shift(1) is null. If neither is, I compare the two values to make sure they are the same.

Similarly, for df.nom, I check whether df.nom or df.nom.shift(1) is null. If neither is, I compare the two and see whether they are the same.

Then I check that the difference between the months is 1, since you want the previous month only. If you would rather drop this check and treat the previous row as the previous month, that's OK too.

If the condition is met, the result is the difference in nom_amount between the two rows. If it is not met, the value is set to 0. Alternatively, you can set the else value to np.nan.

The output of this will be:

          nom  Year  YearMonth  nom_amount       diff
17   Aclient1  2020     202007     9000.00       0.00
18   Aclient1  2020     202008    14040.00    5040.00
19   Aclient1  2020     202010    31185.00       0.00
20   Aclient1  2020     202011    14310.00  -16875.00
21   Aclient1  2020     202012    11160.00   -3150.00
22   Aclient1  2021     202101    14490.00       0.00
23   Aclient1  2021     202102    14670.00     180.00
24   Aclient2  2020     202003    21045.00       0.00
34   Aclient2  2020     202003      437.50       0.00
25   Aclient2  2020     202004    13340.00   12902.50
27   Aclient2  2020     202008    54165.00       0.00
28   Aclient2  2020     202010    51750.00       0.00
29   Aclient2  2020     202011    23000.00  -28750.00
30   Aclient2  2020     202012    19550.00   -3450.00
31   Aclient2  2021     202101    21850.00       0.00
32   Aclient2  2021     202102    23000.00    1150.00
26  Aclient2C  2020     202006    15640.00       0.00
33   Aclient3  2020     202001      937.50       0.00
0         NaN  2018     201809   234294.31       0.00
1         NaN  2018     201810   166337.95  -67956.36
2         NaN  2018     201811   250590.67   84252.72
3         NaN  2018     201812    92206.82 -158383.85
4         NaN  2019     201901   196868.71       0.00
5         NaN  2019     201902   148145.20  -48723.51
6         NaN  2019     201903   110973.24  -37171.96
7         NaN  2019     201904   184858.18   73884.94
8         NaN  2019     201905   119166.87  -65691.31
9         NaN  2019     201906    10428.10 -108738.77
10        NaN  2019     201907    19927.05    9498.95
11        NaN  2019     201908   -22677.79  -42604.84
12        NaN  2019     201909    -8560.00   14117.79
13        NaN  2020     202004      -26.25       0.00
14        NaN  2020     202007       -0.02       0.00
15        NaN  2021     202101     -105.00       0.00
16        NaN  2021     202103      104.99       0.00
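In the output above, 201901 gets a diff of 0 even though 201812 is the previous calendar month, because 201901 - 201812 is 89 rather than 1. One possible way around this, sketched here on a small hypothetical Series (ym and month_index are illustrative names, not from the original code), is to convert each YYYYMM value to a running month count before taking the difference:

```python
import pandas as pd

# Hypothetical sample: YYYYMM integers that straddle a year boundary
ym = pd.Series([201811, 201812, 201901, 201903])

# Running month count: year * 12 + (month - 1)
month_index = (ym // 100) * 12 + (ym % 100 - 1)

# Plain integer diff misses the December -> January step
naive_next_month = ym.diff() == 1

# Diff on the running month count treats it as one month
is_next_month = month_index.diff() == 1

print(naive_next_month.tolist())  # [False, True, False, False]
print(is_next_month.tolist())     # [False, True, True, False]
```

If this behaviour is what you want, month_index.diff() == 1 could stand in for df.YearMonth.diff() == 1 in the condition.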

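Note that because a NaN in glnumber or nom is treated as a match, a row with a missing value can end up compared against a row from a different group, which may cause a small problem at group boundaries. Alternatively, you can use groupby and do the same.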

Groupby ensures that glnumber and nom are the same for the rows being compared.
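A minimal sketch of the groupby variant, on a hypothetical miniature of the data. The placeholder label "(none)" for a missing client is an assumption made here so that those rows form their own group. It is not part of the original answer:

```python
import pandas as pd

# Hypothetical miniature of the question's data (None marks a missing client)
df = pd.DataFrame({
    "glnumber": ["4020"] * 5,
    "nom": ["Aclient1", "Aclient1", "Aclient2", "Aclient2", None],
    "YearMonth": [202007, 202008, 202003, 202004, 201809],
    "amount": [9000.0, 14040.0, 21045.0, 13340.0, 234294.31],
})

# Assumption: give missing clients a placeholder so groupby does not drop them
df["nom"] = df["nom"].fillna("(none)")

df = df.sort_values(["glnumber", "nom", "YearMonth"])

# diff() runs inside each (glnumber, nom) group, so the first row of
# every group is NaN and no value leaks across group boundaries
df["diff"] = df.groupby(["glnumber", "nom"])["amount"].diff()

print(df)
```

This compares each row with the previous row of its group, whatever the gap in months. A month-gap check would still have to be added if only consecutive months should count.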

Your error occurs because an if statement in Python needs a single True/False value as its condition. What you are passing is a pandas Series full of True/False values, one per row, and Python does not know how to reduce that to a single value. What I would do is apply the first step to the entire dataframe, and then use your if-statement logic as a boolean index to select the rows to overwrite:

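# total amount per gl, client, and month
analytics = q1.groupby(['glnumber','nom','Year','YearMonth'])[['amount']].sum().reset_index()
# previous row's amount for every row
analytics['prev_$'] = analytics['amount'].shift(1)
# where gl and client match the previous row, set prev_$ to 0 (the question's condition as written)
analytics.loc[(analytics['glnumber'] == analytics['glnumber'].shift(1)) & (analytics['nom'] == analytics['nom'].shift(1)),'prev_$'] = 0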
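The snippet above follows the question's if/else as written. A small self-contained sketch of the same idea on hypothetical data is shown here: it reproduces the error and then uses the mask row by row. It assumes the goal stated in the question (a difference only where gl and client match the previous row), so it keeps the previous amount on matching rows rather than zeroing it. The names same_group and error_message are illustrative:

```python
import pandas as pd

# Hypothetical two-client sample
analytics = pd.DataFrame({
    "glnumber": ["4020", "4020", "4020"],
    "nom": ["Aclient1", "Aclient1", "Aclient2"],
    "amount": [9000.0, 14040.0, 21045.0],
})

# One True/False per row: does this row share gl and client with the row above?
same_group = (
    (analytics["glnumber"] == analytics["glnumber"].shift(1))
    & (analytics["nom"] == analytics["nom"].shift(1))
)

# Using the whole mask as an if condition raises the ValueError from the question
try:
    if same_group:
        pass
except ValueError as exc:
    error_message = str(exc)

# Apply the mask element-wise instead: keep the previous amount only on matching rows
analytics["prev_$"] = analytics["amount"].shift(1).where(same_group)
analytics["diff"] = analytics["amount"] - analytics["prev_$"]

print(error_message)
print(analytics)
```

Rows that do not match the previous row end up with NaN in diff. Call fillna(0) on that column if 0 is preferred.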
