
How to show differences from two pandas dataframes of different sizes

If I have two dataframes that look like:

current_month

Product  Revenue  Expense  Profit  PaymentFrequency  Customer
A            100      100       0  Monthly           Cust1
B            200      150      50  Monthly           Cust2
C             90       80      10  Monthly           Cust3

previous_month

Product  Revenue  Expense  Profit  PaymentFrequency  Customer
A            120      120       0  Monthly           Cust1
B            250      175      75  Monthly           Cust1

For each product I would like to have a table of just the differences:

Product A

month           Revenue  Expense
current_month       100      100
previous_month      120      120

Product B

month           Revenue  Expense  Profit  Customer
current_month       200      150      50  Cust2
previous_month      250      175      75  Cust1

Product C

month           Revenue  Expense  Profit  PaymentFrequency  Customer
current_month        90       80      10  Monthly           Cust3
previous_month      NaN      NaN     NaN  NaN               NaN

I've been able to identify the differences using a for loop and .loc, but I am struggling to get the desired output.

for product in list(current_month.index):
    for field in list(current_month.columns):
        try:
            if current_month[field].loc[product] != previous_month[field].loc[product]:
                print(f'field: {field}')
                print(f'product: {product}')
                print(f'new value: {current_month[field].loc[product]}')
                print(f'old value: {previous_month[field].loc[product]}') 
        except KeyError:
            print(f'field: {field}')
            print(f'product: {product}')
            print(f'new value: {current_month[field].loc[product]}')
            print('old value: NaN')

(i) First outer-merge the dataframes on Product, with a suffix marking each month, and stack the result; this creates df_m, a pd.Series with a MultiIndex.

(ii) Rebuild the MultiIndex by splitting each suffixed column name into its field and month, sort by it, and unstack.

(iii) For each product (the first level of the MultiIndex), select its rows, transpose the dataframe, and use drop_duplicates(keep=False) to discard any value that is identical in both months.

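import numpy as np
import pandas as pd

# Assumption: df1 is current_month and df2 is previous_month, each with Product as a regular column.
# (i) Outer merge keeps products that appear in only one month; stack gives a Series.
df_m = df1.merge(df2, on='Product', how='outer', suffixes=(' current', ' previous')).set_index('Product').stack()
# (ii) Split e.g. 'Revenue current' into ('Revenue', 'current') to build a three-level index.
df_m.index = pd.MultiIndex.from_tuples([(i,)+tuple(j.split()) for i,j in df_m.index])
df_m = df_m.sort_index().unstack()

# (iii) Per product: transpose, then drop the columns whose two values are identical.
# Real NaN is swapped for the string 'NaN' first, so dropna removes only the
# columns emptied by drop_duplicates and not the genuinely missing values.
out = [(df_m[df_m.index.get_level_values(0) == product]
        .T
        .replace(np.nan,'NaN')
        .apply(lambda x: x.drop_duplicates(keep=False), axis=0)
        .dropna(axis=1)
        .replace('NaN',np.nan))
       for product in ['A','B','C']]
productA, productB, productC = out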

Output:

               A        
         Expense Revenue
current      100     100
previous   120.0   120.0

                B                       
         Customer Expense Profit Revenue
current     Cust2     150     50     200
previous    Cust1   175.0   75.0   250.0

                C                                        
         Customer Expense PaymentFrequency Profit Revenue
current     Cust3    80.0          Monthly   10.0    90.0
previous      NaN     NaN              NaN    NaN     NaN
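
For comparison, here is a minimal alternative sketch that skips the merge/stack round trip. It is not the approach from the answer above: it assumes Product is the index of both dataframes (as in the question's loop), and the helper name diff_tables is hypothetical. Both frames are reindexed to the union of products, compared cell by cell, and only the changed columns are kept for each product.

```python
import pandas as pd

# Sample data copied from the question; using Product as the index is an assumption.
current_month = pd.DataFrame(
    {
        "Revenue": [100, 200, 90],
        "Expense": [100, 150, 80],
        "Profit": [0, 50, 10],
        "PaymentFrequency": ["Monthly", "Monthly", "Monthly"],
        "Customer": ["Cust1", "Cust2", "Cust3"],
    },
    index=pd.Index(["A", "B", "C"], name="Product"),
)
previous_month = pd.DataFrame(
    {
        "Revenue": [120, 250],
        "Expense": [120, 175],
        "Profit": [0, 75],
        "PaymentFrequency": ["Monthly", "Monthly"],
        "Customer": ["Cust1", "Cust1"],
    },
    index=pd.Index(["A", "B"], name="Product"),
)


def diff_tables(current, previous):
    """Hypothetical helper: map each product to a table of its changed fields."""
    # Align both months on the same products; a missing product becomes a NaN row.
    products = current.index.union(previous.index)
    cur = current.reindex(products)
    prev = previous.reindex(products)

    # A cell counts as changed when the values differ; NaN on both sides does not.
    changed = cur.ne(prev) & ~(cur.isna() & prev.isna())

    tables = {}
    for product in products:
        cols = changed.columns[changed.loc[product].to_numpy()]
        tables[product] = pd.DataFrame(
            {
                "current_month": cur.loc[product, cols],
                "previous_month": prev.loc[product, cols],
            }
        ).T
    return tables


tables = diff_tables(current_month, previous_month)
for product, table in tables.items():
    print(f"Product {product}")
    print(table, end="\n\n")
```

Keeping the result in a dict keyed by product avoids hard-coding the product list, and the columns stay in their original order rather than being sorted alphabetically.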
