Compare two dataframes based on column data in Python pandas

I have two dataframes, df1 and df2. I would like to subtract df2 from df1, matching rows on a specific column, 'Code'.

import pandas as pd
import numpy as np
rng = pd.date_range('2021-01-01', periods=10, freq='D')
df1 = pd.DataFrame(index=rng, data={'Val1': range(10), 'Val2': np.array(range(10))*5, 'Code': [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]})

df2 = pd.DataFrame(data={'Code': [1, 2, 3, 4], 'Val1': [10, 5, 15, 20], 'Val2': [4, 8, 10, 7]})

df1:

            Val1  Val2  Code
2021-01-01     0     0     1
2021-01-02     1     5     1
2021-01-03     2    10     1
2021-01-04     3    15     2
2021-01-05     4    20     2
2021-01-06     5    25     2
2021-01-07     6    30     3
2021-01-08     7    35     3
2021-01-09     8    40     3
2021-01-10     9    45     3

df2:

   Code  Val1  Val2
0     1    10     4
1     2     5     8
2     3    15    10
3     4    20     7

I am using the following code:

df = (df1.set_index(['Code']) - df2.set_index(['Code']))

and the result is

Code            
1    -10.0  -4.0
1     -9.0   1.0
1     -8.0   6.0
2     -2.0   7.0
2     -1.0  12.0
2      0.0  17.0
3     -9.0  20.0
3     -8.0  25.0
3     -7.0  30.0
3     -6.0  35.0
4      NaN   NaN

However, I only want results for the rows that are in df1, not for keys that are missing from it (4 in this example).

How do I do that, and then set the index back to the original one from df1?

Something like this, but it doesn't work:

df = (df1.set_index(['Code']) - df2.set_index(['Code'])).set_index(df1['Code'])

I would also like to keep the column headers.

Desired output:

            Val1  Val2  Code
Date                        
2021-01-01 -10.0  -4.0     1
2021-01-02  -9.0   1.0     1
2021-01-03  -8.0   6.0     1
2021-01-04  -2.0   7.0     2
2021-01-05  -1.0  12.0     2
2021-01-06   0.0  17.0     2
2021-01-07  -9.0  20.0     3
2021-01-08  -8.0  25.0     3
2021-01-09  -7.0  30.0     3
2021-01-10  -6.0  35.0     3

If you want results only for the rows that are in df1, and not for the missing keys (4 in this example), just use the dropna() method:

df = (df1.set_index(['Code']) - df2.set_index(['Code'])).dropna()
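As a small illustration of why key 4 shows up at all: pandas arithmetic aligns both operands on the union of their index labels, and any label present on only one side produces NaN. The sketch below uses two made-up Series (a and b are illustrative names, not part of the question) rather than the full frames.

```python
import pandas as pd

# Two small Series keyed by "Code"; label 4 exists only in b.
a = pd.Series([0, 3], index=[1, 2])
b = pd.Series([10, 5, 20], index=[1, 2, 4])

# Subtraction aligns on the union of the labels: 1, 2 and 4.
diff = a - b
print(diff)

# dropna() removes the label that had no partner in a.
print(diff.dropna())
```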

Then add the dates from df1 as a column:

df.insert(0,'Date',df1.index)

And finally, turn Code back into a column and make Date the index:

df.reset_index(inplace=True)
df.set_index('Date',inplace=True)

Now if you print df, you get your desired output:

            Code  Val1  Val2
Date                        
2021-01-01     1 -10.0  -4.0
2021-01-02     1  -9.0   1.0
2021-01-03     1  -8.0   6.0
2021-01-04     2  -2.0   7.0
2021-01-05     2  -1.0  12.0
2021-01-06     2   0.0  17.0
2021-01-07     3  -9.0  20.0
2021-01-08     3  -8.0  25.0
2021-01-09     3  -7.0  30.0
2021-01-10     3  -6.0  35.0
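One caveat with the steps above: df.insert assigns df1.index by position, so they rely on the rows after subtraction still being in the same order as df1. That holds here because Code is already sorted in df1. A minimal sketch of a variant that does not depend on row order is to look up each value through Code with Series.map; the names lookup and result are introduced only for this illustration.

```python
import numpy as np
import pandas as pd

rng = pd.date_range("2021-01-01", periods=10, freq="D")
df1 = pd.DataFrame(
    index=rng,
    data={
        "Val1": range(10),
        "Val2": np.array(range(10)) * 5,
        "Code": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    },
)
df2 = pd.DataFrame(
    data={"Code": [1, 2, 3, 4], "Val1": [10, 5, 15, 20], "Val2": [4, 8, 10, 7]}
)

# Index df2 by Code so each column can serve as a lookup table.
lookup = df2.set_index("Code")

# Work on a copy so df1 keeps its original values.
result = df1.copy()
for col in ["Val1", "Val2"]:
    # map() translates each row's Code into the matching df2 value,
    # keeping the date index of df1, so no re-indexing is needed afterwards.
    result[col] = df1[col] - df1["Code"].map(lookup[col])

print(result)
```

Because only the Codes present in df1 are ever looked up, Code 4 never enters the result and no dropna() is needed.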

Note: if this is not your desired output, let me know.

You can use reindex to align df2 to df1["Code"]. Then take the underlying NumPy ndarray and subtract it in place from the corresponding columns of df1. This leaves both the index and the "Code" column untouched and performs the subtraction as expected.

subtract_values = df2.set_index("Code").reindex(df1["Code"]).to_numpy()
df1[["Val1", "Val2"]] -= subtract_values

print(df1)
            Val1  Val2  Code
2021-01-01   -10    -4     1
2021-01-02    -9     1     1
2021-01-03    -8     6     1
2021-01-04    -2     7     2
2021-01-05    -1    12     2
2021-01-06     0    17     2
2021-01-07    -9    20     3
2021-01-08    -8    25     3
2021-01-09    -7    30     3
2021-01-10    -6    35     3

If you don't want to change df1, you can copy the data to a new DataFrame via new_df = df1.copy() and proceed with new_df instead of df1.
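The non-mutating variant could be sketched as follows. The helper name subtract_by_key and its parameters are hypothetical, introduced only for this example; it assumes the key values in the second frame are unique, which reindex requires.

```python
import numpy as np
import pandas as pd


def subtract_by_key(left, right, key="Code", cols=("Val1", "Val2")):
    """Return a copy of `left` with `right`'s values subtracted, matched on `key`.

    Hypothetical helper for illustration; assumes `right[key]` has no duplicates.
    """
    cols = list(cols)
    # One row of `right` per row of `left`, in `left`'s row order.
    aligned = right.set_index(key)[cols].reindex(left[key]).to_numpy()
    out = left.copy()
    # Positional subtraction: the index and the key column stay untouched.
    out[cols] = out[cols] - aligned
    return out


rng = pd.date_range("2021-01-01", periods=10, freq="D")
df1 = pd.DataFrame(
    index=rng,
    data={
        "Val1": range(10),
        "Val2": np.array(range(10)) * 5,
        "Code": [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    },
)
df2 = pd.DataFrame(
    data={"Code": [1, 2, 3, 4], "Val1": [10, 5, 15, 20], "Val2": [4, 8, 10, 7]}
)

new_df = subtract_by_key(df1, df2)
print(new_df)
```

If df1 contained a Code that is missing from df2, reindex would fill that row with NaN, and the subtraction result for that row would be NaN as well.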
