How to compare two data frames of different size based on a column?

I have two data frames of different sizes:

df1

     YearDeci  Year  Month  Day  ...  Magnitude    Lat    Lon  
0     1551.997260  1551     12   31  ...        7.5  34.00  74.50      
1     1661.997260  1661     12   31  ...        7.5  34.00  75.00      
2     1720.535519  1720      7   15  ...        6.5  28.37  77.09      
3     1734.997260  1734     12   31  ...        7.5  34.00  75.00      
4     1777.997260  1777     12   31  ...        7.7  34.00  75.00      

and

df2

         YearDeci  Year  Month  Day  Hour  ...  Seconds   Mb     Lat     Lon  
0     1669.510753  1669      6    4     0  ...        0  NaN  33.400  73.200    
1     1720.535519  1720      7   15     0  ...        0  NaN  28.700  77.200    
2     1780.000000  1780      0    0     0  ...        0  NaN  35.000  77.000    
3     1803.388014  1803      5   22    15  ...        0  NaN  30.600  78.600    
4     1803.665753  1803      9    1     0  ...        0  NaN  30.300  78.800
5     1803.388014  1803      5   22    15  ...        0  NaN  30.600  78.600

1. I want to compare df1 and df2 based on the column 'YearDeci' and find the common entries and the unique entries (rows other than the common rows).

2. Output the rows of df1 that are common with df2, based on the column 'YearDeci'.

3. Output the rows of df1 that are unique with respect to df2, based on the column 'YearDeci'.

*NB: A difference of up to +/-0.0001 in 'YearDeci' is tolerable.

The expected output looks like this:

row_common=

      YearDeci  Year  Month  Day  ...  Magnitude    Lat    Lon
2     1720.535519  1720      7   15  ...        6.5  28.37  77.09

row_unique=

      YearDeci  Year  Month  Day  ...  Magnitude    Lat    Lon  
0     1551.997260  1551     12   31  ...        7.5  34.00  74.50      
1     1661.997260  1661     12   31  ...        7.5  34.00  75.00           
3     1734.997260  1734     12   31  ...        7.5  34.00  75.00      
4     1777.997260  1777     12   31  ...        7.7  34.00  75.00 

First, compare df1.YearDeci with df2.YearDeci on an "each with each" (pairwise) basis. For the comparison, use the np.isclose function with the required absolute tolerance.

The result is a two-dimensional boolean array:

  • first index - index in df1,
  • second index - index in df2.

Then, using np.argwhere, find the indices of the True values, i.e. the indices of "correlated" rows from df1 and df2, and create a DataFrame from them.

The code to perform the above operations is:

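import numpy as np
import pandas as pd
# Convert to NumPy first: pandas 2.x no longer supports Series[:, np.newaxis]
ind = pd.DataFrame(np.argwhere(np.isclose(df1.YearDeci.to_numpy()[:, np.newaxis],
    df2.YearDeci.to_numpy()[np.newaxis, :], atol=0.0001, rtol=0)),
    columns=['ind1', 'ind2'])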
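
To see what the intermediate boolean array looks like, here is a minimal sketch that uses only the YearDeci values shown in the question. The array names a, b, close and pairs are made up for illustration. It also shows why rtol=0 matters: with NumPy's default relative tolerance, values around 1720 that differ by 0.01 would still count as "close".

```python
import numpy as np

# YearDeci values copied from the question's df1 (a) and df2 (b)
a = np.array([1551.997260, 1661.997260, 1720.535519, 1734.997260, 1777.997260])
b = np.array([1669.510753, 1720.535519, 1780.000000,
              1803.388014, 1803.665753, 1803.388014])

# a[:, np.newaxis] has shape (5, 1) and b[np.newaxis, :] has shape (1, 6),
# so broadcasting compares every element of a with every element of b.
close = np.isclose(a[:, np.newaxis], b[np.newaxis, :], atol=0.0001, rtol=0)

# Each row of pairs is (position in a, position in b) of a True entry.
pairs = np.argwhere(close)
print(close.shape)  # (5, 6)
print(pairs)        # [[2 1]]

# Effect of the relative tolerance on two values that differ by 0.01:
loose = np.isclose(1720.535519, 1720.545519)                        # default rtol=1e-05
strict = np.isclose(1720.535519, 1720.545519, atol=0.0001, rtol=0)  # absolute only
```

Here loose is True and strict is False, because the default rtol scales the allowed difference with the magnitude of the values being compared.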

Then, having pairs of indices pointing to "correlated" rows in both DataFrames, perform the following merge:

result = ind.merge(df1, left_on='ind1', right_index=True)\
    .merge(df2, left_on='ind2', right_index=True, suffixes=['_1', '_2'])

The final step is to drop both "auxiliary index columns" ( ind1 and ind2 ):

result.drop(columns=['ind1', 'ind2'], inplace=True)

The result (divided into 2 parts) is:

    YearDeci_1  Year_1  Month_1  Day_1  Magnitude  Lat_1  Lon_1   YearDeci_2  \
0  1720.535519    1720        7     15        6.5  28.37  77.09  1720.535519   

   Year_2  Month_2  Day_2  Hour  Seconds  Mb  Lat_2  Lon_2  
0    1720        7     15     0        0 NaN   28.7   77.2  

The indices of the common rows are already in the variable ind.

So to find the unique entries, all we need to do is drop the common rows from df1 according to the indices in ind. It can also be convenient to save the common entries to a separate CSV file and read them back into a variable (the line below assumes that df1_common.csv has already been written):

df1_common = pd.read_csv("df1_common.csv")

df1_uniq = df1.drop(df1.index[ind.ind1])
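
As an alternative to dropping by index, both requested outputs can be taken straight from the boolean comparison array. The following is a minimal sketch, not the method from the answers above: it rebuilds df1 and df2 with only some of the columns visible in the question (the elided "..." columns are left out), and the names close and in_df2 are made up for illustration.

```python
import numpy as np
import pandas as pd

# Only columns visible in the question; the elided ("...") ones are omitted.
df1 = pd.DataFrame({
    "YearDeci": [1551.997260, 1661.997260, 1720.535519, 1734.997260, 1777.997260],
    "Magnitude": [7.5, 7.5, 6.5, 7.5, 7.7],
    "Lat": [34.00, 34.00, 28.37, 34.00, 34.00],
    "Lon": [74.50, 75.00, 77.09, 75.00, 75.00],
})
df2 = pd.DataFrame({
    "YearDeci": [1669.510753, 1720.535519, 1780.000000,
                 1803.388014, 1803.665753, 1803.388014],
})

# Pairwise comparison with an absolute tolerance of 0.0001
close = np.isclose(df1["YearDeci"].to_numpy()[:, np.newaxis],
                   df2["YearDeci"].to_numpy()[np.newaxis, :],
                   atol=0.0001, rtol=0)

# True for every df1 row that has at least one match in df2
in_df2 = close.any(axis=1)

row_common = df1[in_df2]    # rows of df1 that also occur in df2
row_unique = df1[~in_df2]   # rows of df1 with no counterpart in df2

print(row_common)
print(row_unique)
```

Because the mask is positional, this works even if df1 does not have the default 0..n-1 index, and a df1 row that matches several df2 rows still appears only once in row_common.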
