
How to compare two data frames of different size based on a column?

I have two data frames of different sizes:

df1

     YearDeci  Year  Month  Day  ...  Magnitude    Lat    Lon  
0     1551.997260  1551     12   31  ...        7.5  34.00  74.50      
1     1661.997260  1661     12   31  ...        7.5  34.00  75.00      
2     1720.535519  1720      7   15  ...        6.5  28.37  77.09      
3     1734.997260  1734     12   31  ...        7.5  34.00  75.00      
4     1777.997260  1777     12   31  ...        7.7  34.00  75.00      

and

df2

         YearDeci  Year  Month  Day  Hour  ...  Seconds   Mb     Lat     Lon  
0     1669.510753  1669      6    4     0  ...        0  NaN  33.400  73.200    
1     1720.535519  1720      7   15     0  ...        0  NaN  28.700  77.200    
2     1780.000000  1780      0    0     0  ...        0  NaN  35.000  77.000    
3     1803.388014  1803      5   22    15  ...        0  NaN  30.600  78.600    
4     1803.665753  1803      9    1     0  ...        0  NaN  30.300  78.800
5     1803.388014  1803      5   22    15  ...        0  NaN  30.600  78.600

1. I want to compare df1 and df2 based on the column 'YearDeci' and find the common entries and the unique entries (rows other than the common rows).

2. Output the common rows (with respect to df2) in df1 based on column 'YearDeci'.

3. Output the unique rows (with respect to df2) in df1 based on column 'YearDeci'.

NB: A difference of up to +/-0.0001 in the 'YearDeci' values is tolerable.

The expected output is:

row_common =

      YearDeci  Year  Month  Day  ...  Magnitude    Lat    Lon
2  1720.535519  1720      7   15  ...        6.5  28.37  77.09

row_unique =

      YearDeci  Year  Month  Day  ...  Magnitude    Lat    Lon  
0     1551.997260  1551     12   31  ...        7.5  34.00  74.50      
1     1661.997260  1661     12   31  ...        7.5  34.00  75.00           
3     1734.997260  1734     12   31  ...        7.5  34.00  75.00      
4     1777.997260  1777     12   31  ...        7.7  34.00  75.00 

First, compare df1.YearDeci with df2.YearDeci on an "each with each" basis, using np.isclose with the assumed absolute tolerance.

The result is a two-dimensional boolean array:

  • first index: index in df1,
  • second index: index in df2.
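The "each with each" comparison can be illustrated on a few values. The following is a minimal sketch, assuming plain NumPy arrays a and b (hypothetical names) that stand in for the two YearDeci columns; the values are taken from the sample data in the question.

```python
import numpy as np

# Reduced YearDeci values taken from the sample data in the question.
a = np.array([1551.997260, 1720.535519, 1777.997260])  # stands in for df1.YearDeci
b = np.array([1669.510753, 1720.535519])               # stands in for df2.YearDeci

# a[:, np.newaxis] has shape (3, 1) and b[np.newaxis, :] has shape (1, 2),
# so broadcasting yields a (3, 2) matrix: one row per a value, one column per b value.
close = np.isclose(a[:, np.newaxis], b[np.newaxis, :], atol=0.0001, rtol=0)

print(close.shape)
print(close)
```

Only the element at position [1, 1] is True here, because 1720.535519 is the only value present in both arrays.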

Then, using np.argwhere, find the indices of the True values, i.e. the indices of "correlated" rows from df1 and df2, and create a DataFrame from them.

The code to perform the above operations is:

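import numpy as np
import pandas as pd

# .to_numpy() is needed: recent pandas versions no longer allow Series[:, np.newaxis]
ind = pd.DataFrame(np.argwhere(np.isclose(df1.YearDeci.to_numpy()[:, np.newaxis],
    df2.YearDeci.to_numpy()[np.newaxis, :], atol=0.0001, rtol=0)),
    columns=['ind1', 'ind2'])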
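The call above passes rtol=0 explicitly, and a short sketch may help show why. np.isclose(a, b) tests whether |a - b| <= atol + rtol * |b|, and rtol defaults to 1e-05. For values around 1720 the relative term alone contributes roughly 0.017, which is far larger than the intended +/-0.0001. The values x and y below are hypothetical and chosen only to illustrate this.

```python
import numpy as np

# Two hypothetical values that differ by 0.01, well outside the +/-0.0001 tolerance.
x, y = 1720.535519, 1720.545519

# The default rtol=1e-05 adds about 1e-05 * 1720 = 0.017 to the allowed difference.
with_default_rtol = bool(np.isclose(x, y, atol=0.0001))

# rtol=0 leaves only the absolute tolerance.
absolute_only = bool(np.isclose(x, y, atol=0.0001, rtol=0))

print(with_default_rtol, absolute_only)
```

With the default rtol the two values would be reported as close; with rtol=0 they are not.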

Then, with the pairs of indices pointing to "correlated" rows in both DataFrames, perform the following merge:

result = ind.merge(df1, left_on='ind1', right_index=True)\
    .merge(df2, left_on='ind2', right_index=True, suffixes=['_1', '_2'])

The final step is to drop both auxiliary index columns (ind1 and ind2):

result.drop(columns=['ind1', 'ind2'], inplace=True)

The result (divided into 2 parts) is:

    YearDeci_1  Year_1  Month_1  Day_1  Magnitude  Lat_1  Lon_1   YearDeci_2  \
0  1720.535519    1720        7     15        6.5  28.37  77.09  1720.535519   

   Year_2  Month_2  Day_2  Hour  Seconds  Mb  Lat_2  Lon_2  
0    1720        7     15     0        0 NaN   28.7   77.2  
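The steps above can be tried end to end on a reduced version of the sample data. This is only a sketch: the frames below keep just a few of the columns visible in the question (the elided "..." columns are left out), and .to_numpy() is used so that the broadcasting runs on plain NumPy arrays.

```python
import numpy as np
import pandas as pd

# Reduced stand-ins for df1 and df2, using only columns visible in the question.
df1 = pd.DataFrame({
    "YearDeci": [1551.997260, 1661.997260, 1720.535519, 1734.997260, 1777.997260],
    "Magnitude": [7.5, 7.5, 6.5, 7.5, 7.7],
    "Lat": [34.00, 34.00, 28.37, 34.00, 34.00],
})
df2 = pd.DataFrame({
    "YearDeci": [1669.510753, 1720.535519, 1780.000000, 1803.388014],
    "Lat": [33.4, 28.7, 35.0, 30.6],
})

# Pairwise comparison, then positions of the True entries as (ind1, ind2) pairs.
close = np.isclose(df1["YearDeci"].to_numpy()[:, np.newaxis],
                   df2["YearDeci"].to_numpy()[np.newaxis, :],
                   atol=0.0001, rtol=0)
ind = pd.DataFrame(np.argwhere(close), columns=["ind1", "ind2"])

# Attach the matching df1 and df2 rows, then drop the helper columns.
result = (ind.merge(df1, left_on="ind1", right_index=True)
             .merge(df2, left_on="ind2", right_index=True, suffixes=("_1", "_2"))
             .drop(columns=["ind1", "ind2"]))
print(result)
```

Columns that exist in both frames (here YearDeci and Lat) receive the _1 and _2 suffixes, while columns that exist in only one frame (here Magnitude) keep their names.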

The indices of the common rows are already in the variable ind.

So, to find the unique entries, all we need to do is drop the common rows from df1 according to the indices in ind. It can be convenient to save the common entries to a separate CSV file (df1_common.csv here, assumed to have been written beforehand) and read it back into a variable:

df1_common = pd.read_csv("df1_common.csv")

df1_uniq = df1.drop(df1.index[ind.ind1])
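As a possible variant that avoids the intermediate CSV file, the boolean matrix from np.isclose can be reduced directly to a per-row mask for df1. This is a sketch under the same assumptions as before (reduced sample data, hypothetical variable names is_common, row_common and row_unique), not the method used in the answer above.

```python
import numpy as np
import pandas as pd

# Reduced stand-ins for df1 and df2 built from the sample data in the question.
df1 = pd.DataFrame({"YearDeci": [1551.997260, 1661.997260, 1720.535519,
                                 1734.997260, 1777.997260],
                    "Magnitude": [7.5, 7.5, 6.5, 7.5, 7.7]})
df2 = pd.DataFrame({"YearDeci": [1669.510753, 1720.535519, 1780.000000,
                                 1803.388014, 1803.665753, 1803.388014]})

close = np.isclose(df1["YearDeci"].to_numpy()[:, np.newaxis],
                   df2["YearDeci"].to_numpy()[np.newaxis, :],
                   atol=0.0001, rtol=0)

# A df1 row counts as "common" if it is close to at least one df2 row.
is_common = close.any(axis=1)

row_common = df1[is_common]    # rows of df1 that have a match in df2
row_unique = df1[~is_common]   # rows of df1 without any match in df2

print(row_common)
print(row_unique)
```

Because the mask has one entry per df1 row, a df1 row that matches several df2 rows still appears only once in row_common.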
