How to compare two data frames of different size based on a column?
I have two data frames of different sizes:
df1
YearDeci Year Month Day ... Magnitude Lat Lon
0 1551.997260 1551 12 31 ... 7.5 34.00 74.50
1 1661.997260 1661 12 31 ... 7.5 34.00 75.00
2 1720.535519 1720 7 15 ... 6.5 28.37 77.09
3 1734.997260 1734 12 31 ... 7.5 34.00 75.00
4 1777.997260 1777 12 31 ... 7.7 34.00 75.00
and
df2
YearDeci Year Month Day Hour ... Seconds Mb Lat Lon
0 1669.510753 1669 6 4 0 ... 0 NaN 33.400 73.200
1 1720.535519 1720 7 15 0 ... 0 NaN 28.700 77.200
2 1780.000000 1780 0 0 0 ... 0 NaN 35.000 77.000
3 1803.388014 1803 5 22 15 ... 0 NaN 30.600 78.600
4 1803.665753 1803 9 1 0 ... 0 NaN 30.300 78.800
5 1803.388014 1803 5 22 15 ... 0 NaN 30.600 78.600
1. I want to compare df1 and df2 based on the column 'YearDeci' and find the common entries and the unique entries (rows other than the common rows).
2. Output the common rows (with respect to df2) in df1 based on the column 'YearDeci'.
3. Output the unique rows (with respect to df2) in df1 based on the column 'YearDeci'.
*NB: A difference of up to +/-0.0001 in the 'YearDeci' values is tolerable.
The expected output looks like this:
row_common=
YearDeci Year Month Day ... Mb Lat Lon
2 1720.535519 1720 7 15 ... 6.5 28.37 77.09
row_unique=
YearDeci Year Month Day ... Magnitude Lat Lon
0 1551.997260 1551 12 31 ... 7.5 34.00 74.50
1 1661.997260 1661 12 31 ... 7.5 34.00 75.00
3 1734.997260 1734 12 31 ... 7.5 34.00 75.00
4 1777.997260 1777 12 31 ... 7.7 34.00 75.00
First, compare df1.YearDeci with df2.YearDeci on an "each with each" basis. To perform the comparison, use the np.isclose function with the assumed absolute tolerance.
The result is a boolean array with one row per df1 entry and one column per df2 entry.
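As a small illustration of this pairwise comparison, here is a sketch that uses only the YearDeci values from the sample rows in the question. The names y1, y2 and close are chosen here for the example and are not part of the original answer.

```python
import numpy as np

# YearDeci values copied from the sample rows in the question
y1 = np.array([1551.997260, 1661.997260, 1720.535519, 1734.997260, 1777.997260])
y2 = np.array([1669.510753, 1720.535519, 1780.000000,
               1803.388014, 1803.665753, 1803.388014])

# Broadcasting a (5, 1) column against a (1, 6) row compares every pair
close = np.isclose(y1[:, np.newaxis], y2[np.newaxis, :], atol=0.0001, rtol=0)

print(close.shape)         # (5, 6)
print(np.argwhere(close))  # [[2 1]]
```

The single True value sits at row 2, column 1, i.e. row 2 of df1 matches row 1 of df2.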
Then, using np.argwhere, find the indices of the True values, i.e. the indices of "correlated" rows from df1 and df2, and create a DataFrame from them.
The code to perform the above operations is:
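import numpy as np
import pandas as pd
# Convert to NumPy first: Series[:, np.newaxis] is not supported by recent pandas versions
ind = pd.DataFrame(np.argwhere(np.isclose(df1.YearDeci.to_numpy()[:, np.newaxis],
    df2.YearDeci.to_numpy()[np.newaxis, :], atol=0.0001, rtol=0)),
    columns=['ind1', 'ind2'])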
Then, having pairs of indices pointing to "correlated" rows in both DataFrames, perform the following merge:
result = ind.merge(df1, left_on='ind1', right_index=True)\
.merge(df2, left_on='ind2', right_index=True, suffixes=['_1', '_2'])
The final step is to drop both "auxiliary index columns" (ind1 and ind2):
result.drop(columns=['ind1', 'ind2'], inplace=True)
The result (divided into 2 parts) is:
YearDeci_1 Year_1 Month_1 Day_1 Magnitude Lat_1 Lon_1 YearDeci_2 \
0 1720.535519 1720 7 15 6.5 28.37 77.09 1720.535519
Year_2 Month_2 Day_2 Hour Seconds Mb Lat_2 Lon_2
0 1720 7 15 0 0 NaN 28.7 77.2
The indices of the common rows are already in the variable ind.
So to find the unique entries, all we need to do is drop the common rows from df1 according to the indices in ind. The common rows can be selected with the same indices and, if a separate file is wanted, written to a CSV file:
df1_common = df1.iloc[ind.ind1.unique()]
df1_common.to_csv("df1_common.csv", index=False)
df1_uniq = df1.drop(df1.index[ind.ind1])
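As an alternative to going through the index pairs, the two outputs asked for in the question can be sketched with a boolean mask over df1. This is not the method from the answer above, only a variation on the same np.isclose comparison. The frames below are reduced versions of the question's data (most columns omitted), and has_match is a name introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Reduced versions of the question's frames (only a few columns kept)
df1 = pd.DataFrame({
    "YearDeci": [1551.997260, 1661.997260, 1720.535519, 1734.997260, 1777.997260],
    "Magnitude": [7.5, 7.5, 6.5, 7.5, 7.7],
})
df2 = pd.DataFrame({
    "YearDeci": [1669.510753, 1720.535519, 1780.000000,
                 1803.388014, 1803.665753, 1803.388014],
})

# Pairwise comparison, shape (len(df1), len(df2))
close = np.isclose(df1["YearDeci"].to_numpy()[:, np.newaxis],
                   df2["YearDeci"].to_numpy()[np.newaxis, :],
                   atol=0.0001, rtol=0)

# A df1 row is "common" if it is close to at least one df2 row
has_match = close.any(axis=1)

row_common = df1[has_match]
row_unique = df1[~has_match]

print(row_common.index.tolist())  # [2]
print(row_unique.index.tolist())  # [0, 1, 3, 4]
```

Because the mask is positional, this version does not depend on df1 having the default 0..n-1 index, and a df1 row that matches several df2 rows still appears only once in row_common.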