
How to find intersection between two pandas series based on an error value

I have two pandas dataframes:

df1 = pd.DataFrame({'col1': [1.2574, 5.3221, 4.3215, 9.8841], 'col2': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col1': [4.326, 9.89, 5.326, 1.2654], 'col2': ['w', 'x', 'y', 'z']})

Now I want to compare the values in col1 of both dataframes. Consider 5.3221 from df1: I want to check whether this value exists in df2['col1'] within an error of 0.005 (in this example, 5.326 from df2['col1'] should be considered equal to 5.3221), and build a third dataframe that holds both columns from df1 and df2 for the rows where that condition is true.

The expected output is:

    col1    col2    col1.1  col2.2
0   5.3221  b       5.326   y
1   4.3215  c       4.326   w

I have defined a function that takes care of the error condition:

def close(a, b, e=0.005):
    return round(abs(a - b), 3) <= e

But I don't know how to apply this to the data without using a for loop. I also know that I can use numpy.intersect1d, but I cannot figure out how.

Any help would be appreciated :)

EDIT: The suggested duplicate answer doesn't address my problem. That question only deals with combining two dataframes based on similar-looking indices. Also, difflib is used to find word matches, not numeric ones. My scenario is completely different.

NumPy's broadcasting can be used for a cross comparison, giving the indices in each frame where the difference falls within the error margin. Then we index into the frames and concatenate the results:

import numpy as np
import pandas as pd

# find where the two frames are close
eps = 0.005
diff = np.abs(df1.col1.to_numpy()[:, np.newaxis] - df2.col1.to_numpy())
inds_1, inds_2 = np.where(diff <= eps)

# filter the frames with these indices
first = df1.iloc[inds_1].reset_index(drop=True)
second = df2.iloc[inds_2].reset_index(drop=True)

# adjust column names of the second one, e.g., "col2.2"
second.columns = [col + f".{j}" for j, col in enumerate(second.columns, start=1)]

# put together
result = pd.concat([first, second], axis=1)

to get

>>> result

     col1 col2  col1.1 col2.2
0  5.3221    b   5.326      y
1  4.3215    c   4.326      w

The intermediate result diff is:

>>> diff

array([[3.0686e+00, 8.6326e+00, 4.0686e+00, 8.0000e-03],
       [9.9610e-01, 4.5679e+00, 3.9000e-03, 4.0567e+00],
       [4.5000e-03, 5.5685e+00, 1.0045e+00, 3.0561e+00],
       [5.5581e+00, 5.9000e-03, 4.5581e+00, 8.6187e+00]])

of shape (len(df1), len(df2)), where the ij'th entry is abs(df1.col1[i] - df2.col1[j]).
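The broadcasting step can also be wrapped in a small reusable function. The following is only a sketch: the name match_within, its parameters, and the "_right" suffix are made up for illustration and are not part of the answer above. It uses np.isclose with rtol=0, so the comparison is a purely absolute tolerance, abs(a - b) <= eps.

```python
import numpy as np
import pandas as pd


def match_within(left, right, col="col1", eps=0.005):
    """Hypothetical helper: pair rows whose `col` values differ by at most `eps`."""
    a = left[col].to_numpy()
    b = right[col].to_numpy()
    # Boolean matrix of shape (len(left), len(right)); rtol=0 keeps the
    # test purely absolute: abs(a[i] - b[j]) <= eps
    mask = np.isclose(a[:, np.newaxis], b, rtol=0, atol=eps)
    i, j = np.nonzero(mask)
    first = left.iloc[i].reset_index(drop=True)
    # "_right" is an arbitrary suffix chosen to keep column names unique
    second = right.iloc[j].reset_index(drop=True).add_suffix("_right")
    return pd.concat([first, second], axis=1)


df1 = pd.DataFrame({'col1': [1.2574, 5.3221, 4.3215, 9.8841], 'col2': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col1': [4.326, 9.89, 5.326, 1.2654], 'col2': ['w', 'x', 'y', 'z']})

pairs = match_within(df1, df2)
print(pairs)
```

Note that the intermediate matrix holds len(left) * len(right) entries, so this is likely to be practical only while both frames are moderately sized.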

I have added code which works.

First compute the pairwise distances between the two columns via broadcasting, then filter by the tolerance. Take the matching rows from each frame and combine them:

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'col1': [1.2574, 5.3221, 4.3215, 9.8841], 'col2': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col1': [4.326, 9.89, 5.326, 1.2654], 'col2': ['w', 'x', 'y', 'z']})

# Get the target columns
c11 = df1['col1'].to_numpy()
c21 = df2['col1'].to_numpy()

# calculate cross errors by broadcast and filter columns
# these will be indices of rows to be inserted in new df
c = np.argwhere(np.abs(c11[:, np.newaxis] - c21) < 0.005)


x = pd.DataFrame()
# Reset the index before assigning; otherwise pandas aligns on the original row labels and the matched rows end up misaligned
x[['col1', 'col2']] = df1.iloc[c[:, 0]][['col1', 'col2']].reset_index(drop=True)
x[['col1.1', 'col2.2']] = df2.iloc[c[:, 1]][['col1', 'col2']].reset_index(drop=True)

print(x)
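If the full pairwise matrix is too large, one possible alternative (not taken from either answer) is pd.merge_asof with direction="nearest" and a tolerance. This is a sketch under a different assumption: each row of df1 is paired with at most its single nearest value in df2, whereas the broadcasting approach returns every pair within the tolerance. The helper column "key" is made up for illustration, and the result is ordered by col1 because merge_asof requires sorted keys.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1.2574, 5.3221, 4.3215, 9.8841], 'col2': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col1': [4.326, 9.89, 5.326, 1.2654], 'col2': ['w', 'x', 'y', 'z']})

eps = 0.005

# merge_asof requires both frames to be sorted by their join key
left = df1.sort_values("col1")
right = df2.sort_values("col1").rename(columns={"col1": "col1.1", "col2": "col2.2"})
# "key" is a hypothetical helper column, so that col1.1 survives the merge as ordinary data
right["key"] = right["col1.1"]

merged = pd.merge_asof(left, right, left_on="col1", right_on="key",
                       tolerance=eps, direction="nearest")

# Rows of df1 with no partner within eps carry NaN in the right-hand columns
matched = (merged.dropna(subset=["col1.1"])
                 [["col1", "col2", "col1.1", "col2.2"]]
                 .reset_index(drop=True))
print(matched)
```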
