
Python/Pandas: Compare multiple columns in two dataframes and remove row if no matches found

I am learning Python with Pandas and trying to work out the most efficient way to compare multiple selected columns in two dataframes to find a match. For example, if I have the following two dataframes:

Frame 1
      A     B    C    D    E    F    
001   10    0    0    10   0    10


Frame 2
      A     B    C    D    E    F
200   10    0    10   0    10   0
201   0     10   10   0    0    10
202   0     10   0    0    0    0
203   0     0    0    10   0    10

I'm looking for a way to compare columns A, B, C, D in the two dataframes in order to drop rows from Frame 2 that do not match a 10 from Frame 1 in any of these columns.

In this case, I would expect it to drop rows 201 and 202 because they have no matches, whereas rows 200 and 203 each have 1 match (even though row 200 has 1 column that does not match).

I've tried looping through all the rows in Frame 2 and comparing them with Frame 1:

letters = ['A', 'B', 'C', 'D']

for ix, row in frame_2.iterrows():
    for letter in letters:
        if frame_1[letter].values[0] != frame_2.loc[ix, letter]:
            frame_2.drop(ix, inplace=True)
            break

This removed some rows but not all.

Is there an efficient way to loop through all the rows and check if there's a single match in any of the columns of another dataframe?

Thanks in advance for the help!

I think the simplest solution is to replace every non-10 value with one placeholder in df1 and a different placeholder in df2, so that only 10s can match. Then compare each column with isin (which also works if df1 has more rows and so more values to compare against), collect the resulting boolean masks, concat them into a boolean DataFrame, and filter with any to keep the rows that have at least one True:

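import pandas as pd

letters = ['A', 'B', 'C', 'D']

out = []
for letter in letters:
    # non-10 values become 0 in df2 and 1 in df1, so only 10s can match
    m = df2[letter].mask(lambda x: x != 10, 0).isin(df1[letter].mask(lambda x: x != 10, 1))
    out.append(m)

# keep the rows of df2 with at least one matching 10
df = df2[pd.concat(out, axis=1).any(axis=1)]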
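
To check the approach end to end, here is a minimal self-contained sketch that rebuilds the two sample frames from the question and applies the same masking logic. The integer index labels are an assumption about how the frames were built (001 may well be a string in the real data).

```python
import pandas as pd

letters = ['A', 'B', 'C', 'D']

# Sample frames from the question; integer index labels are an assumption.
df1 = pd.DataFrame(
    [[10, 0, 0, 10, 0, 10]],
    index=[1], columns=list('ABCDEF'),
)
df2 = pd.DataFrame(
    [[10, 0, 10, 0, 10, 0],
     [0, 10, 10, 0, 0, 10],
     [0, 10, 0, 0, 0, 0],
     [0, 0, 0, 10, 0, 10]],
    index=[200, 201, 202, 203], columns=list('ABCDEF'),
)

out = []
for letter in letters:
    # Non-10 cells get different placeholders (0 vs 1), so isin can only
    # return True where both frames hold a 10 in this column.
    left = df2[letter].mask(lambda x: x != 10, 0)
    right = df1[letter].mask(lambda x: x != 10, 1)
    out.append(left.isin(right))

# One boolean column per letter; keep rows with at least one True.
df = df2[pd.concat(out, axis=1).any(axis=1)]
print(df.index.tolist())  # [200, 203]
```

Rows 200 and 203 survive because they share a 10 with Frame 1 in columns A and D respectively, which is the result the question asks for.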

Alternative solution:

import numpy as np
df = df2[np.logical_or.reduce(out)]
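
Since only the value 10 matters here, the per-column loop can probably be avoided altogether. The following sketch is not part of the answer above: it swaps in plain boolean comparisons with eq plus NumPy broadcasting, and the names df1_has_10, df2_is_10 and keep are made up for illustration. It assumes the same sample frames as in the question, with integer index labels.

```python
import pandas as pd

letters = ['A', 'B', 'C', 'D']

# Sample frames from the question; integer index labels are an assumption.
df1 = pd.DataFrame([[10, 0, 0, 10, 0, 10]], index=[1], columns=list('ABCDEF'))
df2 = pd.DataFrame(
    [[10, 0, 10, 0, 10, 0],
     [0, 10, 10, 0, 0, 10],
     [0, 10, 0, 0, 0, 0],
     [0, 0, 0, 10, 0, 10]],
    index=[200, 201, 202, 203], columns=list('ABCDEF'),
)

# One flag per column: does df1 hold a 10 anywhere in that column?
df1_has_10 = df1[letters].eq(10).any(axis=0).to_numpy()

# One flag per cell of df2: is the cell a 10? Shape is (rows, columns).
df2_is_10 = df2[letters].eq(10).to_numpy()

# Broadcasting applies the per-column flags to every row of df2;
# a row is kept if any of its 10s sits in a column where df1 has a 10.
keep = (df2_is_10 & df1_has_10).any(axis=1)

df = df2[keep]
print(df.index.tolist())  # [200, 203]
```

This should give the same rows as the isin version for this data, because a cell counts as a match only when it is 10 in df2 and 10 appears in the same column of df1.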
