Filter a dataframe based on multiple columns of another dataframe

Question

I have a dataframe like this:

    ID1    ID2
0   foo    bar
1   fizz   buzz

And another like this:

    ID1    ID2    Count    Code   
0   abc    def      1        A
1   fizz   buzz     5        A
2   fizz1  buzz2    3        C
3   foo    bar      6        Z
4   foo    bar      6        Z

What I would like to do is filter the second dataframe where ID1 and ID2 match a row in the first dataframe, and whenever there's a match I want to remove that row from the first dataframe to avoid duplicates. This would yield a dataframe that looks like this:

    ID1    ID2    Count    Code   
1   fizz   buzz     5        A
3   foo    bar      6        Z

I know I can do this by nesting for loops, stepping through all the rows, and manually removing a row from the first frame whenever I get a match, but I am wondering if there is a more pythonic way to do this. I am not experienced in pandas, so there may be a much cleaner way that I do not know about. I was previously using .isin() but had to scrap it. Each ID pair can exist in the dataframe up to N times, and I need the filtered frame to contain between 0 and N instances of a pair.

Answer 1

Use merge with drop_duplicates, if the join columns are the only columns the two frames have in common:

df = pd.merge(df1,df2.drop_duplicates())
print (df)
    ID1   ID2  Count Code
0   foo   bar      6    Z
1  fizz  buzz      5    A

If you need to check for duplicates only in the ID columns:

df = pd.merge(df1,df2.drop_duplicates(subset=['ID1','ID2']))
print (df)
    ID1   ID2  Count Code
0   foo   bar      6    Z
1  fizz  buzz      5    A
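
The two variants above give the same result on the question's data because the duplicated foo/bar rows are identical in every column. A minimal sketch of where they diverge, assuming a hypothetical variant of df2 in which the two foo/bar rows differ in Count (this data is not from the question):

```python
import pandas as pd

df1 = pd.DataFrame({"ID1": ["foo", "fizz"], "ID2": ["bar", "buzz"]})

# Hypothetical variant of df2: the two foo/bar rows differ in Count.
df2 = pd.DataFrame({
    "ID1": ["fizz", "foo", "foo"],
    "ID2": ["buzz", "bar", "bar"],
    "Count": [5, 6, 7],
    "Code": ["A", "Z", "Z"],
})

# Whole-row de-duplication: no two rows are fully identical, so nothing is dropped.
full_row = pd.merge(df1, df2.drop_duplicates(), on=["ID1", "ID2"])

# De-duplication on the ID columns only: the first row of each pair is kept.
id_only = pd.merge(df1, df2.drop_duplicates(subset=["ID1", "ID2"]), on=["ID1", "ID2"])

print(len(full_row), len(id_only))
```

With subset, which of the duplicated rows survives depends on row order, since drop_duplicates keeps the first occurrence by default.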

If more columns overlap, add the on parameter:

df = pd.merge(df1, df2.drop_duplicates(), on=['ID1','ID2'])

If you do not want to remove duplicate rows:

df = pd.merge(df1,df2)
print (df)
    ID1   ID2  Count Code
0   foo   bar      6    Z
1   foo   bar      6    Z
2  fizz  buzz      5    A

Answer 2

Try this:

df2.merge(df1[['ID1','ID2']])
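
A merge returns a fresh 0..n-1 index, whereas the expected output in the question shows the original row labels of df2. If those labels matter, one possible sketch is to carry the index through the merge as an ordinary column (the column name "index" is simply what reset_index produces for an unnamed index):

```python
import pandas as pd

df1 = pd.DataFrame({"ID1": ["foo", "fizz"], "ID2": ["bar", "buzz"]})
df2 = pd.DataFrame({
    "ID1": ["abc", "fizz", "fizz1", "foo", "foo"],
    "ID2": ["def", "buzz", "buzz2", "bar", "bar"],
    "Count": [1, 5, 3, 6, 6],
    "Code": ["A", "A", "C", "Z", "Z"],
})

# Move df2's index into a column, merge on the shared ID columns,
# then restore the original labels as the index.
result = (
    df2.reset_index()
       .merge(df1[["ID1", "ID2"]])
       .set_index("index")
       .rename_axis(None)
)
print(result)
```

This keeps every matching row of df2, including both foo/bar rows.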

Answer 3

Or maybe try this?

df2.loc[(df2.ID1.isin(df1.ID1))&(df2.ID2.isin(df1.ID2)),:].drop_duplicates()


Out[224]: 
    ID1   ID2  Count Code
1  fizz  buzz      5    A
3   foo   bar      6    Z
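
One caveat with this approach: the two isin checks are applied to each column independently, so a row can pass even when its (ID1, ID2) combination is not a pair in df1. A small sketch, assuming a hypothetical row ("foo", "buzz") that is not in the question's data:

```python
import pandas as pd

df1 = pd.DataFrame({"ID1": ["foo", "fizz"], "ID2": ["bar", "buzz"]})

# Hypothetical data: ("foo", "buzz") mixes values from two different df1 rows.
df2 = pd.DataFrame({
    "ID1": ["fizz", "foo", "foo"],
    "ID2": ["buzz", "bar", "buzz"],
    "Count": [5, 6, 9],
    "Code": ["A", "Z", "X"],
})

# Column-wise check: each column is tested on its own.
mask = df2["ID1"].isin(df1["ID1"]) & df2["ID2"].isin(df1["ID2"])
columnwise = df2.loc[mask]

# Pair-wise check via merge on both ID columns, for comparison.
pairwise = df2.merge(df1[["ID1", "ID2"]])

print(len(columnwise), len(pairwise))
```

The column-wise mask lets the mixed row through, while the merge does not.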

Answer 4

Using isin on a list of tuples

df2[
    pd.Series(
        list(zip(df2.ID1.values, df2.ID2.values))
    ).isin(list(zip(df1.ID1.values, df1.ID2.values)))
]

    ID1   ID2  Count Code
1  fizz  buzz      5    A
3   foo   bar      6    Z
4   foo   bar      6    Z
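
The same pair-wise test can also be written with a MultiIndex instead of zipped tuples. This is an alternative sketch rather than part of the answer above; it returns a plain boolean array, so it does not depend on df2 having a default integer index:

```python
import pandas as pd

df1 = pd.DataFrame({"ID1": ["foo", "fizz"], "ID2": ["bar", "buzz"]})
df2 = pd.DataFrame({
    "ID1": ["abc", "fizz", "fizz1", "foo", "foo"],
    "ID2": ["def", "buzz", "buzz2", "bar", "bar"],
    "Count": [1, 5, 3, 6, 6],
    "Code": ["A", "A", "C", "Z", "Z"],
})

# Build one MultiIndex of (ID1, ID2) pairs per frame and test membership.
wanted = pd.MultiIndex.from_frame(df1[["ID1", "ID2"]])
mask = pd.MultiIndex.from_frame(df2[["ID1", "ID2"]]).isin(wanted)

result = df2[mask]
print(result)
```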

Answer 5

Merge was almost what I wanted, but didn't quite do the job because I have an odd set of requirements where I need to filter out some duplicates but not all of them. A regular merge doesn't work because it keeps all the duplicates, and drop_duplicates() doesn't work because I need to allow some duplicates. I ended up going with the method I described in the question: nested for loops.

matched = []  # positions of df2 rows that found a partner in df1
for i in range(len(df2)):
    for ii in range(len(df1)):
        if df2['ID1'].iloc[i] == df1['ID1'].iloc[ii] and df2['ID2'].iloc[i] == df1['ID2'].iloc[ii]:
            # consume the matched df1 row so it cannot match again
            df1.drop(df1.index[ii], inplace=True)
            matched.append(i)
            break
# collect the matched df2 rows (DataFrame.append was removed in pandas 2.0)
temp_frame = df2.iloc[matched].reset_index(drop=True)
df1 = temp_frame.copy()
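
The "each row of df1 can be used up once" behaviour of the nested loops can probably also be expressed without loops, by numbering the repeats of each ID pair in both frames and merging on that number as well. This is a sketch under that assumption, not the method used above; the helper column name occ is made up for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({"ID1": ["foo", "fizz"], "ID2": ["bar", "buzz"]})
df2 = pd.DataFrame({
    "ID1": ["abc", "fizz", "fizz1", "foo", "foo"],
    "ID2": ["def", "buzz", "buzz2", "bar", "bar"],
    "Count": [1, 5, 3, 6, 6],
    "Code": ["A", "A", "C", "Z", "Z"],
})

keys = ["ID1", "ID2"]

# Number the repeats of each pair: first occurrence 0, second 1, and so on.
left = df2.assign(occ=df2.groupby(keys).cumcount()).reset_index()
right = df1.assign(occ=df1.groupby(keys).cumcount())

# The k-th occurrence of a pair in df2 survives only if df1 also has a k-th occurrence.
result = (
    left.merge(right, on=keys + ["occ"])
        .set_index("index")
        .rename_axis(None)
        .drop(columns="occ")
)
print(result)
```

If df1 listed foo/bar twice, both foo/bar rows of df2 would be kept, so a pair appears in the result as many times as it appears in the frame that has fewer copies of it.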

Answer 1
5 (accepted) 2017-08-01 14:26:06

Answer 2
3 2017-08-01 14:26:14

Answer 3
2 2017-08-01 14:30:57

Answer 4
2 2017-08-01 14:33:43

Answer 5
1 2017-08-02 13:20:28
