Compare Multiple Columns to Get Rows that are Different in Two Pandas Dataframes

I have two dataframes:

df1=
    A    B   C
0   A0   B0  C0
1   A1   B1  C1
2   A2   B2  C2

df2=
    A    B   C
0   A2   B2  C10
1   A1   B3  C11
2   A9   B4  C12

and I want to find rows in df1 that are not found in df2, based on one or more columns. So, if I only compare column 'A', then the following row from df1 is not found in df2 (note that column 'B' and column 'C' are not used for the comparison between df1 and df2):

    A    B   C
0   A0   B0  C0

And I would like to return a series indicating whether each row of df1 is found in df2:

0   False
1   True
2   True

Or, if I only compare column 'A' and column 'B', then the following rows from df1 are not found in df2 (note that column 'C' is not used for the comparison between df1 and df2):

    A    B   C
0   A0   B0  C0
1   A1   B1  C1

And I would want to return a series indicating whether each row of df1 is found in df2:

0   False
1   False
2   True

I know how to accomplish this using sets, but I am looking for a straightforward Pandas way of doing it.
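
For reference, here is a minimal sketch that builds the two frames above and computes the desired series with plain Python sets. The helper name found_by_sets is hypothetical, introduced only for illustration; it is one possible reading of "using sets", not necessarily the approach the question has in mind.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']})
df2 = pd.DataFrame({'A': ['A2', 'A1', 'A9'],
                    'B': ['B2', 'B3', 'B4'],
                    'C': ['C10', 'C11', 'C12']})


def found_by_sets(left, right, cols):
    """Hypothetical helper: True where a row of left[cols] also occurs in right[cols]."""
    # Collect the key tuples of the right frame in a set for fast lookup.
    right_keys = set(zip(*(right[c] for c in cols)))
    # Test each key tuple of the left frame for membership.
    flags = [key in right_keys for key in zip(*(left[c] for c in cols))]
    return pd.Series(flags, index=left.index)


found_a = found_by_sets(df1, df2, ['A'])         # compare on 'A' only
found_ab = found_by_sets(df1, df2, ['A', 'B'])   # compare on 'A' and 'B'
```

Negating either series with ~ gives a mask that selects the rows of df1 that are missing from df2.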

Ideally, one would like to be able to just use ~df1[COLS].isin(df2[COLS]) as a mask, but this requires index labels to match ( https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html ).
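
To make the index-label issue concrete, the sketch below compares the two behaviours on the frames from the question, using only column 'A'. The variable names aligned and by_value are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']})
df2 = pd.DataFrame({'A': ['A2', 'A1', 'A9'],
                    'B': ['B2', 'B3', 'B4'],
                    'C': ['C10', 'C11', 'C12']})
COLS = ['A']

# DataFrame.isin with a DataFrame argument compares cell by cell,
# matching on both the index label and the column label.
aligned = df1[COLS].isin(df2[COLS]).all(axis=1)

# Series.isin treats its argument as a plain collection of values,
# so the position of a value in df2 does not matter.
by_value = df1['A'].isin(df2['A'])
```

Here 'A2' occurs in both frames but under different index labels (2 in df1, 0 in df2), so the aligned comparison reports it as absent while the value-based one finds it.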

Here is a succinct form that uses .isin but converts the second DataFrame to a dict so that index labels don't need to match:

COLS = ['A', 'B'] # or whichever columns to use for comparison

df1[~df1[COLS].isin(df2[COLS].to_dict(
    orient='list')).all(axis=1)]
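
One caveat worth checking: with a dict argument, .isin tests each column independently, so a row counts as found whenever each of its values appears somewhere in the corresponding column of df2, not necessarily in the same row. The sketch below uses a hypothetical variant of df2 (named df2_alt, not from the original post) to show the difference, and contrasts it with a row-wise test based on pd.MultiIndex.from_frame, which assumes pandas 0.24 or later.

```python
import pandas as pd

COLS = ['A', 'B']

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2']})
# Hypothetical data: 'A1' and 'B1' both occur, but never in the same row.
df2_alt = pd.DataFrame({'A': ['A2', 'A1', 'A9'],
                        'B': ['B2', 'B3', 'B1']})

# Column-wise membership: each column is checked on its own.
column_wise = df1[COLS].isin(df2_alt[COLS].to_dict(orient='list')).all(axis=1)

# Row-wise membership: each (A, B) pair is checked as a unit.
row_wise = pd.Series(
    pd.MultiIndex.from_frame(df1[COLS]).isin(pd.MultiIndex.from_frame(df2_alt[COLS])),
    index=df1.index,
)
```

On the frames from the question the two tests happen to agree; they diverge only when values from different rows of df2 combine to imitate a row of df1.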

~df1['A'].isin(df2['A'])

Should get you the series you want

df1[ ~df1['A'].isin(df2['A'])]

The dataframe:

    A   B   C
0   A0  B0  C0

If your pandas version is 0.17.0 or later, you can use pd.merge: pass the columns of interest and how='left', and set indicator=True to show whether the values are present only in the left frame or in both. You can then test whether the appended _merge column is equal to 'both':

In [102]:
pd.merge(df1, df2, on='A',how='left', indicator=True)['_merge'] == 'both'

Out[102]:
0    False
1     True
2     True
Name: _merge, dtype: bool

In [103]:
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)['_merge'] == 'both'

Out[103]:
0    False
1    False
2     True
Name: _merge, dtype: bool

Output from the merge:

In [104]:
pd.merge(df1, df2, on='A',how='left', indicator=True)

Out[104]:
    A B_x C_x  B_y  C_y     _merge
0  A0  B0  C0  NaN  NaN  left_only
1  A1  B1  C1   B3  C11       both
2  A2  B2  C2   B2  C10       both

In [105]:    
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)

Out[105]:
    A   B C_x  C_y     _merge
0  A0  B0  C0  NaN  left_only
1  A1  B1  C1  NaN  left_only
2  A2  B2  C2  C10       both
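
The merge above returns one row per row of df1 only when the key columns of df2 contain no repeated combinations; a left merge against repeated keys yields extra rows, and the merged result also gets a fresh 0..n-1 index. A minimal sketch of a wrapper that guards against both follows. The helper name found_in and the sample data (a non-default index on df1, a repeated 'A1' in df2) are assumptions made for illustration.

```python
import pandas as pd


def found_in(left, right, cols):
    """Hypothetical helper: boolean Series aligned to left.index, True where
    the row's values in cols also occur as a row of right[cols]."""
    # Drop repeated key combinations so the left merge cannot add rows.
    keys = right[cols].drop_duplicates()
    merged = left[cols].merge(keys, on=cols, how='left', indicator=True)
    # A left merge keeps the order of the left frame, so the values can be
    # re-attached to the original index positionally.
    return pd.Series((merged['_merge'] == 'both').values, index=left.index)


df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']},
                   index=[10, 11, 12])
df2 = pd.DataFrame({'A': ['A2', 'A1', 'A9', 'A1'],
                    'B': ['B2', 'B3', 'B4', 'B5'],
                    'C': ['C10', 'C11', 'C12', 'C13']})

mask_a = found_in(df1, df2, ['A'])
mask_ab = found_in(df1, df2, ['A', 'B'])
```

Because the result carries df1's own index, df1[~mask_ab] selects the unmatched rows directly.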

Method (1)


In [63]:
df1['A'].isin(df2['A']) & df1['B'].isin(df2['B'])
Out[63]:

0   False
1   False
2   True

Method (2)


You can use a left merge to obtain the values that exist in both frames plus the values that exist only in the first data frame:

In [10]:
left = pd.merge(df1 , df2 , on = ['A' , 'B'] ,how = 'left')
left
Out[10]:
    A   B   C_x C_y
0   A0  B0  C0  NaN
1   A1  B1  C1  NaN
2   A2  B2  C2  C10

Values that exist only in the first frame will then have NaN in the columns that come from the other data frame, so you can filter on those NaN values as follows:

In [16]:
left.loc[pd.isnull(left['C_y']) , 'A':'C_x']
Out[16]:
    A   B   C_x
0   A0  B0  C0
1   A1  B1  C1

If you want to know whether each row of the first frame exists in the second, you can do the following:

In [20]:
pd.notnull(left['C_y'])
Out[20]:
0    False
1    False
2     True
Name: C_y, dtype: bool
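
Note that testing C_y for NaN assumes column 'C' of the second frame never holds a missing value itself. The sketch below uses a hypothetical variant of df2 (df2_missing, not from the original post) in which the matching row has no 'C' value, and compares the NaN test with the indicator=True option of pd.merge.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                    'B': ['B0', 'B1', 'B2'],
                    'C': ['C0', 'C1', 'C2']})
# Hypothetical data: the row that matches ('A2', 'B2') has a missing 'C'.
df2_missing = pd.DataFrame({'A': ['A2', 'A1', 'A9'],
                            'B': ['B2', 'B3', 'B4'],
                            'C': [None, 'C11', 'C12']})

left = pd.merge(df1, df2_missing, on=['A', 'B'], how='left', indicator=True)

# The NaN test cannot tell "no match" apart from "match with a missing value".
by_notnull = pd.notnull(left['C_y'])

# The indicator column records the match itself, independent of other columns.
by_indicator = left['_merge'] == 'both'
```

In this variant the NaN test reports the third row as absent even though its key pair exists in both frames, while the indicator column still marks it as found.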
