Compare Multiple Columns to Get Rows that are Different in Two Pandas Dataframes
I have two dataframes:
df1=
A B C
0 A0 B0 C0
1 A1 B1 C1
2 A2 B2 C2
df2=
A B C
0 A2 B2 C10
1 A1 B3 C11
2 A9 B4 C12
and I want to find rows in df1 that are not found in df2 based on one or two columns (or more columns). So, if I only compare column 'A', then the following rows from df1 are not found in df2 (note that column 'B' and column 'C' are not used for the comparison between df1 and df2):
A B C
0 A0 B0 C0
And I would like to return a series that marks which rows of df1 were found in df2:
0 False
1 True
2 True
Or, if I only compare column 'A' and column 'B', then the following rows from df1 are not found in df2 (note that column 'C' is not used for the comparison between df1 and df2):
A B C
0 A0 B0 C0
1 A1 B1 C1
And I would want to return the series:
0 False
1 False
2 True
I know how to accomplish this using sets, but I am looking for a straightforward Pandas way of accomplishing this.
Ideally, one would like to be able to just use ~df1[COLS].isin(df2[COLS]) as a mask, but this requires index labels to match ( https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html ).
Here is a succinct form that uses .isin but converts the second DataFrame to a dict so that index labels don't need to match:
COLS = ['A', 'B'] # or whichever columns to use for comparison
df1[~df1[COLS].isin(df2[COLS].to_dict(
orient='list')).all(axis=1)]
~df1['A'].isin(df2['A'])
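Should get you a boolean series that is True for the rows of df1 whose 'A' value does not appear in df2 (drop the ~ to get the found/not-found series shown in the question). Used as a mask: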
df1[ ~df1['A'].isin(df2['A'])]
The dataframe:
A B C
0 A0 B0 C0
If your pandas version is 0.17.0 or later, you can use pd.merge: pass the columns of interest, how='left', and set indicator=True to see whether the values are present only in the left frame or in both. You can then test whether the appended _merge column is equal to 'both':
In [102]:
pd.merge(df1, df2, on='A',how='left', indicator=True)['_merge'] == 'both'
Out[102]:
0 False
1 True
2 True
Name: _merge, dtype: bool
In [103]:
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)['_merge'] == 'both'
Out[103]:
0 False
1 False
2 True
Name: _merge, dtype: bool
Output from the merge:
In [104]:
pd.merge(df1, df2, on='A',how='left', indicator=True)
Out[104]:
A B_x C_x B_y C_y _merge
0 A0 B0 C0 NaN NaN left_only
1 A1 B1 C1 B3 C11 both
2 A2 B2 C2 B2 C10 both
In [105]:
pd.merge(df1, df2, on=['A', 'B'],how='left', indicator=True)
Out[105]:
A B C_x C_y _merge
0 A0 B0 C0 NaN left_only
1 A1 B1 C1 NaN left_only
2 A2 B2 C2 C10 both
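The merge-based test above can be wrapped in a small function. The following is only a sketch: the helper name found_in is made up for illustration, and it assumes pandas 0.17.0 or later so that indicator=True is available. It merges against the de-duplicated key columns of the second frame only, so that repeated keys in df2 cannot add extra rows to the result.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1", "A2"],
                    "B": ["B0", "B1", "B2"],
                    "C": ["C0", "C1", "C2"]})
df2 = pd.DataFrame({"A": ["A2", "A1", "A9"],
                    "B": ["B2", "B3", "B4"],
                    "C": ["C10", "C11", "C12"]})


def found_in(left, right, cols):
    """Hypothetical helper: True where left's row (restricted to cols) occurs in right."""
    # Keep only the key columns of `right` and drop duplicate keys, so the
    # left merge returns exactly one row per row of `left`.
    keys = right[cols].drop_duplicates()
    merged = left[cols].merge(keys, on=cols, how="left", indicator=True)
    # A left merge keeps the left rows in order, so left's index can be reused.
    return pd.Series((merged["_merge"] == "both").to_numpy(), index=left.index)


mask_a = found_in(df1, df2, ["A"])        # compare on column 'A' only
mask_ab = found_in(df1, df2, ["A", "B"])  # compare on columns 'A' and 'B'
```

With the frames from the question, mask_a should reproduce the first series asked for and mask_ab the second; df1[~mask_ab] then selects the rows of df1 that have no match in df2.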
In [63]:
df1['A'].isin(df2['A']) & df1['B'].isin(df2['B'])
Out[63]:
0 False
1 False
2 True
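One thing to be aware of with the column-by-column .isin approach: each column is tested independently, so a row can be reported as found even though its combination of values never occurs together in df2. The sketch below uses made-up data (not the frames from the question) to show the difference, and compares the value pairs as tuples through a MultiIndex instead. It assumes a pandas version that provides pd.MultiIndex.from_frame (0.24 or later).

```python
import pandas as pd

# Hypothetical data chosen to expose the difference: the pair (A1, B2) is not
# a row of df2, although A1 and B2 each occur somewhere in df2.
df1 = pd.DataFrame({"A": ["A1", "A2"], "B": ["B2", "B2"]})
df2 = pd.DataFrame({"A": ["A2", "A1"], "B": ["B2", "B3"]})
COLS = ["A", "B"]

# Column-by-column test: each column is checked on its own.
columnwise = df1["A"].isin(df2["A"]) & df1["B"].isin(df2["B"])

# Row-wise test: compare the (A, B) pairs as tuples via a MultiIndex.
pairs1 = pd.MultiIndex.from_frame(df1[COLS])
pairs2 = pd.MultiIndex.from_frame(df2[COLS])
rowwise = pd.Series(pairs1.isin(pairs2), index=df1.index)
```

Here columnwise is True for both rows, while rowwise is True only for the second row, whose pair (A2, B2) really is a row of df2. For the data in the question the two tests happen to agree.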
You can use a left merge to obtain the values that exist in both frames plus the values that exist only in the first data frame:
In [10]:
left = pd.merge(df1 , df2 , on = ['A' , 'B'] ,how = 'left')
left
Out[10]:
A B C_x C_y
0 A0 B0 C0 NaN
1 A1 B1 C1 NaN
2 A2 B2 C2 C10
Values that exist only in the first frame will have NaN in the columns that come from the other data frame, so you can filter on those NaN values as follows:
In [16]:
left.loc[pd.isnull(left['C_y']) , 'A':'C_x']
Out[16]:
A B C_x
0 A0 B0 C0
1 A1 B1 C1
If you want to know whether the values in the first frame exist in the second, you can do the following:
In [20]:
pd.notnull(left['C_y'])
Out[20]:
0 False
1 False
2 True
Name: C_y, dtype: bool
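Filtering on NaN in C_y works for this data, but it would also flag rows as missing if df2 itself held NaN in column 'C'. A possible variant, sketched here under the assumption that indicator=True is available (pandas 0.17.0 or later), filters on the _merge column instead and merges only the key columns of df2, so no _x/_y suffixes appear. The name only_in_df1 is chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"A": ["A0", "A1", "A2"],
                    "B": ["B0", "B1", "B2"],
                    "C": ["C0", "C1", "C2"]})
df2 = pd.DataFrame({"A": ["A2", "A1", "A9"],
                    "B": ["B2", "B3", "B4"],
                    "C": ["C10", "C11", "C12"]})
COLS = ["A", "B"]

# Merge against df2's de-duplicated key columns only; df1's own columns keep
# their original names because nothing else is brought in from df2.
merged = df1.merge(df2[COLS].drop_duplicates(), on=COLS,
                   how="left", indicator=True)

# Rows marked 'left_only' have no matching (A, B) pair in df2.
only_in_df1 = merged.loc[merged["_merge"] == "left_only", df1.columns]
```

Note that merge returns a fresh integer index, so the row labels of only_in_df1 match those of df1 only when df1 has a default index, as it does here.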