Non-matching rows between two DataFrames
I am new to Python and would like to ask for your help. I want to find the non-matching rows between two DataFrames, df1 and df2, each with thousands of rows. Both contain the same number of columns with the same names.
df2 has 10 fewer entries than df1, and I am trying to find out which ones they are. I have tried pd.concat([df1,df2]).drop_duplicates(keep=False) but it returns an empty result.
What could be the reason? Any help or advice would be much appreciated. Thanks a lot.
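The following code removes the rows of df1 that are also present in df2. Note that isin with a DataFrame argument matches values only where both the index and column labels agree, so this assumes that matching rows carry the same index label in both frames:
df1[~df1.isin(df2).all(axis=1)]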
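If the matching rows do not share index labels, one alternative is an anti-join built from merge with indicator=True. The sketch below is only an illustration: the small frames and the column names id and name are made up, and it assumes both frames have the same columns so that merge joins on all of them.

```python
import pandas as pd

# Hypothetical sample data; the real df1 and df2 would have thousands of rows.
df1 = pd.DataFrame({"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]})
df2 = pd.DataFrame({"id": [1, 3], "name": ["a", "c"]})

# Left-join df1 against the distinct rows of df2 on all shared columns.
# indicator=True adds a "_merge" column saying where each row was found.
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)

# Rows marked "left_only" exist in df1 but have no counterpart in df2.
only_in_df1 = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

Dropping duplicates from df2 before the merge keeps the join from multiplying rows of df1 that match several identical rows in df2.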
concat combines two frames, whereas you are trying to find the difference between two frames. This can be done with compare. As the doc example shows, given these two frames:
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    columns=["col1", "col2", "col3"],
)
# Copy the frame and change two cells so there is something to compare.
df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0
You can find the rows that differ with compare:
df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0
In this case compare returns one row for each row that contains a difference, and only the columns that actually differ.
compare can also return the equal values or all of the original values:
df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0
or
df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0
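One caveat for the situation in the question: compare expects both frames to carry identical row and column labels, so frames of different lengths raise a ValueError. The sketch below uses two made-up frames, left and right, with a hypothetical column k. It shows the failure and one possible workaround, restricting both frames to their shared index labels. This is only useful if rows with the same label are meant to correspond.

```python
import pandas as pd

# Hypothetical frames of different lengths, standing in for df1 and df2.
left = pd.DataFrame({"k": [1, 2, 3]})
right = pd.DataFrame({"k": [1, 5]})

# compare refuses frames whose labels differ.
try:
    left.compare(right)
    compare_failed = False
except ValueError:
    compare_failed = True

# Workaround: compare only the rows whose index labels exist in both frames.
shared = left.index.intersection(right.index)
diff = left.loc[shared].compare(right.loc[shared])

print(compare_failed)
print(diff)
```

The rows of left whose labels are missing from right are not covered by diff and have to be looked up separately, for example with left.index.difference(right.index).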
Having 10 fewer entries in df2 does not mean those 10 rows will survive drop_duplicates(keep=False). You probably already have duplicates inside the first DataFrame: keep=False drops every copy of a repeated row, so rows duplicated within df1 disappear along with the rows shared with df2.
Demo:
import pandas as pd

# Length: 7
df1 = pd.DataFrame({'col1': list('AAABCDE'),
                    'col2': list('FFFGHIJ'),
                    'col3': list('1112345')})
# Length: 5
df2 = pd.DataFrame({'col1': list('ABCDE'),
                    'col2': list('FGHIJ'),
                    'col3': list('12345')})
Your code:
>>> pd.concat([df1,df2]).drop_duplicates(keep=False)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []
Try:
>>> len(df1.drop_duplicates())
5
>>> len(df2.drop_duplicates())
5
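To see exactly which rows account for the difference in length, one option is to count how often each row occurs in each frame and subtract the counts. The sketch below reuses the demo frames and relies on collections.Counter from the standard library. The helper row_counts is a name introduced here for illustration, not a pandas function.

```python
from collections import Counter

import pandas as pd

df1 = pd.DataFrame({"col1": list("AAABCDE"),
                    "col2": list("FFFGHIJ"),
                    "col3": list("1112345")})
df2 = pd.DataFrame({"col1": list("ABCDE"),
                    "col2": list("FGHIJ"),
                    "col3": list("12345")})


def row_counts(frame):
    """Count occurrences of each row, treating a row as a plain tuple."""
    return Counter(frame.itertuples(index=False, name=None))


# Counter subtraction keeps only rows that occur more often in df1 than in df2.
extra_in_df1 = row_counts(df1) - row_counts(df2)
print(extra_in_df1)  # Counter({('A', 'F', '1'): 2})
```

Here the two surplus rows are the extra copies of ('A', 'F', '1'), which matches the length difference of 7 - 5 = 2.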
Assuming df1 and df2 are both pandas DataFrames with identical index and column labels, the following code returns True for rows where every column matches and False otherwise:
print((df1 == df2).all(axis=1))
To check every column in every row individually, try this:
print((df1 == df2).stack())
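The choice of reduction matters when turning the cell-by-cell comparison into a per-row answer. As a small sketch on made-up frames a and b (hypothetical names and data, with identical labels so that == is allowed), any(axis=1) flags a row as soon as one column matches, while all(axis=1) requires every column to match:

```python
import pandas as pd

# Hypothetical frames with identical index and column labels.
a = pd.DataFrame({"x": [1, 2, 3], "y": ["p", "q", "r"]})
b = pd.DataFrame({"x": [1, 2, 9], "y": ["p", "z", "r"]})

cell_equal = a == b                         # element-wise comparison
rows_fully_equal = cell_equal.all(axis=1)   # True only if every column matches
rows_partly_equal = cell_equal.any(axis=1)  # True if at least one column matches

# Rows of a that differ from b in at least one column.
mismatched = a[~rows_fully_equal]

print(rows_fully_equal.tolist())   # [True, False, False]
print(rows_partly_equal.tolist())  # [True, True, True]
print(mismatched)
```

Note that == treats NaN as unequal to NaN, so rows containing missing values are reported as mismatched even when both frames have NaN in the same cell.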