
Non-matching rows between two DataFrames

I am new to Python and would like to ask for your help with this. I would like to find the non-matching rows between two DataFrames, df1 and df2, each with thousands of rows. Both contain the same number of columns, with the same names.

df2 has 10 fewer entries than df1, and I am trying to find out which ones they are. I have tried pd.concat([df1,df2]).drop_duplicates(keep=False), but it returns an empty result.

What could be the reason? Any help or advice would be much appreciated. Thanks a lot.

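The following code keeps only the rows of df1 that are not present in df2 at the same index label. Note that isin with a DataFrame argument aligns on both index and column labels, so the row-wise reduction with all(axis=1) is needed to turn the element-wise result into a row filter:

df1[~df1.isin(df2).all(axis=1)]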
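If the rows should be matched by value regardless of index label, one common alternative is an anti-join via merge with indicator=True. This is only a sketch: the small frames below are made-up sample data, and it assumes both frames share all column names so that merge joins on every column.

```python
import pandas as pd

# Hypothetical sample data: df2 lacks the last two rows of df1.
df1 = pd.DataFrame({"col1": list("ABCDE"),
                    "col2": list("FGHIJ"),
                    "col3": list("12345")})
df2 = pd.DataFrame({"col1": list("ABC"),
                    "col2": list("FGH"),
                    "col3": list("123")})

# Left-join on all shared columns; the "_merge" column records where each row was found.
# Dropping duplicates from df2 first avoids multiplying rows of df1 during the join.
merged = df1.merge(df2.drop_duplicates(), how="left", indicator=True)

# Rows tagged "left_only" exist in df1 but have no value-equal row in df2.
only_in_df1 = merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_df1)
```

Because the join is by value, this works even when the two frames have different lengths or unrelated index labels.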

concat combines two frames, whereas you are trying to find the difference between them. This can be done with compare, provided both frames have identical row and column labels. As the documentation example shows, given these two frames:

import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "col1": ["a", "a", "b", "b", "a"],
        "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
        "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
    },
    columns=["col1", "col2", "col3"],
)

df2 = df.copy()
df2.loc[0, 'col1'] = 'c'
df2.loc[2, 'col3'] = 4.0

You can find the differing rows with compare:

df.compare(df2)
  col1       col3
  self other self other
0    a     c  NaN   NaN
2  NaN   NaN  3.0   4.0

In this case, compare returns one row for every row that contains a difference, and only the columns that actually differ.

compare can also return the equal values, or all of the original values:

df.compare(df2, keep_equal=True)
  col1       col3
  self other self other
0    a     c  1.0   1.0
2    b     b  3.0   4.0

or

df.compare(df2, keep_shape=True, keep_equal=True)
  col1       col2       col3
  self other self other self other
0    a     c  1.0   1.0  1.0   1.0
1    a     a  2.0   2.0  2.0   2.0
2    b     b  3.0   3.0  3.0   4.0
3    b     b  NaN   NaN  4.0   4.0
4    a     a  5.0   5.0  5.0   5.0
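One caveat for the situation in the question: compare expects both frames to carry identical row and column labels, so calling it directly on frames of different lengths raises a ValueError. The sketch below uses made-up sample frames to show the error and one possible workaround, restricting both frames to the index labels they share. That workaround assumes the index labels actually identify the same records in both frames, which may not hold for the asker's data.

```python
import pandas as pd

# Hypothetical sample data: df2 has two rows fewer than df1 and one changed value.
df1 = pd.DataFrame({"col1": list("ABCDE"), "col3": [1, 2, 3, 4, 5]})
df2 = df1.iloc[:3].copy()
df2.loc[1, "col3"] = 20

# compare refuses frames whose labels differ.
try:
    df1.compare(df2)
    error = None
except ValueError as exc:
    error = exc
print(type(error).__name__)

# Possible workaround: compare only the rows whose index labels exist in both frames.
common = df1.index.intersection(df2.index)
diff = df1.loc[common].compare(df2.loc[common])
print(diff)
```

Rows that exist in only one frame are not reported by this approach; they have to be found separately, for example from the index difference.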

Just because df2 has 10 fewer entries doesn't mean that drop_duplicates(keep=False) will leave 10 rows behind. You probably already have duplicate rows inside the first DataFrame, and keep=False discards every copy of a duplicated row.

Demo:

import pandas as pd

# Length: 7
df1 = pd.DataFrame({'col1': list('AAABCDE'),
                    'col2': list('FFFGHIJ'),
                    'col3': list('1112345')})

# Length: 5
df2 = pd.DataFrame({'col1': list('ABCDE'),
                    'col2': list('FGHIJ'),
                    'col3': list('12345')})

Your code:

>>> pd.concat([df1,df2]).drop_duplicates(keep=False)
Empty DataFrame
Columns: [col1, col2, col3]
Index: []

Try:

>>> len(df1.drop_duplicates())
5

>>> len(df2.drop_duplicates())
5
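To see which rows account for the difference in length, one option is to count how often each distinct row occurs in each frame and subtract the counts. This is a rough sketch built on the demo frames above; the names counts1, counts2 and surplus are illustrative only, and dropna=False assumes pandas 1.1 or later.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': list('AAABCDE'),
                    'col2': list('FFFGHIJ'),
                    'col3': list('1112345')})
df2 = pd.DataFrame({'col1': list('ABCDE'),
                    'col2': list('FGHIJ'),
                    'col3': list('12345')})

cols = list(df1.columns)

# Number of occurrences of each distinct row in each frame.
counts1 = df1.groupby(cols, dropna=False).size()
counts2 = df2.groupby(cols, dropna=False).size()

# Subtract the counts; a row missing from one side is treated as occurring 0 times.
delta = counts1.sub(counts2, fill_value=0).astype(int)

# Positive values: extra copies in df1. Negative values: extra copies in df2.
surplus = delta[delta != 0]
print(surplus)
```

In the demo data the only non-zero entry is the row (A, F, 1), which occurs two more times in df1 than in df2, matching the length difference of 7 - 5.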

Assuming df1 and df2 are both pandas DataFrames with identical row and column labels, the following code returns True for rows that match in every column and False otherwise:

print((df1 == df2).all(axis=1))

To check every column in every row individually, try this:

print((df1 == df2).stack())
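One thing to watch with element-wise equality is that NaN never compares equal to NaN, so a row containing missing values in both frames is reported as a mismatch. The sketch below reuses the frames from the compare example under the hypothetical names left and right, and shows a NaN-aware variant next to the strict one.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({"col1": ["a", "a", "b", "b", "a"],
                     "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
                     "col3": [1.0, 2.0, 3.0, 4.0, 5.0]})
right = left.copy()
right.loc[0, "col1"] = "c"
right.loc[2, "col3"] = 4.0

# Strict row equality: a NaN in the same cell of both frames still counts as a mismatch.
strict = (left == right).all(axis=1)

# NaN-aware row equality: cells that are missing in both frames count as equal.
nan_aware = ((left == right) | (left.isna() & right.isna())).all(axis=1)

print(pd.DataFrame({"strict": strict, "nan_aware": nan_aware}))
```

Row 3 differs between the two results: its col2 value is NaN in both frames, so only the NaN-aware check treats the row as matching.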

