Find differences between DataFrames based on specific columns and output the entire record

Question

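I want to compare 2 CSV files (A and B) and find the rows that exist in B but not in A, based only on specific columns.

I found a few answers, but none of them gives the result I expect. Answer 1: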

df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]

This does not work. It works for a single column, but not for multiple columns.

Answer 2:
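
For context, one likely reason the snippet above fails is that new['column1', 'column2'] is read as a lookup of a single column labelled by the tuple ('column1', 'column2'), which raises a KeyError on ordinary column names. A minimal sketch of a multi-column membership test, assuming the key columns are named column1 and column2 and using made-up toy data in place of the CSV files, could look like this:

```python
import pandas as pd

# Toy stand-ins for the two CSV files (hypothetical data)
old = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"], "value": [10, 20]})
new = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "x", "c"], "value": [10, 99, 30]})

keys = ["column1", "column2"]

# Turn the key columns of each frame into one tuple per row
old_keys = pd.MultiIndex.from_frame(old[keys])
new_keys = pd.MultiIndex.from_frame(new[keys])

# Keep the rows of `new` whose (column1, column2) pair never appears in `old`;
# all columns of `new` are preserved
only_in_new = new[~new_keys.isin(old_keys)]
print(only_in_new)
```

The comparison is made on whole key tuples rather than column by column, so a row counts as present only when every key column matches in the same row of old.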

df = pd.concat([old, new]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
final = df.reindex(idx)

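This takes specific columns as input and outputs only those columns. I want to print the entire record, not just specific columns of the record.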

Answer 1

I tried this and it gave me the rows:

import pandas as pd

# Names of the columns to compare on (replace with your own)
columns = ['column1', 'column2']
new = pd.merge(A, B, how='right', on=columns)

# Any column from the first DataFrame (A) that is not in the list columns.
# You will probably have to add '_x' at the end of the column name.
# 'other_column_x' is a placeholder, replace it with a real column name.
marker = 'other_column_x'
col = new[marker].dropna()

# Keep only the rows where the column from A is missing, i.e. rows of B with no match in A
new = new[~new[marker].isin(col)]

This gives you the rows based on the columns list. Sorry for the poor naming. If you also want to tidy up the column names, use the following code:

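for column in new.columns:
    if column.endswith('_x'):
        # Columns that came from A are empty in the remaining rows, so drop them
        new = new.drop(column, axis=1)
    elif column.endswith('_y'):
        # Strip the '_y' suffix from the columns that came from B
        new = new.rename(columns={column: column[:-len('_y')]})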
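
To make the steps above concrete, here is a small end-to-end sketch on made-up data. The frames A and B and the column name value are assumptions for illustration only, and isna() is used as a shortcut for the dropna/isin step, which behaves the same as long as the chosen column from A has no missing values of its own.

```python
import pandas as pd

# Hypothetical stand-ins for the two CSV files
A = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"], "value": [10, 20]})
B = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "x", "c"], "value": [10, 99, 30]})

columns = ["column1", "column2"]
new = pd.merge(A, B, how="right", on=columns)

# 'value' exists in both frames, so the copy from A is named 'value_x';
# it is NaN exactly for the rows of B that found no match in A
new = new[new["value_x"].isna()]

for column in new.columns:
    if column.endswith("_x"):
        # Left-over columns from A carry no data for these rows
        new = new.drop(column, axis=1)
    elif column.endswith("_y"):
        # Restore the original column names from B
        new = new.rename(columns={column: column[:-2]})

print(new)
```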

Let me know if it works.
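
As a possible alternative that avoids picking a helper column and renaming suffixes, the same result can be sketched with the indicator option of merge. This is not the method from the answer above; the function name rows_only_in_b and the toy data are hypothetical.

```python
import pandas as pd

def rows_only_in_b(a, b, keys):
    """Return the full rows of `b` whose `keys` combination is absent from `a`."""
    # Only the key columns of `a` are needed; de-duplicate them so that
    # rows of `b` are not repeated by the merge
    a_keys = a[keys].drop_duplicates()
    # indicator=True adds a '_merge' column with 'left_only', 'right_only' or 'both'
    merged = b.merge(a_keys, how="left", on=keys, indicator=True)
    return merged[merged["_merge"] == "left_only"].drop(columns="_merge")

# Hypothetical stand-ins for the two CSV files
A = pd.DataFrame({"column1": [1, 2], "column2": ["a", "b"], "value": [10, 20]})
B = pd.DataFrame({"column1": [1, 2, 3], "column2": ["a", "x", "c"], "value": [10, 99, 30]})

result = rows_only_in_b(A, B, ["column1", "column2"])
print(result)
```

Because only the key columns of A take part in the merge, the output keeps the column names of B unchanged and no '_x' or '_y' suffixes appear.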

Solution 1
0 2020-08-23 19:12:29
