Find the difference between data frames based on specific columns and output the entire record
I want to compare two CSV files (A and B) and find the rows that are present in B but not in A, based only on specific columns.
I found a few answers to this, but they still don't give the result I expect.
Answer 1:
df = new[~new['column1', 'column2'].isin(old['column1', 'column2'].values)]
This doesn't work. It works for a single column but not for multiple columns.
Answer 2:
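A minimal sketch of how the single-column `isin` idea from Answer 1 might be extended to several columns: build a `MultiIndex` from the key columns of each frame and test membership on whole key tuples. The sample data and the names `cols`, `keys_new`, `keys_old`, and `only_in_new` are assumptions made for illustration; they are not from the original post.

```python
import pandas as pd

# Hypothetical sample data standing in for the two CSV files.
old = pd.DataFrame({"column1": [1, 2, 3],
                    "column2": ["a", "b", "c"],
                    "value": [10, 20, 30]})
new = pd.DataFrame({"column1": [1, 2, 4],
                    "column2": ["a", "x", "d"],
                    "value": [11, 21, 41]})

cols = ["column1", "column2"]  # columns to compare on

# Treat each row's (column1, column2) pair as one key.
keys_new = pd.MultiIndex.from_frame(new[cols])
keys_old = pd.MultiIndex.from_frame(old[cols])

# Keep the full rows of `new` whose key pair does not occur in `old`.
only_in_new = new[~keys_new.isin(keys_old)]
print(only_in_new)
```

Because the mask is applied to `new` itself, every column of the matching records is kept, not only the key columns.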
df = pd.concat([old, new]) # concat dataframes
df = df.reset_index(drop=True) # reset the index
df_gpby = df.groupby(list(df.columns)) #group by
idx = [x[0] for x in df_gpby.groups.values() if len(x) == 1] #reindex
final = df.reindex(idx)
This takes specific columns as input and also outputs only those columns. I want to print the whole record, not only the specific columns of the record.
I tried this and it gave me the rows:
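A variation on the concat idea from Answer 2 that might keep whole records: tag each frame with its origin, drop every key combination that appears more than once, and keep what is left from the newer frame. This is only a sketch. It assumes each key combination appears at most once within each frame, and the sample data and the helper column `source` are hypothetical.

```python
import pandas as pd

# Hypothetical sample data standing in for the two CSV files.
old = pd.DataFrame({"column1": [1, 2, 3],
                    "column2": ["a", "b", "c"],
                    "value": [10, 20, 30]})
new = pd.DataFrame({"column1": [1, 2, 4],
                    "column2": ["a", "x", "d"],
                    "value": [11, 21, 41]})

cols = ["column1", "column2"]  # columns to compare on

# Stack both frames and remember where each row came from.
combined = pd.concat([old.assign(source="old"), new.assign(source="new")],
                     ignore_index=True)

# keep=False drops every row whose key pair occurs more than once,
# i.e. the pairs present in both frames (assumption: no repeats inside one frame).
unique_keys = combined.drop_duplicates(subset=cols, keep=False)

# Of the remaining rows, keep only those that came from `new`.
only_in_new = unique_keys[unique_keys["source"] == "new"].drop(columns="source")
print(only_in_new)
```

If a key pair can repeat inside `new`, those rows would be dropped as well, so this approach only fits data where the key columns are unique per file.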
import pandas as pd
# A and B are the two DataFrames read from the CSV files.
columns = ['column1', 'column2']  # replace with the names of the columns you want to use
new = pd.merge(A, B, how='right', on=columns)
# marker: any column from the first DataFrame (A) which isn't in the list columns.
# You will probably have to add an '_x' at the end of the column name.
marker = '{column from A}_x'
col = new[marker].dropna()
# Rows where A's column is missing had no match in A, so they exist only in B.
new = new[~new[marker].isin(col)]
This will give you the rows based on the columns list. Sorry for the bad naming. If you want to rename the columns a bit too, here's the code for that:
for column in new.columns:
    if column.endswith('_x'):
        # Drop the copies that came from A.
        new = new.drop(column, axis=1)
    elif column.endswith('_y'):
        # Strip the '_y' suffix from the columns that came from B.
        new = new.rename(columns={column: column[:-len('_y')]})
Tell me if it works.
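The merge above detects unmatched rows through missing values, which can misfire if the chosen column from A already contains NaN. A possible alternative, sketched here with hypothetical sample data, is to let `merge` label the rows itself through its `indicator` parameter and to merge only A's key columns, so no `_x`/`_y` suffixes appear. The name `only_in_B` is an assumption for illustration.

```python
import pandas as pd

# Hypothetical sample data standing in for the two CSV files.
A = pd.DataFrame({"column1": [1, 2, 3],
                  "column2": ["a", "b", "c"],
                  "value": [10, 20, 30]})
B = pd.DataFrame({"column1": [1, 2, 4],
                  "column2": ["a", "x", "d"],
                  "value": [11, 21, 41]})

columns = ["column1", "column2"]  # columns to compare on

# Left-merge B against A's key columns only; drop_duplicates avoids
# multiplying rows of B when a key pair repeats in A.
merged = B.merge(A[columns].drop_duplicates(), on=columns,
                 how="left", indicator=True)

# '_merge' is 'left_only' for rows of B with no matching key pair in A.
only_in_B = merged[merged["_merge"] == "left_only"].drop(columns="_merge")
print(only_in_B)
```

Since only the key columns of A take part in the merge, the result keeps B's original column names and the renaming loop is not needed.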