Returning differences between two columns in two different files in Excel using Python

I have two CSV files with a common column named 'Name'. File 2 is updated continuously, and new values are added at random positions in the column. How can I write a script that compares the two columns and finds the differences regardless of where the new values appear in File 2?

Other solutions find the differences only if the new values are at the end of the column, not at random positions within it.

Code I have tried (it only outputs new values that sit at the bottom of the column, not ones placed randomly within it):

import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')

# Outer merge on 'Name'; keep only the rows that are not present in both files
new_df = (df1[['Name']].merge(df2[['Name']], on='Name', how='outer', indicator=True)
                       .query("_merge != 'both'")
                       .drop('_merge', axis=1))

new_df.to_csv('file4.csv')

File1:

Name     
gfd454
3v4fd
th678iy

File2:

Name     
gfd454
fght45
3v4fd
th678iy

The output should be:

Name
fght45

import pandas as pd

# df1: original dataframe of File_1 data
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})

# df2: dataframe of the changing File_2 data
df2 = pd.DataFrame({'Name': ['gfd454', 'abcde', 'fght45', '3v4fd', 'abcde', 'th678iy', 'abcde']})

# Assuming df1 comprises distinct elements and doesn't change, and that
# df2 contains all elements of df1 and more (the new updates).
# df2 may have duplicates like 'abcde'.

# Drop duplicates in df2; if df1 has duplicates, drop those first as well.
# keep='first': drop duplicates except for the first occurrence.
df2.drop_duplicates(keep='first', inplace=True)
print(df2)

# pandas.concat appends the rows of df2 to df1, even if they already exist in df1
df_concat = pd.concat([df1, df2], join='outer', ignore_index=True)
print(df_concat)

# Now drop every value that occurs in both df1 and df2
df_diff = df_concat.drop_duplicates(keep=False)
print(df_diff)

Now, the problem with this is that you have to ensure that df1 - df2 = {}, i.e. that df1 is a subset of df2. Otherwise, values that appear only in df1 also end up in the result.
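To illustrate that caveat, here is a minimal sketch in which df1 contains a value that df2 lacks. The value 'only_in_1' is made up for the example, and filtering the result back against df2 is just one possible guard, not part of the original answer.

```python
import pandas as pd

# 'only_in_1' is a hypothetical value that breaks the "df1 is a subset of df2" assumption
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy', 'only_in_1']})
df2 = pd.DataFrame({'Name': ['gfd454', 'fght45', '3v4fd', 'th678iy']})

# Same concat + drop_duplicates(keep=False) idea as above
both = pd.concat([df1.drop_duplicates(), df2.drop_duplicates()], ignore_index=True)
sym_diff = both.drop_duplicates(keep=False)

# sym_diff now holds values unique to either frame, including 'only_in_1'
print(sym_diff)

# Possible guard: keep only the values that actually come from df2
only_new = sym_diff[sym_diff['Name'].isin(df2['Name'])].reset_index(drop=True)
print(only_new)
```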

Do a left join, using File 2 on the left. Afterwards, extract the rows that have no match in File 1 (the ones that show up with NaN).
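A rough sketch of that left-join idea follows. When the join is on 'Name' alone there is no other column that could turn NaN, so this version assumes pandas' indicator=True flag is an acceptable substitute and filters on the resulting '_merge' column instead. The sample frames mirror File1 and File2 from the question.

```python
import pandas as pd

# Sample data mirroring File1 and File2 from the question
df1 = pd.DataFrame({'Name': ['gfd454', '3v4fd', 'th678iy']})
df2 = pd.DataFrame({'Name': ['gfd454', 'fght45', '3v4fd', 'th678iy']})

# Left join with File 2 on the left; indicator=True adds a '_merge' column
merged = df2[['Name']].merge(
    df1[['Name']].drop_duplicates(),  # avoid row multiplication if File 1 has repeats
    on='Name',
    how='left',
    indicator=True,
)

# Rows marked 'left_only' exist in File 2 but have no match in File 1
new_names = merged.loc[merged['_merge'] == 'left_only', ['Name']].reset_index(drop=True)
print(new_names)
```

Because File 2 is on the left, the position of a new value within the column does not matter, and values that exist only in File 1 never enter the result.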

If you only want to check one column, you can compare two lists:

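# Values of the 'Name' column from each file
list1 = df1['Name'].tolist()
list2 = df2['Name'].tolist()
# A set gives fast membership tests
s = set(list1)
# Values of File 2 that are not in File 1, in File 2 order
diff = [x for x in list2 if x not in s]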
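The same membership test can be wrapped into a small helper that reads both CSV sources directly. This is only a sketch: the function name find_new_names is hypothetical, and in-memory strings stand in for file1.csv and file2.csv so the example runs without files on disk.

```python
import io
import pandas as pd

def find_new_names(old_source, new_source, column='Name'):
    """Return values of `column` found in new_source but not in old_source."""
    old = pd.read_csv(old_source)
    new = pd.read_csv(new_source)
    # True for rows of the new file whose value never occurs in the old file
    mask = ~new[column].isin(old[column])
    return new.loc[mask, [column]].drop_duplicates().reset_index(drop=True)

# In-memory stand-ins for file1.csv and file2.csv (paths work the same way)
file1 = io.StringIO("Name\ngfd454\n3v4fd\nth678iy\n")
file2 = io.StringIO("Name\ngfd454\nfght45\n3v4fd\nth678iy\n")

result = find_new_names(file1, file2)
print(result)
```

To save the result as in the question, result.to_csv('file4.csv', index=False) writes it without the extra index column.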
