
Get differences between two Excel files

Problem Summary

Given two Excel files, each with approximately 200 columns and a common index column (every row in both files has, say, a Name property), what is the best way to generate an output Excel file that contains only the rows of file 2 that differ from file 1? A difference is defined as either a new row in file 2 that is not in file 1, or a row in file 2 that has the same index (Name) as a row in file 1 but differs in one or more of the other columns.

There is a good example using pandas that could be useful: "Compare 2 Excel files and output an Excel file with differences". That solution is difficult to apply to an Excel file with 200 columns, though.

Sample Files

Below is a sample of two simplified Excel files (columns reduced from 200 to 4), shown in CSV format, file 1 first and then file 2. The index column is Name.

Name,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Perth,Tim

Name,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie

Given the two input files above, the output file should be:

Name,value,location,Name Copy
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie

The output file has two rows (not counting the header row): row 1 contains a change from file 1 to file 2, and row 2 is a new row that is not in file 1.
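To make the two kinds of difference concrete, here is a minimal sketch that loads the sample data from in-memory CSV text (instead of real workbooks) and classifies each name. The names `FILE1`, `FILE2`, `new_names` and `changed_names` are illustrative, and the sketch assumes that Name values are unique within each file.

```python
import io
import pandas as pd

# The two sample files from above, held as CSV text for illustration only
FILE1 = """Name,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Perth,Tim
"""
FILE2 = """Name,value,location,Name Copy
Bob,400,Sydney,Bob
Tim,500,Adelaide,Tim
Melanie,600,Brisbane,Melanie
"""

old = pd.read_csv(io.StringIO(FILE1), index_col="Name")
new = pd.read_csv(io.StringIO(FILE2), index_col="Name")

# Kind 1: names that exist only in file 2
new_names = new.index.difference(old.index)

# Kind 2: names present in both files whose rows are not identical
shared = new.index.intersection(old.index)
changed_names = [n for n in shared if not old.loc[n].equals(new.loc[n])]

print(list(new_names), changed_names)
```

With the sample data, Melanie falls into the first kind and Tim into the second, which matches the expected output file.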

The following works, but the index column is lost (it is [1, 2] instead of ['Tim', 'Melanie']):

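import pandas as pd

# Read both workbooks, using the first column (Name) as the index
df1 = pd.read_excel('simple1.xlsx', index_col=0)
df2 = pd.read_excel('simple2.xlsx', index_col=0)

# Merge on all common columns; the Name index is ignored and therefore lost
df3 = pd.merge(df1, df2, how='right', sort=False, indicator='Indicator')
df4 = df3.loc[df3['Indicator'] == 'right_only']
df5 = df4.drop('Indicator', axis=1)

# The context manager saves the file on exit (writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('test.xlsx', engine='xlsxwriter') as writer:
    df5.to_excel(writer, sheet_name='Sheet1')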
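The index disappears because a column-on-column merge ignores the index of both frames. One possible workaround, sketched here with in-memory copies of the sample data rather than the real workbooks, is to turn Name into an ordinary column before merging and restore it as the index afterwards. This is only a sketch: it assumes both files share the same columns and dtypes, and the variable names are illustrative.

```python
import pandas as pd

# In-memory stand-ins for simple1.xlsx and simple2.xlsx (illustrative data)
df1 = pd.DataFrame(
    {"value": [400, 500], "location": ["Sydney", "Perth"]},
    index=pd.Index(["Bob", "Tim"], name="Name"),
)
df2 = pd.DataFrame(
    {"value": [400, 500, 600], "location": ["Sydney", "Adelaide", "Brisbane"]},
    index=pd.Index(["Bob", "Tim", "Melanie"], name="Name"),
)

# reset_index() makes Name a regular column, so it takes part in the merge
merged = pd.merge(
    df1.reset_index(), df2.reset_index(), how="right", indicator="Indicator"
)

# Rows found only in the right frame are new or changed; restore Name as index
diff = (
    merged.loc[merged["Indicator"] == "right_only"]
    .drop(columns="Indicator")
    .set_index("Name")
)
print(diff)
```

Because the merge key is every shared column, a row counts as "both" only when all of its values match, so changed rows and new rows both end up as right_only.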

The solution was to use numpy.array_equal to determine whether rows were equal:

import sys
import pandas as pd
import numpy as np

# Check for correct number of input arguments; exit only when it is wrong
if len(sys.argv) != 4:
    print('Usage :\n\tpython {} old_excel_file new_excel_file  output_excel_file\n'.format(sys.argv[0]))
    sys.exit(1)

# Import input files into dataframes
old_file = sys.argv[1]
new_file = sys.argv[2]
out_file = sys.argv[3]
df1 = pd.read_excel(old_file, index_col=0)
df2 = pd.read_excel(new_file, index_col=0)

# Merge dataframes, maintaining index 
df_merged = pd.merge(df1, df2, left_index=True, right_index=True, how='outer', sort=False, indicator='Indicator')

# Add right-only rows to output dataframe
right_only_index = df_merged.index[df_merged['Indicator'] == 'right_only']
df_out = df2.loc[right_only_index]

# Collect the "both" rows whose values differ between the two files
both_index = df_merged.index[df_merged['Indicator'] == 'both']
changed_index = [i for i in both_index
                 if not np.array_equal(df1.loc[i].values, df2.loc[i].values)]

# DataFrame.append was removed in pandas 2.0, so concatenate instead
df_out = pd.concat([df_out, df2.loc[changed_index]])

# Write the output dataframe to an Excel file; the context manager saves it on exit
with pd.ExcelWriter(out_file, engine='xlsxwriter') as writer:
    df_out.to_excel(writer, sheet_name='Sheet1')
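For wide files with many rows, a row-by-row loop can be slow, and an element-wise comparison can report empty cells as differences because NaN is not equal to itself. The following is a hedged, vectorized sketch of the same idea, not the method used above. `diff_rows` is a hypothetical helper name, and the sketch assumes the index values are unique and that both frames have the same columns.

```python
import pandas as pd

def diff_rows(old: pd.DataFrame, new: pd.DataFrame) -> pd.DataFrame:
    """Return rows of `new` that are absent from `old` or differ in any column."""
    # Align `old` to the labels of `new`; names missing from `old` become NaN rows
    old_aligned = old.reindex(index=new.index, columns=new.columns)
    # A cell is unchanged if the values are equal or both cells are missing
    same = (old_aligned == new) | (old_aligned.isna() & new.isna())
    changed = (~same.all(axis=1)).to_numpy()
    # Rows whose name does not exist in `old` at all are always reported
    is_new = ~new.index.isin(old.index)
    return new.loc[is_new | changed]

# Illustrative data matching the sample files (not read from Excel here)
old = pd.DataFrame(
    {"value": [400, 500], "location": ["Sydney", "Perth"]},
    index=pd.Index(["Bob", "Tim"], name="Name"),
)
new = pd.DataFrame(
    {"value": [400, 500, 600], "location": ["Sydney", "Adelaide", "Brisbane"]},
    index=pd.Index(["Bob", "Tim", "Melanie"], name="Name"),
)

result = diff_rows(old, new)
print(result)
```

Unlike the loop above, this keeps the rows in the order they appear in the new file, so changed and new rows are interleaved rather than grouped.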
