
Excel diff with Python

I am looking for an algorithm to compare two Excel sheets, based on their column names, in Python.

I do not know what the columns are, so one sheet may have an additional column or both sheets can have several columns with the same name.

The easiest case is when a column in the first sheet corresponds to exactly one column in the second Excel sheet. Then I can diff the rows of that column using xlrd. If the column name is not unique, I can check whether the columns are in the same position.

Does anyone know of an already existing algorithm or have any experience in this domain?

Fast and dirty:

# Since the order of the names doesn't matter, we can use set()
matching_names = set(sheet_one_names) & set(sheet_two_names)
...
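# Here, order does matter since we're comparing row data position by position,
# not just checking whether the values match at some point.
# Note that this collects the values from sheet one that do NOT match.
mismatching_rowdata = [i for i, j in zip(columndata_one, columndata_two) if i != j]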

Note: this assumes that you've done a few things beforehand:

  1. get the column names for sheet 1 via xlrd, and the same for the second sheet;
  2. get the row data for both sheets in two different variables.
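
Those two preparatory steps might look roughly like this with xlrd. This is only a sketch: it assumes the header is in the first row of the first sheet, and `load_first_sheet` and `read_columns` are made-up helper names. Also keep in mind that xlrd 2.0 and later only read the old .xls format.

```python
def load_first_sheet(path):
    """Open a workbook and return its first sheet (assumed to hold the data)."""
    import xlrd  # third-party; imported here so read_columns() works without it
    return xlrd.open_workbook(path).sheet_by_index(0)


def read_columns(sheet):
    """Return (names, columns) for a sheet whose first row is the header.

    Works with any object offering xlrd's Sheet interface:
    ncols, row_values(rowx) and col_values(colx, start_rowx).
    """
    names = sheet.row_values(0)
    # Skip row 0 (the header) when collecting each column's cells.
    columns = [sheet.col_values(col, 1) for col in range(sheet.ncols)]
    return names, columns
```

With that, something like `sheet_one_names, columns_one = read_columns(load_first_sheet('one.xls'))` gives the names, and `columns_one[k]` is the data of the k-th column (the file name is just a placeholder).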

This is to give you an idea.

Also note that for the [...] option (the second one), it's important that both columns have the same length: zip() stops at the shorter one, so any extra rows are silently skipped. As written, it collects the mismatches; swap != for == to get the matching values instead.
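
If the columns may differ in length, one way around the skipping is `itertools.zip_longest`, which pads the shorter column instead of stopping early. A minimal sketch, assuming `None` never occurs as a real cell value (`diff_column` is a hypothetical name, not something from xlrd):

```python
from itertools import zip_longest


def diff_column(columndata_one, columndata_two):
    """Return (row_index, value_one, value_two) for every row that differs.

    A row that exists in only one column shows up with None on the other side.
    """
    differences = []
    # zip_longest pads the shorter column with None rather than truncating.
    pairs = zip_longest(columndata_one, columndata_two, fillvalue=None)
    for index, (one, two) in enumerate(pairs):
        if one != two:
            differences.append((index, one, two))
    return differences
```

Returning the row index along with both values makes it easier to locate the difference in the sheet afterwards.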

This is a slower but functional solution:

column_a_name = ['Location', 'Building', 'Location']
column_a_data = [['Floor 1', 'Main', 'Sweden'],
                ['Floor 2', 'Main', 'Sweden'],
                ['Floor 3', 'Main', 'Sweden']]

column_b_name = ['Location', 'Building']
column_b_data = [['Sweden', 'Main', 'Floor 1'],
                ['Norway', 'Main', 'Floor 2'],
                ['Sweden', 'Main', 'Floor 3']]

# Pair up columns that share both a name and a position.
# zip() stops at the shorter header list, so a surplus column is ignored.
matching_names = []
for pos, (name_a, name_b) in enumerate(zip(column_a_name, column_b_name)):
    if name_a == name_b:
        matching_names.append((name_a, pos))

mismatching_data = []
for row, (rowa, rowb) in enumerate(zip(column_a_data, column_b_data)):
    for name, _id in matching_names:
        # Report a mismatch only if the cells differ and at least one of the
        # two values does not appear anywhere in the other sheet's row.
        if rowa[_id] != rowb[_id] and (rowa[_id] not in rowb or rowb[_id] not in rowa):
            mismatching_data.append((row, rowa[_id], rowb[_id]))

print(mismatching_data)
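
The loop above only pairs columns that sit at the same index in both sheets. If the column order can differ, one alternative (not part of the answer above, just a sketch) is to pair the n-th column called X in one sheet with the n-th column called X in the other. `pair_columns` is a hypothetical helper name.

```python
from collections import defaultdict


def pair_columns(names_a, names_b):
    """Pair column indices across two header rows by (name, occurrence).

    Returns (pairs, only_in_a, only_in_b): pairs is a list of
    (name, index_in_a, index_in_b); the other two list unmatched indices.
    """
    # Remember where each name occurs in sheet B, in order of appearance.
    positions_b = defaultdict(list)
    for index, name in enumerate(names_b):
        positions_b[name].append(index)

    pairs, only_in_a = [], []
    used_b = set()
    seen = defaultdict(int)  # how often each name has appeared in A so far
    for index_a, name in enumerate(names_a):
        occurrence = seen[name]
        seen[name] += 1
        candidates = positions_b.get(name, [])
        if occurrence < len(candidates):
            index_b = candidates[occurrence]
            pairs.append((name, index_a, index_b))
            used_b.add(index_b)
        else:
            only_in_a.append(index_a)  # no counterpart left in sheet B
    only_in_b = [i for i in range(len(names_b)) if i not in used_b]
    return pairs, only_in_a, only_in_b
```

For the header lists in the example, the first 'Location' and 'Building' are paired with their counterparts, and the second 'Location' in sheet A is reported as having no match.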
