How to compare two .csv and .xlsx files and print out the mismatches for a particular field?

Question

It doesn't matter whether they are two different .csv or .xlsx files; I need a common way to identify the mismatched fields. The two files differ in both shape and size.

For example, file A might have 32,000 rows but file B might only have 16,000. This is because I am trying to compare the drift across two different databases from a report. One of the databases is a source for the other. For example: dbA feeds into dbB, making dbA a superset of dbB.

The problem arises when I try to match employeeID in both databases.

For example, let's say file A contains the following

firstname, lastname, namekey, employeeID, SSN

file B contains

firstname, lastname, namekey, username, email_address, phone_number, EmployeeID, SSN

The field I have to match on is employeeID=EmployeeID. How can I print out a diff view that shows only the rows where the ID did not match?

I don't want rows from file A that are not in file B
I don't want rows from file B that are not in file A
I just want rows where the employee ID is mismatched based on some criteria from both files

The criteria could be anything. Technically, I can run a SQL command to produce the .csv or .xlsx file and include some unique key identifier, since we have common names but different employee ID numbers.

So I guess SSN could be the main filter to say: hey, this ID is different for this SSN. I just need a way to accomplish this and generate a single file that shows the difference. I couldn't care less what language I have to use, as I am familiar with a lot of different things, but mainly Python or some other tool that is good at parsing this and is not OS-dependent.

I have tried this, thus far:

vimdiff
git diff --color-words="[^[:space:],]+" x.csv y.csv

Both of them showcase it well, but I don't want rows that are not in both files to appear in the output. Otherwise, it just creates a lot of noise.

Answer 1

To read all values in a column from a csv:

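from collections import defaultdict
from csv import DictReader as csv_DictReader
from pathlib import Path

csv_file = defaultdict(list)  # column name -> list of values in that column
filepath = Path("whatever/myfile.csv")  # Path object, so .open() is available
with filepath.open(encoding="cp1252", newline="") as file:
    reader = csv_DictReader(file)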
    for row in reader:
        for (k, v) in row.items():
            csv_file[k].append(v)
csv_column = csv_file['employeeID']  # Tell it what column to read

To read all values in a column from excel:

from openpyxl import load_workbook
filepath = "whatever/myfile.xlsx"
excel_file = load_workbook(filepath)
excel_sheet = excel_file.active
excel_columns = {}
for column in "ABC": # Tell it what columns to read
    if column not in excel_columns:
        excel_columns[column] = []
    for row in range(1, excel_sheet.max_row + 1):
        cell_name = f"{column}{row}"
        excel_columns[column].append(excel_sheet[cell_name].value)

So we've read both files in full, and you now have two objects: csv_column, a list of the values in one CSV column, and excel_columns, a dict that maps each column letter to a list of its cell values.

All you have to do now is just to compare the results.

Suggestion: print both csv_column and excel_columns to check what you got from the code above. To be fully honest, I copy-pasted it from a project I was working on last year and have forgotten half of it, so I'm not entirely sure of the exact output. It just works.
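
For the comparison step itself, here is a minimal sketch of one way it could go, assuming (as the question suggests) that SSN is a reliable shared key and that each SSN appears at most once per file. The function names, the id_in_a / id_in_b output columns, and the sample data are all made up for illustration. Rows are plain dicts, so they can come from csv.DictReader or be built from an Excel sheet. Header names are compared case-insensitively so that employeeID and EmployeeID line up, and only SSNs present in both files are reported.

```python
import csv
import io


def index_by_key(rows, key_field, id_field):
    """Map each key value (e.g. SSN) to the ID found on the same row."""
    index = {}
    for row in rows:
        # Lower-case the header names so "employeeID" and "EmployeeID" match.
        lowered = {
            k.strip().lower(): ("" if v is None else str(v).strip())
            for k, v in row.items()
            if isinstance(k, str)
        }
        key = lowered.get(key_field.lower(), "")
        if key:  # skip rows that have no key value
            index[key] = lowered.get(id_field.lower(), "")
    return index


def mismatched_ids(rows_a, rows_b, key_field="SSN", id_field="employeeID"):
    """Return keys present in BOTH inputs whose IDs differ."""
    a = index_by_key(rows_a, key_field, id_field)
    b = index_by_key(rows_b, key_field, id_field)
    return [
        {key_field: key, "id_in_a": a[key], "id_in_b": b[key]}
        for key in sorted(a.keys() & b.keys())  # intersection drops A-only and B-only rows
        if a[key] != b[key]
    ]


def write_report(mismatches, out_file, key_field="SSN"):
    """Write the mismatches as CSV to an already-open text file object."""
    writer = csv.DictWriter(out_file, fieldnames=[key_field, "id_in_a", "id_in_b"])
    writer.writeheader()
    writer.writerows(mismatches)


# Made-up sample data standing in for the two exported files.
FILE_A = """firstname,lastname,namekey,employeeID,SSN
Ann,Lee,alee,1001,111-11-1111
Bob,Ray,bray,1002,222-22-2222
Cy,Dow,cdow,1003,333-33-3333
"""
FILE_B = """firstname,lastname,namekey,username,email_address,phone_number,EmployeeID,SSN
Ann,Lee,alee,alee,a@example.com,555-0100,1001,111-11-1111
Bob,Ray,bray,bray,b@example.com,555-0101,2002,222-22-2222
Dee,Fox,dfox,dfox,d@example.com,555-0102,1004,444-44-4444
"""

rows_a = csv.DictReader(io.StringIO(FILE_A), skipinitialspace=True)
rows_b = csv.DictReader(io.StringIO(FILE_B), skipinitialspace=True)
mismatches = mismatched_ids(rows_a, rows_b)

report = io.StringIO()
write_report(mismatches, report)
print(report.getvalue())
```

With the sample data, only Bob's row is reported: Ann's IDs agree, and Cy and Dee each appear in just one file. To get a single output file on disk, pass a file opened with open("mismatches.csv", "w", newline="") (the file name is only an example) instead of the StringIO object.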
