So it doesn't matter whether they are two different .csv or .xlsx files, but I need a common way to identify the mismatched fields. The two files differ in both shape and size.
For example file A might have 32,000 rows but file B might only have 16,000. This is because I am trying to compare the drift across two different databases from a report. One of the databases is a source for the other. For example: dbA feeds into dbB, making dbA a superset to dbB.
The problem arises when I try to match employeeID across both databases.
For example, let's say file A contains the following
firstname, lastname, namekey, employeeID, SSN
file B contains
firstname, lastname, namekey, username, email_address, phone_number, EmployeeID, SSN
The field I have to match on would be employeeID=EmployeeID. How can I print out a diff view that shows only the rows where the ID did not match?
The criteria could be anything. Technically, I can run a SQL command when producing the .csv or .xlsx file to pull out some unique key identifier, since we have common names but a different employee ID number.
So I guess SSN could be the main filter to say "hey, this ID is different for this SSN". I just need a way to accomplish this and generate a single file that shows the difference. I couldn't care less what language I have to use, as I am familiar with a lot of different things, but preferably Python or some other tool that is good at parsing this and is not OS-dependent.
I have tried this, thus far:
vimdiff
git diff --color-words="[^[:space:],]+" x.csv y.csv
Both of them showcase it well, but I don't want rows that are not in both files to appear in the output. Otherwise, it just creates a lot of noise.
To read all values in a column from a CSV file:
from collections import defaultdict
from csv import DictReader as csv_DictReader
from pathlib import Path
csv_file = defaultdict(list)  # column name -> list of values
filepath = Path("whatever/myfile.csv")  # a Path (not a str), so .open() works
with filepath.open(encoding="cp1252", newline="") as file:
    reader = csv_DictReader(file)
    for row in reader:
        for (k, v) in row.items():
            csv_file[k].append(v)
csv_column = csv_file['employeeID']  # Tell it what column to read
To read all values in a column from an Excel file:
from openpyxl import load_workbook
filepath = "whatever/myfile.xlsx"
excel_file = load_workbook(filepath)
excel_sheet = excel_file.active
excel_columns = {}  # column letter -> list of cell values
for column in "ABC":  # Tell it what columns to read
    if column not in excel_columns:
        excel_columns[column] = []
    for row in range(1, excel_sheet.max_row + 1):
        cell_name = f"{column}{row}"
        excel_columns[column].append(excel_sheet[cell_name].value)
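So we've read both files in full, and now you have two objects: csv_column, a list of the values in the chosen CSV column, and excel_columns, a dict that maps each column letter to a list of its cell values (header row included).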
All you have to do now is just to compare the results.
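One way to do that comparison, as a minimal sketch rather than a tested solution: key every row on SSN, keep only the SSNs that appear in both files, and report those whose employee IDs differ. The helper names (index_by_key, id_mismatches, write_report), the output column names and the sample rows are all made up for illustration. Header names are matched case-insensitively so that employeeID and EmployeeID count as the same column.

```python
import csv
import io


def index_by_key(rows, key_field, id_field):
    """Map each key (e.g. SSN) to its ID, matching header names case-insensitively."""
    index = {}
    for row in rows:
        # Normalise header case so "employeeID" and "EmployeeID" are the same column.
        lowered = {name.lower(): (value or "").strip()
                   for name, value in row.items() if name is not None}
        key = lowered.get(key_field.lower(), "")
        if key:
            index[key] = lowered.get(id_field.lower(), "")
    return index


def id_mismatches(rows_a, rows_b, key_field="SSN", id_field="employeeID"):
    """Return (key, id_in_a, id_in_b) for keys present in both inputs with different IDs."""
    ids_a = index_by_key(rows_a, key_field, id_field)
    ids_b = index_by_key(rows_b, key_field, id_field)
    # Only keys found in BOTH files are compared; rows missing from either side are skipped.
    shared = sorted(ids_a.keys() & ids_b.keys())
    return [(key, ids_a[key], ids_b[key]) for key in shared if ids_a[key] != ids_b[key]]


def write_report(mismatches, out_file, key_field="SSN"):
    """Write the mismatches as a single CSV report (hypothetical column names)."""
    writer = csv.writer(out_file)
    writer.writerow([key_field, "employeeID_A", "employeeID_B"])
    writer.writerows(mismatches)


# Made-up sample data standing in for file A and file B.
file_a = io.StringIO(
    "firstname,lastname,namekey,employeeID,SSN\n"
    "Ann,Lee,alee,1001,111-11-1111\n"
    "Bob,Ray,bray,1002,222-22-2222\n"
    "Cat,Poe,cpoe,1003,333-33-3333\n"
)
file_b = io.StringIO(
    "firstname,lastname,namekey,username,email_address,phone_number,EmployeeID,SSN\n"
    "Ann,Lee,alee,alee,a@x.test,555-0100,1001,111-11-1111\n"
    "Bob,Ray,bray,bray,b@x.test,555-0101,2002,222-22-2222\n"
)

mismatches = id_mismatches(csv.DictReader(file_a), csv.DictReader(file_b))
report = io.StringIO()
write_report(mismatches, report)
print(report.getvalue())
```

With real files, replace the io.StringIO objects with files opened using newline="" and the right encoding, and pass an open output file to write_report. Rows that exist in only one file never reach the report, which avoids the noise you got from vimdiff and git diff.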
Suggestion: print both csv_column and excel_columns to check what you got from the code above (because, let's be fully honest here, I just copy-pasted it from a project I was working on last year, and I've forgotten half of it already, so I'm not entirely sure of the output. It just works).