How to compare two .csv and .xlsx files and print out the mismatches for a particular field?

It doesn't matter whether they are two different .csv or .xlsx files, but I need a common way to tell which fields are mismatched. The two files differ in both shape and size.

For example, file A might have 32,000 rows while file B might have only 16,000. This is because I am trying to compare the drift between two different databases from a report. One of the databases is a source for the other. For example: dbA feeds into dbB, making dbA a superset of dbB.

The problem is that I am trying to match employeeID in both databases.

For example, let's say file A contains the following:

firstname, lastname, namekey, employeeID, SSN

File B contains:

firstname, lastname, namekey, username, email_address, phone_number, EmployeeID, SSN

The field I have to match on is employeeID=EmployeeID. How can I print out a diff view that shows only the rows where the ID did not match?

  • I don't want rows from file A that are not in file B
  • I don't want rows from file B that are not in file A
  • I just want rows, present in both files, where the employee ID is mismatched based on some criteria

The criteria could be anything. Technically, I can run a SQL command when producing the .csv or .xlsx file to pull out some unique key identifier, as we have common names but different employee ID numbers.

So I guess SSN could be the main filter to say: hey, this ID is different for this SSN. I just need a way to accomplish this and generate a single file that shows the difference. I couldn't care less what language I have to use, as I am familiar with a lot of different things, but preferably Python or some other tool that is good at parsing this and is not OS-dependent.

This is what I have tried so far:

vimdiff
git diff --color-words="[^[:space:],]+" x.csv y.csv

Both of them display the differences well, but I don't want rows that are not in both files to appear in the output. Otherwise, it just creates a lot of noise.

To read all values in a column from a CSV file:

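from collections import defaultdict
from csv import DictReader as csv_DictReader
from pathlib import Path

csv_file = defaultdict(list)  # maps each header name to the list of values in that column
filepath = Path("whatever/myfile.csv")  # a Path, so that .open() is available
with filepath.open(encoding="cp1252", newline="") as file:
    reader = csv_DictReader(file)
    for row in reader:
        for (k, v) in row.items():
            csv_file[k].append(v)
csv_column = csv_file['employeeID']  # Tell it what column to read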
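The two files in the question spell the ID header differently (employeeID vs. EmployeeID), and the header list for file A has spaces after the commas. A minimal sketch of one way to smooth that over is to read whole rows and normalize the header names. The function name `read_rows` and the sample values are made up for illustration, and `io.StringIO` stands in for the real files:

```python
import csv
import io


def read_rows(file):
    """Read CSV rows as dicts with stripped, lowercased header names."""
    rows = []
    for row in csv.DictReader(file):
        rows.append({
            key.strip().lower(): (value or "").strip()
            for key, value in row.items()
            if key is not None  # skip overflow cells beyond the header
        })
    return rows


# In-memory stand-ins for the two reports (made-up sample values).
text_a = (
    "firstname, lastname, namekey, employeeID, SSN\n"
    "Ann, Lee, alee, 1001, 111-11-1111\n"
)
text_b = (
    "firstname,lastname,namekey,username,email_address,phone_number,EmployeeID,SSN\n"
    "Ann,Lee,alee,annl,ann@example.com,555-0100,9001,111-11-1111\n"
)
rows_a = read_rows(io.StringIO(text_a))
rows_b = read_rows(io.StringIO(text_b))
print(rows_a[0]["employeeid"], rows_b[0]["employeeid"])
```

After this, both files expose the ID under the same key, `employeeid`, so later comparison code does not need to know which spelling each file used.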

To read all values in a column from an Excel file:

from openpyxl import load_workbook

filepath = "whatever/myfile.xlsx"
excel_file = load_workbook(filepath)
excel_sheet = excel_file.active  # the worksheet that was active when the file was saved
excel_columns = {}  # maps each column letter to the list of its cell values
for column in "ABC":  # Tell it what columns to read
    if column not in excel_columns:
        excel_columns[column] = []
    # Starts at row 1, so a header row (if any) becomes the first value
    for row in range(1, excel_sheet.max_row + 1):
        cell_name = f"{column}{row}"
        excel_columns[column].append(excel_sheet[cell_name].value)

So we've read both files entirely, and now you have two structures: csv_column, a list of the values in one CSV column, and excel_columns, a dict that maps each column letter to a list of its cell values.

All you have to do now is compare the results.
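Because the files differ in size, comparing the columns position by position will not line up the same person. A rough sketch of the comparison the question describes is to index one file by SSN and report only people who appear in both files with different IDs. This assumes each file has been loaded as a list of row dicts rather than as separate columns, and that SSN is unique per person; the function name `find_id_mismatches`, the output keys `id_in_a` and `id_in_b`, and the sample data are hypothetical:

```python
def find_id_mismatches(rows_a, rows_b, key="SSN", id_a="employeeID", id_b="EmployeeID"):
    """Return one record per key present in BOTH inputs whose IDs differ."""
    # Index file B by the join key (assumes the key is unique per person).
    ids_b = {row[key]: row[id_b] for row in rows_b}
    mismatches = []
    for row in rows_a:
        k = row[key]
        if k not in ids_b:
            continue  # present only in A: not wanted in the output
        if row[id_a] != ids_b[k]:
            mismatches.append({key: k, "id_in_a": row[id_a], "id_in_b": ids_b[k]})
    return mismatches


# Made-up sample rows standing in for the two reports.
rows_a = [
    {"SSN": "111-11-1111", "employeeID": "1001"},
    {"SSN": "222-22-2222", "employeeID": "1002"},
    {"SSN": "333-33-3333", "employeeID": "1003"},  # only in A
]
rows_b = [
    {"SSN": "111-11-1111", "EmployeeID": "1001"},  # same ID in both
    {"SSN": "222-22-2222", "EmployeeID": "9002"},  # ID differs
    {"SSN": "444-44-4444", "EmployeeID": "1004"},  # only in B
]
mismatches = find_id_mismatches(rows_a, rows_b)
print(mismatches)
# → [{'SSN': '222-22-2222', 'id_in_a': '1002', 'id_in_b': '9002'}]
```

Rows that exist in only one file never produce a record, which is what keeps the noise out. IDs are compared as strings here, so values read from Excel as numbers would need converting first.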

Suggestion: print both csv_column and excel_columns to check what you got from the code above (because, let's be fully honest here, I just copy-pasted it from a project I was working on last year, and I've already forgotten half of it, so I'm not entirely sure of the output. It just works).
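Since the question asks for a single file that shows the difference, here is a small sketch of writing mismatch records out as CSV with the standard library. It assumes the records are a list of dicts, one per mismatched person; the field names `SSN`, `id_in_a` and `id_in_b` and the function name `write_mismatch_report` are made up for illustration, and an in-memory buffer is used in place of a real output file:

```python
import csv
import io


def write_mismatch_report(mismatches, file):
    """Write mismatch records to an open text file as CSV with a header row."""
    fieldnames = ["SSN", "id_in_a", "id_in_b"]  # hypothetical column names
    writer = csv.DictWriter(file, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(mismatches)


# Made-up sample record; a real run would pass the comparison results instead.
sample = [{"SSN": "222-22-2222", "id_in_a": "1002", "id_in_b": "9002"}]
buffer = io.StringIO()
write_mismatch_report(sample, buffer)
report_text = buffer.getvalue()
print(report_text)
```

To write to disk instead, pass a file opened with `open("mismatches.csv", "w", newline="")`; the file name is only an example.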
