How to compare two CSV files in Python?

Question

I have two CSV files named file1.csv and file2.csv. file2.csv has only one column with five records, and file1.csv has three columns with more than a thousand records. I want to get the records from file1.csv whose last value also appears in file2.csv. For example, this is my file1.csv:

'A J1, Jhon1',jhon1@jhon.com, A/B-201 Test1
'A J2, Jhon2',jhon2@jhon.com, A/B-202 Test2
'A J3, Jhon3',jhon3@jhon.com, A/B-203 Test3
'A J4, Jhon4',jhon4@jhon.com, A/B-204 Test4
.......and more records

Inside my file2.csv I have only five records right now, but in the future there may be many more:

A/B-201 Test1
A/B-2012 Test12
A/B-203 Test3
A/B-2022 Test22

So I have to find the records in file1.csv by matching on the last column, at index[2] or index[-1].

This is what I did, but it does not give me any output; it just returns an empty list:

import csv 

file1 = open('file1.csv','r')
file2 = open('file2.csv','r')

f1 = list(csv.reader(file1))
f2 = list(csv.reader(file2))


new_list = []

for i in f1:
  if i[-1] in f2:
     new_list.append(i)

print('New List : ',new_list)

It gives me output like this:

New List :  []

If I did anything wrong, please help and correct me.

Answer 1

Method 1: `pandas`

This task can be done with relative ease using pandas (see the DataFrame documentation).

Example:

In the example below, the two CSV files are read into two DataFrames. The DataFrames are then merged using an inner join on the matching columns, so only rows of file1 whose last column appears in file2 are kept.

The output shows the merged result.

import pandas as pd

df1 = pd.read_csv('file1.csv', names=['col1', 'col2', 'col3'], quotechar="'", skipinitialspace=True)
df2 = pd.read_csv('file2.csv', names=['match'])

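# Inner join on the matching columns, then drop the helper column that came from file2.
df = pd.merge(df1, df2, left_on='col3', right_on='match', how='inner').drop(columns='match')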

The `quotechar` and `skipinitialspace` parameters are needed because the first column in file1 is quoted with single quotes and contains a comma, and because there is leading whitespace after the comma before the last field.

Output:

    col1            col2            col3
0   A J1, Jhon1     jhon1@jhon.com  A/B-201 Test1
1   A J3, Jhon3     jhon3@jhon.com  A/B-203 Test3

If you choose, the output can easily be written back to a CSV file as:

df.to_csv('path/to/output.csv')

For other DataFrame operations, refer to the pandas DataFrame documentation.

Method 2: Core Python

The method below does not use any libraries, only core Python.

Read the matches from file2 into a list.
Iterate over file1 and check each line to determine whether its last value matches an item from file2.
Report the output.

Any subsequent data cleaning (if required) will be up to your personal requirements or use case.

Example:

output = []

# Read the matching values into a list.
with open('file2.csv') as f:
    matches = [i.strip() for i in f]

# Iterate over file1 and place any matches into the output.
with open('file1.csv') as f:
    for i in f:
        match = i.split(',')[-1].strip()
        if any(match == j for j in matches):
            output.append(i)

Output:

["'A J1, Jhon1',jhon1@jhon.com, A/B-201 Test1\n",
 "'A J3, Jhon3',jhon3@jhon.com, A/B-203 Test3\n"]

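As a side note on why the code in the question prints an empty list, here is a minimal sketch that stays with the `csv` module. It assumes the sample rows shown in the question and uses `io.StringIO` in place of the real files (both are stand-ins for illustration). Two things appear to go wrong in the original: `csv.reader` yields one list per row, so a string is never "in" a list of lists, and with default settings the single quotes are not treated as quoting and the space after the last comma stays in the field.

```python
import csv
import io

# Sample data copied from the question; io.StringIO stands in for the real files.
FILE1 = (
    "'A J1, Jhon1',jhon1@jhon.com, A/B-201 Test1\n"
    "'A J2, Jhon2',jhon2@jhon.com, A/B-202 Test2\n"
    "'A J3, Jhon3',jhon3@jhon.com, A/B-203 Test3\n"
    "'A J4, Jhon4',jhon4@jhon.com, A/B-204 Test4\n"
)
FILE2 = "A/B-201 Test1\nA/B-2012 Test12\nA/B-203 Test3\nA/B-2022 Test22\n"

# Default settings: the first column is split at its inner comma and the
# last field keeps its leading space.
default_rows = list(csv.reader(io.StringIO(FILE1)))
print(default_rows[0])

# Flatten file2 into plain strings instead of a list of one-item lists.
wanted = [row[0].strip() for row in csv.reader(io.StringIO(FILE2)) if row]

# Tell the reader about the single-quote quoting and the space after commas.
reader = csv.reader(io.StringIO(FILE1), quotechar="'", skipinitialspace=True)
new_list = [row for row in reader if row and row[-1].strip() in wanted]

print('New List : ', new_list)
```

With real files, the `io.StringIO(...)` calls would be replaced by file objects opened with `open('file1.csv', newline='')` and `open('file2.csv', newline='')`.
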
Answer 2

Use sets or dicts for `in` checks (lookups are O(1) on average for them, instead of O(N) for lists and tuples).
Have a look at the convtools library (GitHub): it has a Table helper for processing tabular data as streams (see its table docs).

from convtools import conversion as c
from convtools.contrib.tables import Table

# creating a set of allowed values
allowed_values = {
    item[0] for item in Table.from_csv("input2.csv").into_iter_rows(tuple)
}

result = list(
    # reading a file with custom quotechar
    Table.from_csv("input.csv", dialect=Table.csv_dialect(quotechar="'"))
    # stripping last column values
    .update(COLUMN_2=c.col("COLUMN_2").call_method("strip"))
    # filtering based on allowed values
    .filter(c.col("COLUMN_2").in_(c.naive(allowed_values)))
    # returning iterable of tuples
    .into_iter_rows(tuple)

    # # OR outputting csv if needed
    # .into_csv("result.csv")
)
"""
>>> In [36]: result
>>> Out[36]:
>>> [('A J1, Jhon1', 'jhon1@jhon.com', 'A/B-201 Test1'),
>>>  ('A J3, Jhon3', 'jhon3@jhon.com', 'A/B-203 Test3')]
"""

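If adding a dependency is not an option, the same set-based lookup can be sketched with the standard library alone. This is only an illustration of the advice above, not part of convtools: the function name `filter_rows` is hypothetical, and the quoting options assume file1 is formatted like the sample in the question.

```python
import csv


def filter_rows(file1_path, file2_path):
    """Return the rows of file1 whose last column appears in file2."""
    # Build a set once so each membership test is O(1) on average,
    # however many values file2 grows to hold.
    with open(file2_path, newline='') as f:
        allowed = {line.strip() for line in f if line.strip()}

    with open(file1_path, newline='') as f:
        # Assumed format: single-quoted first column, space after commas.
        reader = csv.reader(f, quotechar="'", skipinitialspace=True)
        return [row for row in reader if row and row[-1].strip() in allowed]


# Hypothetical usage:
# rows = filter_rows('file1.csv', 'file2.csv')
```

Reading file2 into a set keeps the cost of filtering roughly proportional to the size of file1, which matters once file2 holds far more than five values.
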
2 answers

Answer 1: score 3, accepted, 2021-07-29 10:20:17
Answer 2: score 0, 2022-07-07 18:23:06
