How to compare two CSV files in Python?
I have two CSV files named file1.csv and file2.csv. file2.csv has only one column, which contains only five records, and file1.csv has three columns, which contain more than a thousand records. I want to get the records from file1.csv whose value is contained in file2.csv. For example, this is my file1.csv:
'A J1, Jhon1',jhon1@jhon.com, A/B-201 Test1
'A J2, Jhon2',jhon2@jhon.com, A/B-202 Test2
'A J3, Jhon3',jhon3@jhon.com, A/B-203 Test3
'A J4, Jhon4',jhon4@jhon.com, A/B-204 Test4
.......and more records
And inside my file2.csv I have only five records right now, but in the future there can be many:
A/B-201 Test1
A/B-2012 Test12
A/B-203 Test3
A/B-2022 Test22
So I have to find the records in file1.csv by the value at index[2] or index[-1].
This is what I did, but it is not giving me any output; it just returns an empty list:
import csv
file1 = open('file1.csv','r')
file2 = open('file2.csv','r')
f1 = list(csv.reader(file1))
f2 = list(csv.reader(file2))
new_list = []
for i in f1:
    if i[-1] in f2:
        new_list.append(i)
print('New List : ',new_list)
It gives me output like this:
New List : []
Please help; if I did anything wrong, correct me.
pandas
This task can be done with relative ease using pandas. See the pandas DataFrame documentation for details.
In the example below, the two CSV files are read into two DataFrames. The DataFrames are merged using an inner join on the matching columns. The output shows the merged result.
import pandas as pd
df1 = pd.read_csv('file1.csv', names=['col1', 'col2', 'col3'], quotechar="'", skipinitialspace=True)
df2 = pd.read_csv('file2.csv', names=['match'])
df = pd.merge(df1, df2, left_on=df1['col3'], right_on=df2['match'], how='inner')
The quotechar and skipinitialspace parameters are used because the first column in file1 is quoted and contains a comma, and there is leading whitespace after the comma before the last field.
          col1            col2           col3
0  A J1, Jhon1  jhon1@jhon.com  A/B-201 Test1
1  A J3, Jhon3  jhon3@jhon.com  A/B-203 Test3
If you choose, the output can easily be written back to a CSV file as:
df.to_csv('path/to/output.csv')
For other DataFrame operations, refer to the pandas DataFrame documentation.
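If only the rows of file1 are wanted, without any columns coming from file2, a boolean filter is one possible alternative to the merge. The following is a minimal sketch, not part of the original answer: it assumes the sample rows from the question, uses io.StringIO objects as stand-ins for the two files so that it is self-contained, and the name filtered is illustrative.

```python
import io

import pandas as pd

# Inline stand-ins for file1.csv and file2.csv (sample rows from the question).
file1 = io.StringIO(
    "'A J1, Jhon1',jhon1@jhon.com, A/B-201 Test1\n"
    "'A J2, Jhon2',jhon2@jhon.com, A/B-202 Test2\n"
    "'A J3, Jhon3',jhon3@jhon.com, A/B-203 Test3\n"
    "'A J4, Jhon4',jhon4@jhon.com, A/B-204 Test4\n"
)
file2 = io.StringIO(
    "A/B-201 Test1\n"
    "A/B-2012 Test12\n"
    "A/B-203 Test3\n"
    "A/B-2022 Test22\n"
)

df1 = pd.read_csv(file1, names=['col1', 'col2', 'col3'],
                  quotechar="'", skipinitialspace=True)
df2 = pd.read_csv(file2, names=['match'])

# Keep only the rows of df1 whose col3 value appears in df2['match'].
filtered = df1[df1['col3'].isin(df2['match'])]
print(filtered)
```

With real files, the two StringIO objects would be replaced by the paths 'file1.csv' and 'file2.csv'. The filtered rows keep their original index labels from df1.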
The method below does not use any libraries, only core Python.
Read the matches from file2 into a list.
Iterate over file1 and search each line to determine if the last value is a match for an item in file2.
Any subsequent data cleaning (if required) will be up to your personal requirements or use-case.
output = []
# Read the matching values into a list.
with open('file2.csv') as f:
    matches = [i.strip() for i in f]
# Iterate over file1 and place any matches into the output.
with open('file1.csv') as f:
    for i in f:
        match = i.split(',')[-1].strip()
        if any(match == j for j in matches):
            output.append(i)
["'A J1, Jhon1',jhon1@jhon.com, A/B-201 Test1\n",
"'A J3, Jhon3',jhon3@jhon.com, A/B-203 Test3\n"]
Use sets for in checks (complexity is O(1) for them, instead of O(N) for lists and tuples).
The Table helper from convtools can be used for working with table data as with streams (table docs).
from convtools import conversion as c
from convtools.contrib.tables import Table
# creating a set of allowed values
allowed_values = {
    item[0] for item in Table.from_csv("input2.csv").into_iter_rows(tuple)
}
result = list(
    # reading a file with custom quotechar
    Table.from_csv("input.csv", dialect=Table.csv_dialect(quotechar="'"))
    # stripping last column values
    .update(COLUMN_2=c.col("COLUMN_2").call_method("strip"))
    # filtering based on allowed values
    .filter(c.col("COLUMN_2").in_(c.naive(allowed_values)))
    # returning iterable of tuples
    .into_iter_rows(tuple)
    # # OR outputting csv if needed
    # .into_csv("result.csv")
)
"""
>>> In [36]: result
>>> Out[36]:
>>> [('A J1, Jhon1', 'jhon1@jhon.com', 'A/B-201 Test1'),
>>> ('A J3, Jhon3', 'jhon3@jhon.com', 'A/B-203 Test3')]
"""
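The code in the question returns an empty list for two reasons. First, csv.reader yields every row as a list, so f2 is a list of one-element lists and a plain string such as i[-1] is never found in it. Second, with the default reader settings the last field of file1 keeps its leading space. A minimal sketch of a version that stays with the standard csv module and uses a set for the membership check is shown below. It assumes the sample layout from the question, uses io.StringIO objects as stand-ins for the two files, and the name wanted is illustrative.

```python
import csv
import io

# Inline stand-ins for file1.csv and file2.csv (sample rows from the question).
file1 = io.StringIO(
    "'A J1, Jhon1',jhon1@jhon.com, A/B-201 Test1\n"
    "'A J2, Jhon2',jhon2@jhon.com, A/B-202 Test2\n"
    "'A J3, Jhon3',jhon3@jhon.com, A/B-203 Test3\n"
    "'A J4, Jhon4',jhon4@jhon.com, A/B-204 Test4\n"
)
file2 = io.StringIO(
    "A/B-201 Test1\n"
    "A/B-2012 Test12\n"
    "A/B-203 Test3\n"
    "A/B-2022 Test22\n"
)

# Flatten file2 into a set of strings, one value per non-empty row.
wanted = {row[0].strip() for row in csv.reader(file2) if row}

# quotechar="'" keeps the quoted first column together;
# skipinitialspace=True drops the space that follows each comma.
reader = csv.reader(file1, quotechar="'", skipinitialspace=True)
new_list = [row for row in reader if row and row[-1].strip() in wanted]

print('New List : ', new_list)
```

With real files, each StringIO object would be replaced by a file opened with open('file1.csv', newline='') or open('file2.csv', newline=''), as the csv module documentation recommends.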