简体   繁体   English

比较两个csv文件

[英]Compare two csv files

I am trying to compare two csv files to look for common values in column 1. 我正在尝试比较两个csv文件以在第1列中查找通用值。

import csv

f_d1 = open('test1.csv')
f_d2 = open('test2.csv')

f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)

for x,y in zip(f1_csv,f2_csv):
    print(x,y)

I am trying to compare x[0] with y[0]. 我正在尝试将x [0]与y [0]进行比较。 I am fairly new to python and trying to find the most pythonic way to achieve the results. 我对python还是相当陌生,并试图找到最pythonic的方式来实现结果。 Here is the csv files. 这是csv文件。

test1.csv test1.csv

Hadrosaurus,1.2
Struthiomimus,0.92
Velociraptor,1.0
Triceratops,0.87
Euoplocephalus,1.6
Stegosaurus,1.4
Tyrannosaurus Rex,2.5

test2.csv test2.csv

Euoplocephalus,1.87
Stegosaurus,1.9
Tyrannosaurus Rex,5.76
Hadrosaurus,1.4
Deinonychus,1.21
Struthiomimus,1.34
Velociraptor,2.72

I believe you're looking for the set intersection: 我相信您正在寻找设定的交集:

import csv

f_d1 = open('test1.csv')
f_d2 = open('test2.csv')

f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)

x = set([item[0] for item in f1_csv])
y = set([item[0] for item in f2_csv])

print(x & y)

I added a line to test whether the numerical values in each row are the same. 我添加了一行以测试每行中的数值是否相同。 You can modify this to test whether, for instance, the values are within some distance of each other: 您可以修改此值以测试例如值之间是否在一定距离之内:

import csv

f_d1 = open('test1.csv')
f_d2 = open('test2.csv')

f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)

for x,y in zip(f1_csv,f2_csv):
    if x[1] == y[1]:
        print('they match!')

Take advantage of the defaultdict in Python and you can iterate both the files and maintain the count in a dictionary like this 利用Python中的defaultdict ,您可以迭代两个文件并在这样的字典中维护计数

from collections import defaultdict
d = defaultdict(list)

for row in f1_csv:
    d[row[0]].append(row[1])

for row in f2_csv:
    d[row[0]].append(row[1])

d = {k: d[k] for k in d if len(d[k]) > 1}

print(d)

Output: 输出:

    {'Hadrosaurus': ['1.2', '1.4'], 'Struthiomimus': ['0.92', '1.34'], 'Velociraptor': ['1.0', '2.72'], 
'Euoplocephalus': ['1.6', '1.87'], 'Stegosaurus': ['1.4', '1.9'], 'Tyrannosaurus Rex': ['2.5', '5.76']}

Assuming that the files are not prohibitively large, you can read both of them with a CSV reader, convert the first columns to sets, and calculate the set intersection: 假设文件不是很大,您可以使用CSV阅读器读取它们,将第一列转换为集合,然后计算集合交集:

with open('test1.csv') as f:
   set1 = set(x[0] for x in csv.reader(f))
with open('test2.csv') as f:
   set2 = set(x[0] for x in csv.reader(f))
print(set1 & set2)
#{'Hadrosaurus', 'Euoplocephalus', 'Tyrannosaurus Rex', 'Struthiomimus', 
#  'Velociraptor', 'Stegosaurus'}

声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM