简体   繁体   中英

Compare two csv files

I am trying to compare two csv files to look for common values in column 1.

import csv

f_d1 = open('test1.csv')
f_d2 = open('test2.csv')

f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)

for x,y in zip(f1_csv,f2_csv):
    print(x,y)

I am trying to compare x[0] with y[0]. I am fairly new to python and trying to find the most pythonic way to achieve the results. Here is the csv files.

test1.csv

Hadrosaurus,1.2
Struthiomimus,0.92
Velociraptor,1.0
Triceratops,0.87
Euoplocephalus,1.6
Stegosaurus,1.4
Tyrannosaurus Rex,2.5

test2.csv

Euoplocephalus,1.87
Stegosaurus,1.9
Tyrannosaurus Rex,5.76
Hadrosaurus,1.4
Deinonychus,1.21
Struthiomimus,1.34
Velociraptor,2.72

I believe you're looking for the set intersection:

import csv

f_d1 = open('test1.csv')
f_d2 = open('test2.csv')

f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)

x = set([item[0] for item in f1_csv])
y = set([item[0] for item in f2_csv])

print(x & y)

I added a line to test whether the numerical values in each row are the same. You can modify this to test whether, for instance, the values are within some distance of each other:

import csv

f_d1 = open('test1.csv')
f_d2 = open('test2.csv')

f1_csv = csv.reader(f_d1)
f2_csv = csv.reader(f_d2)

for x,y in zip(f1_csv,f2_csv):
    if x[1] == y[1]:
        print('they match!')

Take advantage of the defaultdict in Python and you can iterate both the files and maintain the count in a dictionary like this

from collections import defaultdict
d = defaultdict(list)

for row in f1_csv:
    d[row[0]].append(row[1])

for row in f2_csv:
    d[row[0]].append(row[1])

d = {k: d[k] for k in d if len(d[k]) > 1}

print(d)

Output:

    {'Hadrosaurus': ['1.2', '1.4'], 'Struthiomimus': ['0.92', '1.34'], 'Velociraptor': ['1.0', '2.72'], 
'Euoplocephalus': ['1.6', '1.87'], 'Stegosaurus': ['1.4', '1.9'], 'Tyrannosaurus Rex': ['2.5', '5.76']}

Assuming that the files are not prohibitively large, you can read both of them with a CSV reader, convert the first columns to sets, and calculate the set intersection:

with open('test1.csv') as f:
   set1 = set(x[0] for x in csv.reader(f))
with open('test2.csv') as f:
   set2 = set(x[0] for x in csv.reader(f))
print(set1 & set2)
#{'Hadrosaurus', 'Euoplocephalus', 'Tyrannosaurus Rex', 'Struthiomimus', 
#  'Velociraptor', 'Stegosaurus'}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM