
How to compare 2 files to get only lines that are not in the second file?

I have two files like these:

file 1 :                     file 2 :
col1    col2                 col1     col2
john    kerry                john     kerry
adam    lord                 bob      abram  
joe     hitch               

I would like to compare these two files by first name and last name, and produce a file containing only the people from file 1 who do not appear in file 2, that is to say:

desired output file :

col1     col2
adam     lord
joe      hitch

I tried this, but I don't get the right output:

import csv

reader1=csv.reader(open('file1.csv', 'r'), delimiter='\t')
reader2=csv.reader(open('file2.csv', 'r'), delimiter='\t')
writer=csv.writer(open('desired_file.csv', 'w'), delimiter=',')

row1 = reader1.next()
row2 = reader2.next()

if (row1[0] == row2[0]) and (row1[1] == row2[1]):
    print 'equal'
else:
    writer.writerow(row1)
    writer.writerow(row2)

I'd use a set difference:

with open('file1') as f1, open('file2') as f2:
    data1 = set(f1)
    lines_not_in_f2 = data1.difference(f2)

If the formatting of the files can be slightly different, you might need to wrap the file objects in a generator which yields tuples:

def people(my_file):
    for line in my_file:
        yield tuple(x.lower() for x in line.split())

with open('file1') as f1, open('file2') as f2:
    data1 = set(people(f1))
    people_not_in_f2 = data1.difference(people(f2))

This has the advantage that you don't need to read the entire f2 file into memory: only the contents of f1 are stored in the set, and f2 is consumed line by line. It has the disadvantage that the output names are unordered (since they are stored in a set).
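If the original order matters, one option is to put file 2 in the set instead and stream file 1 through it. The sketch below assumes whitespace-separated names and case-insensitive matching, as in the generator above. The helper names `normalize` and `rows_not_in` are made up for illustration, and the lists stand in for open file objects.

```python
def normalize(line):
    """Turn a line into a hashable, case-insensitive key."""
    return tuple(part.lower() for part in line.split())


def rows_not_in(lines1, lines2):
    """Yield lines from lines1 whose key is absent from lines2, in order."""
    exclude = {normalize(line) for line in lines2}
    for line in lines1:
        if normalize(line) not in exclude:
            yield line


# In-memory stand-ins for the two files from the question (headers omitted).
file1 = ["john    kerry\n", "adam    lord\n", "joe     hitch\n"]
file2 = ["john     kerry\n", "bob      abram\n"]

kept = list(rows_not_in(file1, file2))
```

The trade-off is reversed compared with the set difference: here file 2 is held in memory and file 1 is streamed, but the surviving lines come out in their original order and with their original spelling.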

I think you do not need the csv module if the file formats are the same. How about this solution:

with open('file2') as f2:
    exclude_names = frozenset(f2)  # set for fast membership tests
with open('file1') as f1, open('output', 'w') as out:
    for name in f1:
        if name not in exclude_names:
            out.write(name)
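One thing to watch with whole-line comparison: the line ending is part of each string, so a last line without a trailing newline will not match the same name followed by a newline in the other file. Below is a minimal sketch that compares stripped lines instead. `filter_lines` is a hypothetical helper, and the lists stand in for open file objects.

```python
def filter_lines(lines1, lines2):
    """Return lines from lines1 not present in lines2, ignoring line endings."""
    exclude = frozenset(line.rstrip("\r\n") for line in lines2)
    return [line for line in lines1 if line.rstrip("\r\n") not in exclude]


# "john\tkerry" is the last line of file 2 and has no trailing newline.
file1 = ["john\tkerry\n", "adam\tlord\n", "joe\thitch\n"]
file2 = ["bob\tabram\n", "john\tkerry"]

# Raw comparison misses the match because of the differing line ending.
naive = [line for line in file1 if line not in frozenset(file2)]
stripped = filter_lines(file1, file2)
```

Here `naive` still contains the john/kerry line, while `stripped` drops it.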

Solution with csv reader/writer:

import csv

# csv rows are lists, which are unhashable, so convert each row to a tuple
with open('file2.csv', 'r') as f2:
    exclude_names = frozenset(tuple(row) for row in csv.reader(f2, delimiter='\t'))
with open('file1.csv', 'r') as f1, open('desired_file.csv', 'w') as out:
    writer = csv.writer(out, delimiter=',')
    for row in csv.reader(f1, delimiter='\t'):
        if tuple(row) not in exclude_names:
            writer.writerow(row)
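
Compare the two readers row by row (note that this only compares rows at the same position in both files):

results = [i for i, j in zip(reader1, reader2) if i != j]

or use set(map(tuple, reader1)) - set(map(tuple, reader2)) if the order is not important. The rows have to be converted to tuples because csv rows are lists, which cannot be stored in a set.

# 'wb' is the Python 2 mode for csv files; on Python 3 use 'w' with newline=''
myfile = open(..., 'wb')
wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
wr.writerows(results)  # one output row per entry in results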


 