
Python: Compare two large files

This is a follow-up to the question "Compare two large files", which was answered by phihag.

I want to display the number of lines that differ after comparing two files. The count should be printed as a message once the program completes.

My try :

with open(file2) as b:
  blines = set(b)
with open(file1) as a:
  with open(file3, 'w') as result:
    for line in a:
      if line not in blines:
        result.write(line)

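with open(file2) as b:
  blines = set(b)  # lines of file2, stored in a set for fast membership tests
with open(file1) as a:
  # keep the lines of file1 that do not appear anywhere in file2
  lines_to_write = [l for l in a if l not in blines]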

print('count of lines are in difference:', len(lines_to_write))
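
If the differing lines should still be written to the result file, as in the question, the count can be kept in the same loop instead of building a list. The following is only a sketch under that assumption; the function name write_and_count_diff is made up for illustration, and file1, file2 and file3 are the paths from the question.

```python
def write_and_count_diff(file1, file2, file3):
    """Write the lines of file1 that are absent from file2 to file3.

    Returns the number of lines written. The function name is hypothetical;
    the three arguments are the file paths used in the question.
    """
    with open(file2) as b:
        blines = set(b)  # only file2 is held in memory

    count = 0
    with open(file1) as a, open(file3, 'w') as result:
        for line in a:
            if line not in blines:
                result.write(line)
                count += 1
    return count
```

Calling print('count of lines are in difference:', write_and_count_diff(file1, file2, file3)) would then print the message after the result file has been written.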

If you can load everything into memory, you can perform the following operations on sets, where alines and blines hold the lines of the two files:

union = set(alines).union(blines)
intersection = set(alines).intersection(blines)
unique = union - intersection

EDIT: Even simpler (and faster) is:

set(alines).symmetric_difference(blines)
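
The answer does not show where alines and blines come from. A minimal sketch, assuming they are simply the lines of the two files, could look like this; the function name and the stripping of trailing newlines are my additions, not part of the answer.

```python
def count_symmetric_difference(path_a, path_b):
    """Count distinct lines that occur in exactly one of the two files.

    Hypothetical helper: both files are read fully into memory as sets,
    so duplicates and line order within a file are ignored.
    """
    with open(path_a) as a:
        # strip the newline so a final line without '\n' still matches
        alines = {line.rstrip('\n') for line in a}
    with open(path_b) as b:
        blines = {line.rstrip('\n') for line in b}
    return len(alines.symmetric_difference(blines))
```

Because sets are used, a line that is repeated in one file counts only once, and a line that merely moved to a different position is not reported as a difference.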

Edit: This answer assumes you want to compare corresponding lines from the two files. If that's not what you want, ignore this answer. I'll leave it here for future readers.


If you just want the count of differing lines, avoid creating large lists. File objects are memory-efficient iterators, and your task needs no more memory than it takes to look at two lines at once.

Demo (with two fake files)

>>> fake_file_1 = '''1
... 2
... 3'''.splitlines()
>>> 
>>> fake_file_2 = '''1
... 1
... 3
... 4'''.splitlines()

I am assuming that you want the answer 2 here, because the second lines differ and fake_file_2 has an additional fourth line.

>>> from itertools import zip_longest # izip_longest in Python2
>>> sum(1 for line1, line2 in zip_longest(fake_file_1, fake_file_2, fillvalue=float('nan')) 
...     if line1 != line2)
2

zip_longest works like zip and yields pairs of corresponding lines from the two files. In addition, if one file is longer, the fillvalue float('nan') is inserted for the missing lines; NaN compares unequal to everything, including itself (of course, you could just use any other non-string dummy value like 0, but I like it this way).

Instead of the fake files, just use the handles of your actual opened files.
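
With real files, the same expression can be wrapped in a small function. This is a sketch for Python 3 only; the function name count_differing_positions and its two path parameters are assumptions for illustration.

```python
from itertools import zip_longest


def count_differing_positions(path1, path2):
    """Count line positions at which the two files differ.

    Hypothetical helper: lines are compared pairwise by position, and every
    extra line in the longer file counts as one difference.
    """
    with open(path1) as f1, open(path2) as f2:
        # NaN never equals a string, so padded positions always count
        pairs = zip_longest(f1, f2, fillvalue=float('nan'))
        return sum(1 for line1, line2 in pairs if line1 != line2)
```

Only two lines are held in memory at a time. Note that lines read from a file keep their trailing newline, so a last line without one will differ from the same text followed by a newline in the other file.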

I propose a solution based on pandas.

import pandas as pd

1. Create two pandas dataframes

# header=None: treat the first line as data, not as column names
df1 = pd.read_csv(filepath_1, header=None)
df2 = pd.read_csv(filepath_2, header=None)

2. In case your lines contain any potential delimiters, join all columns into one

# axis=1 joins the columns of each row; the default axis=0 would join down each column instead
df1 = df1.astype(str).apply(''.join, axis=1)
df2 = df2.astype(str).apply(''.join, axis=1)

3. Concatenate both frames into one

frames = [df1, df2]
df_merged = pd.concat(frames)

4. Drop every copy of any duplicated line (note that this also drops lines repeated within a single file)

df_unique = df_merged.drop_duplicates(keep = False)

5. Count and print result

print('count of lines are in difference:', len(df_unique))
