How to compare 2 txt files in Python

I have written a program to compare new1.txt with new2.txt. The lines that are in new1.txt but not in new2.txt have to be written to difference.txt.

Can someone please have a look and let me know what changes are required in the code below? The code writes the same value multiple times.

file1 = open("new1.txt",'r')        
file2 = open("new2.txt",'r')    
NewFile = open("difference.txt",'w')   
for line1 in file1:    
    for line2 in file2:    
        if line2 != line1:    
            NewFile.write(line1)    
file1.close()    
file2.close()
NewFile.close()

Here's an example using the with statement, assuming the files are small enough to fit in memory:

# Open 'new1.txt' as f1, 'new2.txt' as f2 and 'diff.txt' as outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:

    # Read the lines from 'new2.txt' and store them into a python set
    lines = set(f2.readlines())

    # Loop through each line in 'new1.txt'
    for line in f1:

        # If the line was not in 'new2.txt'
        if line not in lines:

            # Write the line to the output file
            outf.write(line)
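
One caveat when comparing raw lines: if the last line of either file has no trailing newline, it will not compare equal to the same text followed by a newline in the other file. A minimal sketch of one way around that, assuming it is acceptable to ignore line endings (the helper name lines_only_in_first is made up for this example, and in-memory file objects stand in for the real files):

```python
import io

def lines_only_in_first(first, second):
    """Return lines of `first` that are missing from `second`, ignoring line endings."""
    seen = {line.rstrip('\r\n') for line in second}
    return [line.rstrip('\r\n') for line in first
            if line.rstrip('\r\n') not in seen]

# In-memory stand-ins for new1.txt and new2.txt; note the missing final newline
f1 = io.StringIO("apple\nbanana\ncherry")
f2 = io.StringIO("cherry\napple\n")
result = lines_only_in_first(f1, f2)
print(result)
```

When writing the result to a file, the newline has to be added back, for example with outf.write(line + '\n').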

The with statement simply closes the opened file(s) automatically when the block is left. These two pieces of code are roughly equivalent:

with open('temp.log', 'w') as temp:
    temp.write('Temporary logging.')

# roughly equivalent to:

temp = open('temp.log', 'w')
temp.write('Temporary logging.')
temp.close()
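
Strictly speaking, with also closes the file when the body raises an exception, which the plain open/close version does not. A closer equivalent uses try/finally. This is a minimal sketch; the scratch directory is created only for the demonstration and the file is opened in write mode:

```python
import os
import tempfile

# Hypothetical scratch location, used only for this demonstration
path = os.path.join(tempfile.mkdtemp(), 'temp.log')

temp = open(path, 'w')
try:
    temp.write('Temporary logging.')
finally:
    # Runs whether or not write() raised, just like leaving a with block
    temp.close()

# Read the file back to confirm the text was written
with open(path) as check:
    content = check.read()
print(temp.closed, content)
```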

Yet another way uses two sets, but this again isn't very memory efficient. If your files are big, this won't work. Note that it writes the lines that appear in only one of the two files, whichever file that is:

# Again, open the three files as f1, f2 and outf
with open('new1.txt') as f1, open('new2.txt') as f2, open('diff.txt', 'w') as outf:

    # Read the lines in 'new1.txt' and 'new2.txt'
    s1, s2 = set(f1.readlines()), set(f2.readlines())

    # `s1 - s2 | s2 - s1` returns the lines that are in exactly one of the two sets
    # Now we simply loop through the different lines
    for line in s1 - s2 | s2 - s1:

        # And output all the different lines
        outf.write(line)

Keep in mind that this last version might not preserve the order of your lines.
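
If order matters, one option is to use the sets only for membership tests and walk the original lines in order. A sketch under that assumption (the function name is hypothetical, and both inputs are held in memory, as in the code above):

```python
def ordered_symmetric_difference(lines1, lines2):
    """Lines found in only one input: first those from lines1, then those from lines2."""
    lines1, lines2 = list(lines1), list(lines2)
    s1, s2 = set(lines1), set(lines2)
    only_in_first = [line for line in lines1 if line not in s2]
    only_in_second = [line for line in lines2 if line not in s1]
    return only_in_first + only_in_second

# Invented sample data standing in for the lines of new1.txt and new2.txt
diff = ordered_symmetric_difference(["a\n", "b\n", "c\n"], ["b\n", "d\n"])
print(diff)
```

Unlike the pure set version, a line that is repeated within one file is kept once per occurrence here.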

For example, say file1 contains: line1 line2

and file2 contains: line1 line3 line4

When you compare line1 with line3, they differ, so you write line1 to your output file. Then you compare line1 with line4; again they are not equal, so you write line1 to your output file again... You need to break out of both for loops if your condition is true. You can use a helper variable to break out of the outer for.
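
The effect can be reproduced without any files. This is only an illustration, with io.StringIO objects standing in for new1.txt and new2.txt and a list standing in for the output file:

```python
import io

file1 = io.StringIO("line1\nline2\n")
file2 = io.StringIO("line1\nline3\nline4\n")
written = []  # stands in for difference.txt

# Same nested loops as in the question
for line1 in file1:
    for line2 in file2:
        if line2 != line1:
            written.append(line1)

print(written)
```

Here line1 is recorded twice even though it is present in file2, and line2 is never recorded, because file2 is already exhausted by the time the outer loop reaches its second line.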

It is because of your for loops.

If I understand correctly, you want to see which lines in file1 are not present in file2.

So for each line in file1, you have to check whether the same line appears in file2. But this is not what your code does: for each line in file1, you check every line in file2 (this is right), but each time the line in file2 is different from the line in file1, you write the line from file1! You should write the line from file1 only AFTER having checked ALL the lines in file2, to be sure the line does not appear there even once.

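It could look something like the code below. Note that new2.txt is read into a list first, because testing membership directly on the open file would consume it during the first check:

file1 = open("new1.txt", 'r')
file2 = open("new2.txt", 'r')
NewFile = open("difference.txt", 'w')

# Read new2.txt once; a file object can only be iterated a single time
lines2 = file2.readlines()

for line1 in file1:
    if line1 not in lines2:
        NewFile.write(line1)

file1.close()
file2.close()
NewFile.close()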

If your file is a big one, you could use the for-else method:

The else clause below the second for loop is executed only when that loop completes without hitting break, that is, if there is no match.

Modification:

with open('new1.txt') as file1, open('diff.txt', 'w') as NewFile:
    for line1 in file1:
        # Reopen new2.txt for every line so the scan starts from the top
        with open('new2.txt') as file2:
            for line2 in file2:
                if line2 == line1:
                    break
            else:
                NewFile.write(line1)

For more on the for-else construct, see this Stack Overflow question: for-else
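
The for-else behaviour is easier to see on plain lists. A small sketch with invented data and a hypothetical function name; the second argument must be something that can be iterated repeatedly, such as a list:

```python
def missing_items(wanted, available):
    """Return the items of `wanted` that never match anything in `available`."""
    missing = []
    for item in wanted:
        for candidate in available:
            if candidate == item:
                break  # match found: the else clause is skipped
        else:
            # The inner loop finished without break: no match
            missing.append(item)
    return missing

result = missing_items(["a", "b", "c"], ["b"])
print(result)
```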

I always find that working with sets makes comparing two collections easier, especially because "does this collection contain this" operations run in O(1) on average, and most nested loops can be reduced to a single loop (easier to read, in my opinion).

with open('test1.txt') as file1, open('test2.txt') as file2, open('diff.txt', 'w') as diff:
    s1 = set(file1)
    s2 = set(file2)
    for e in s1:
        if e not in s2:
            diff.write(e)

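Your inner loop is executed multiple times. To avoid that, use this, which pairs the lines up and compares the two files position by position:

file1 = open("new1.txt", 'r')
file2 = open("new2.txt", 'r')
NewFile = open("difference.txt", 'w')
# zip pairs lines by position (on Python 2, itertools.izip does this lazily)
for line1, line2 in zip(file1, file2):
    if line2 != line1:
        NewFile.write(line1)
file1.close()
file2.close()
NewFile.close()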
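
Pairing lines like this answers a slightly different question: it reports lines that differ at the same position, and it stops at the end of the shorter file. A sketch of a positional comparison that also covers the leftover lines, using itertools.zip_longest (the function name and sample data are made up):

```python
from itertools import zip_longest

def positional_differences(lines1, lines2):
    """Return (line_number, line1, line2) for every position where the inputs differ."""
    diffs = []
    # zip_longest pads the shorter input with None instead of stopping early
    for number, (a, b) in enumerate(zip_longest(lines1, lines2), start=1):
        if a != b:
            diffs.append((number, a, b))
    return diffs

diffs = positional_differences(["x\n", "y\n"], ["x\n", "z\n", "w\n"])
print(diffs)
```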

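Write to NewFile only after comparing with all the lines of file2:

with open("new1.txt") as file1, open("new2.txt") as file2, open("difference.txt", "w") as NewFile:
    for line1 in file1:
        file2.seek(0)  # rewind new2.txt before each scan
        present = False
        for line2 in file2:
            if line2 == line1:
                present = True
                break
        if not present:
            NewFile.write(line1)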

You can use basic set operations for this:

with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    diffs.writelines(set(f1).difference(f2))

According to this reference, the set difference itself is O(n), where n is the number of lines in the first file. Keep in mind that either version still has to read both files in full, so the total work is linear in the combined number of lines. If you know that the second file is significantly smaller than the first, you can optimise with set.difference_update(), which has complexity O(n) where n is the number of lines in the second file. For example:

with open('new1.txt') as f1, open('new2.txt') as f2, open('diffs.txt', 'w') as diffs:
    s = set(f1)
    s.difference_update(f2)
    diffs.writelines(s)
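
Both methods accept any iterable, which is why the open file can be passed directly. The difference is that difference() builds a new set, while difference_update() changes the set in place. A small in-memory illustration (the sample lines are invented); since sets are unordered, sorted() is used to get a repeatable output order:

```python
first = {"a\n", "b\n", "c\n"}

# difference() returns a new set and leaves `first` untouched
new_set = first.difference(["b\n"])

# difference_update() removes the matching items from the set it is called on
in_place = set(first)
in_place.difference_update(iter(["b\n", "c\n"]))

print(sorted(new_set))
print(sorted(in_place))
```

The same sorted() call can be wrapped around the set passed to writelines() if a stable line order in the output file is wanted.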
