How to delete lines in file1 that appear once or multiple times in file2 in Python?

I have two text files: file1 has 40 lines and file2 has 1.3 million lines. I would like to compare every line in file1 with file2. If a line in file1 appears once or multiple times in file2, every occurrence of that line should be deleted from file2, and the remaining lines of file2 written to a third file, file3. So far I can only painfully delete one line at a time by manually copying it from file1 into my code, where it appears as "unwanted_line". Does anyone know how to do this in Python? Thanks in advance for your assistance. Here's my code:

       import os

       fname = open(input('Enter input filename: '))  # file2
       outfile = open('Value.txt', 'w')

       unwanted_line = "222"  # this is in file1

       for line in fname:
           # keep only the lines that do not contain unwanted_line
           if unwanted_line not in line:
               # write the line to the output file
               outfile.write(line)

       fname.close()
       outfile.close()

       print('results written to:\n', os.path.join(os.getcwd(), 'Value.txt'))

NOTE:

This is how I got it to work for me. I would like to thank everyone who contributed towards the solution; I took my ideas from your answers here. I used set(): the intersection (the lines common to file1 and file2) is removed, and the lines unique to file2 are returned for file3. It might not be the most elegant way of doing it, but it works for me. I respect every one of your ideas; they are great and wonderful, and they make me feel Python is the only programming language in the whole world. Thanks guys.

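        def diff_lines(filenameA, filenameB):
            # Despite their names, both arguments must be iterables of lines
            # (open file objects or lists), not filename strings: set() on a
            # string would produce a set of single characters.
            fnameA = set(filenameA)   # lines of file1 (unwanted)
            fnameB = set(filenameB)   # lines of file2

            # Keep the lines of file2 that do not occur in file1.
            # (symmetric_difference would also return lines that occur
            # only in file1.)
            diff_line = fnameB - fnameA
            # Going through sets drops duplicate lines and the original order.
            data = sorted(diff_line)
            return data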
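
To see how the set operators behave, here is a small worked example in Python 3 with made-up line contents (the values are illustrative and not taken from the real files). The ^ operator (symmetric difference) keeps whatever occurs in exactly one of the two sets, while - (difference) keeps only what is left of file2:

```python
# Illustrative data only: tiny stand-ins for the lines of file1 and file2.
file1_lines = {"222\n", "999\n"}
file2_lines = {"111\n", "222\n", "333\n"}

# Symmetric difference: lines found in exactly one of the two sets.
# "999\n" shows up here even though it never occurred in file2.
in_exactly_one = sorted(file1_lines ^ file2_lines)

# Difference: lines of file2 that are not in file1.
left_in_file2 = sorted(file2_lines - file1_lines)

print(in_exactly_one)   # ['111\n', '333\n', '999\n']
print(left_in_file2)    # ['111\n', '333\n']
```

The two results coincide only when every line of file1 also occurs somewhere in file2.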

Read file1; put the lines into a set or dict (it'll have to be a dict if you're using a really old version of Python); now go through file2 and say something like if line not in things_seen_in_file_1: outfile.write(line) for each line.

Incidentally, in recent Python versions you shouldn't bother calling readlines: an open file is an iterator over its lines, so you can just say for line in open(filename2): ... to process each line of the file.
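
A minimal sketch of the approach described above, in Python 3. The function name filter_lines and its three path parameters are made up for illustration; the sketch assumes that a line counts as "the same" only if it matches exactly, including its trailing newline.

```python
def filter_lines(small_path, big_path, out_path):
    """Copy big_path to out_path, skipping lines that also occur in small_path.

    Returns the number of lines written. All names here are hypothetical.
    """
    # file1 is tiny (40 lines), so holding it in a set is cheap and makes
    # each membership test a hash lookup instead of a scan.
    with open(small_path) as small:
        unwanted = set(small)

    kept = 0
    with open(big_path) as big, open(out_path, "w") as out:
        # Iterating over the file object reads one line at a time, so the
        # 1.3 million lines of file2 are never all in memory together.
        for line in big:
            if line not in unwanted:
                out.write(line)
                kept += 1
    return kept
```

It could be called as filter_lines('file1', 'file2', 'file3'), where the three names stand for whatever paths you actually use.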

Here is my version, but be aware that minuscule variations (such as a single space before the newline) can cause two lines not to be considered the same.

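file1, file2, file3 = 'verysmalldict.txt', 'uk.txt', 'not_small.txt'
with open(file1) as unwanted:
    drop_these = set(unwanted)
with open(file2) as infile, open(file3, 'w') as outfile:
    # The generator is consumed line by line, so file2 is streamed to file3.
    outfile.writelines(line for line in infile if line not in drop_these)
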
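
If the whitespace caveat mentioned above is a concern, one possible workaround is to compare lines with leading and trailing whitespace stripped. This is only a sketch in Python 3: the helper name drop_matching is hypothetical, and it assumes that differences in surrounding whitespace should be ignored, which may not suit every file.

```python
def drop_matching(unwanted_lines, lines):
    """Yield each line whose stripped text is not among the stripped unwanted lines.

    Hypothetical helper: both arguments are iterables of strings, for
    example open file objects.
    """
    # Normalise the small file once, so "222", "222\n" and " 222 \n"
    # all map to the same key.
    unwanted = {u.strip() for u in unwanted_lines}
    for line in lines:
        # Compare the normalised text, but yield the line unchanged.
        if line.strip() not in unwanted:
            yield line


kept = list(drop_matching(["222\n", "abc"], ["111\n", "222 \n", " abc\n", "333\n"]))
print(kept)  # ['111\n', '333\n']
```

Note that with this normalisation a blank line in the small file would cause every blank line in the large file to be dropped.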
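# Variant that rewrites file2 in place and saves the removed lines to file3.
# path1, path2 and path3 are placeholder paths for file1, file2 and file3.
path1, path2, path3 = 'file1', 'file2', 'file3'

with open(path1) as f1:
    lines1 = set(f1)
with open(path2) as f2:
    lines2 = tuple(f2)   # read everything before path2 is reopened for writing

lines3 = [x for x in lines2 if x in lines1]       # lines removed from file2
lines2 = [x for x in lines2 if x not in lines1]   # lines that stay in file2

with open(path2, 'w') as f2:
    f2.writelines(lines2)
with open(path3, 'w') as f3:
    f3.writelines(lines3)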

Closing f2 by using 2 separate with statements is a matter of personal preference/design choice.

What you can do is load file1 completely into memory (since it is small) and check each line in file2 against the lines of file1. If it matches none of them, write it to file3. Sort of like this:

file1 = open('file1')
file2 = open('file2')
file3 = open('file3','w')

lines_from_file1 = []
# Read in all lines from file1
for line in file1:
    lines_from_file1.append(line)
file1.close()

# Now iterate over lines of file2
for line2 in file2:
    keep_this_line = True
    for line1 in lines_from_file1:
        if line1 == line2:
            keep_this_line = False
            break # break out of inner for loop
    if keep_this_line:
        # line from file2 is not in file1 so save it into file3
        file3.write(line2) 

file2.close()
file3.close()

Maybe not the most elegant solution, but if you don't have to do it every 3 seconds, it should work.

EDIT: By the way, the question in the text somewhat differs from the title. I tried to answer the question in the text.
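
For the reading suggested by the title (deleting from file1 the lines that appear in file2), a sketch in Python 3 might look like the following. The function name lines_not_in_other is hypothetical, and the sketch assumes exact line matches; it still streams the large file rather than loading it.

```python
def lines_not_in_other(small_lines, big_lines):
    """Return the lines of the small input that never occur in the big input.

    Hypothetical helper: both arguments are iterables of strings, for
    example open file objects. The order of the small input is preserved.
    """
    candidates = list(small_lines)   # the small file fits in memory
    remaining = set(candidates)      # lines not yet seen in the big file

    for line in big_lines:
        # discard() does nothing if the line is not in the set.
        remaining.discard(line)
        if not remaining:
            break                    # every small-file line has been found

    return [line for line in candidates if line in remaining]


result = lines_not_in_other(["a\n", "b\n", "c\n"], ["x\n", "b\n", "b\n", "y\n"])
print(result)  # ['a\n', 'c\n']
```

Because only the small file's lines are kept in the set, memory use stays small no matter how long the other file is.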
