Remove Lines From First File Contained In Second File

I have two files, file1 and file2, of unequal size, each with at least a million newline-separated lines. I want to match the content of file1 against file2 and, if a match exists, remove the matching line from file1. Example:

+------------+-----------+--------------------------+
| file1      | file2     | after processing - file1 |
+------------+-----------+--------------------------+
| google.com | in.com    | google.com               |
+------------+-----------+--------------------------+
| apple.com  | quora.com | me.com                   |
+------------+-----------+--------------------------+
| me.com     | apple.com |                          |
+------------+-----------+--------------------------+

My code looks like this:

with open(file2) as fin:
        exclude = set(line.rstrip() for line in fin)

for line in fileinput.input(file1, inplace=True):
        if line.rstrip() not in exclude:
            print
            line,

This just deletes all the contents of file1. How can I fix that? Thanks.

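Your print statement and its argument are on separate lines, so the bare print writes only an empty line and the following line, is an expression that does nothing. Write print line, on a single line instead.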
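For readers on Python 3, where print is a function, the corrected loop could look roughly like the sketch below. The function name remove_listed_lines and the throwaway demo files are assumptions made for illustration; they are not part of the original question.

```python
import fileinput
import os
import tempfile


def remove_listed_lines(target_path, exclude_path):
    """Rewrite target_path in place, dropping lines that also appear in exclude_path."""
    with open(exclude_path) as fin:
        exclude = {line.rstrip() for line in fin}

    # With inplace=True, standard output is redirected into the file being read,
    # so whatever is printed inside the loop becomes the new file content.
    with fileinput.input(target_path, inplace=True) as lines:
        for line in lines:
            if line.rstrip() not in exclude:
                print(line, end="")  # the line already carries its own newline


# Demo on temporary files that mirror the example from the question.
workdir = tempfile.mkdtemp()
file1 = os.path.join(workdir, "file1")
file2 = os.path.join(workdir, "file2")
with open(file1, "w") as f:
    f.write("google.com\napple.com\nme.com\n")
with open(file2, "w") as f:
    f.write("in.com\nquora.com\napple.com\n")

remove_listed_lines(file1, file2)

with open(file1) as f:
    remaining = f.read().splitlines()
```

Passing end="" plays the role of the trailing comma in the Python 2 print statement: it stops print from adding a second newline after each kept line.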

If working memory is not a problem, I'd suggest a crude solution: load file2 into memory, then iterate over file1, writing the lines you want to keep to a temporary file:

import os
import shutil

FILE1 = "file1"  # path to file1
FILE2 = "file2"  # path to file2

# first load up FILE2 in the memory
with open(FILE2, "r") as f:  # open FILE2 for reading
    file2_lines = {line.rstrip() for line in f}  # use a set for FILE2 for fast matching

# open FILE1 for reading and a FILE1.tmp file for writing
with open(FILE1, "r") as f_in, open(FILE1 + ".tmp", "w") as f_out:
    for line in f_in:  # loop through the FILE1 lines
        if line.rstrip() not in file2_lines:  # no match in FILE2, so keep the line
            f_out.write(line)

# finally, overwrite the FILE1 with temporary FILE1.tmp
os.remove(FILE1)
shutil.move(FILE1 + ".tmp", FILE1)

EDIT: Apparently, fileinput.input() with inplace=True does pretty much the same thing, so your problem was indeed a typo. Oh well, I'm leaving the answer for posterity, as it gives you more control over the whole process.
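As one illustration of what that extra control might buy you, here is a minimal Python 3 sketch of the temporary-file approach wrapped in a function. It takes a pluggable key function and reports how many lines were dropped. The names filter_file, normalize and lowercase_key are hypothetical, and the case-insensitive matching is an assumption chosen for the demo, not something the question asked for.

```python
import os
import tempfile


def filter_file(target_path, exclude_path, normalize=str.strip):
    """Drop lines of target_path whose normalized form appears in exclude_path.

    Returns the number of lines that were removed.
    """
    with open(exclude_path) as f:
        exclude = {normalize(line) for line in f}

    removed = 0
    tmp_path = target_path + ".tmp"
    with open(target_path) as f_in, open(tmp_path, "w") as f_out:
        for line in f_in:
            if normalize(line) in exclude:
                removed += 1  # matched a line from the exclude file, skip it
            else:
                f_out.write(line)

    # os.replace overwrites the destination in a single step,
    # so no separate os.remove call is needed first.
    os.replace(tmp_path, target_path)
    return removed


def lowercase_key(line):
    """Hypothetical key function: ignore surrounding whitespace and letter case."""
    return line.strip().lower()


# Demo on temporary files; "Apple.com" differs from "apple.com" only by case.
workdir = tempfile.mkdtemp()
file1 = os.path.join(workdir, "file1")
file2 = os.path.join(workdir, "file2")
with open(file1, "w") as f:
    f.write("google.com\nApple.com\nme.com\n")
with open(file2, "w") as f:
    f.write("in.com\nquora.com\napple.com\n")

removed_count = filter_file(file1, file2, normalize=lowercase_key)

with open(file1) as f:
    remaining = f.read().splitlines()
```

With the default normalize=str.strip the function behaves like a plain exact match on stripped lines; swapping in another key function changes the matching rule without touching the file handling.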
