Finding common lines in 2 different files

I am trying to find the lines that two files have in common and write them to a new text file. I wrote the script below, but it does not find the common lines; it just writes out the contents of the file passed as the second argument. Please help me troubleshoot it.

#!/usr/bin/python

import sys


def find_common_lines(arg1, arg2, arg3):
    fh1 = open(arg1, 'r+')
    fh2 = open(arg2, 'r+')
    with open(arg3, 'w+') as f:
        for line in fh1 and fh2:
            if line:
                f.write(line)

    fh1.close()
    fh2.close()


number_of_arguments = len(sys.argv) - 1
if number_of_arguments < 3:
    print("ERROR:\tThe script is called with less than 3 arguments, but it needs 3!")
    print("Usage:\tfind_common_lines.py <file1> <file2> <output_filepath>")
else:
    arg1 = sys.argv[1]
    arg2 = sys.argv[2]
    arg3 = sys.argv[3]
    find_common_lines(arg1, arg2, arg3)

So, basically what I want this script to do is:

File A

AAB
BBC
DDE
GGC

File B

123
AAB
DDE
345
GHY
GJK

File C

AAB
DDE

Thanks!!!

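The problem is the loop header. `fh1 and fh2` does not iterate over both files: `and` returns its second operand whenever the first one is truthy, and an open file object is always truthy, so the expression is simply `fh2`. The loop therefore walks through the second file only, and `if line:` is true for every line it reads, which is why the output is a copy of the file given as arg2. Read one file into a set first, then test each line of the other file against it. Change:

for line in fh1 and fh2:
    if line:
        f.write(line)

to

lines_in_fh2 = set(fh2)
for line in fh1:
    if line in lines_in_fh2:
        f.write(line)

Note that this compares lines including their trailing newline, so a final line without a newline will not match the same text followed by one.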
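
The behaviour of `and` between two file objects can be checked directly. Below is a minimal sketch that uses in-memory `io.StringIO` objects as stand-ins for the two open files; the names `file_a`, `file_b`, `looped_over` and `common` are only for illustration and do not come from the question.

```python
import io

# In-memory stand-ins for the two open files from the question.
file_a = io.StringIO("AAB\nBBC\nDDE\nGGC\n")
file_b = io.StringIO("123\nAAB\nDDE\n345\nGHY\nGJK\n")

# `x and y` returns y whenever x is truthy, and file-like objects are truthy,
# so this expression is simply the second object.
looped_over = file_a and file_b
assert looped_over is file_b

# Iterating over it therefore yields every line of the second file only.
lines_seen = [line.rstrip("\n") for line in looped_over]
print(lines_seen)  # ['123', 'AAB', 'DDE', '345', 'GHY', 'GJK']

# What was intended: keep the lines of the first file that also occur
# in the second one.
file_b.seek(0)
lines_in_b = {line.rstrip("\n") for line in file_b}
common = [line.rstrip("\n") for line in file_a if line.rstrip("\n") in lines_in_b]
print(common)  # ['AAB', 'DDE']
```

Stripping the newline before comparing is a choice made here so that a last line without a trailing newline still matches.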

You can use the pandas library for this (imported as pd in the session below).

Create a dataframe for each .txt file like this:

In [2017]: df_A = pd.read_fwf('/home/mayankp/Documents/Personal/stackoverflow/A.txt', header=None)

In [2018]: df_A
Out[2018]: 
     0
0  AAB
1  BBC
2  DDE
3  GGC

In [2019]: df_B = pd.read_fwf('/home/mayankp/Documents/Personal/stackoverflow/B.txt', header=None)

In [2020]: df_B
Out[2020]: 
     0
0  123
1  AAB
2  DDE
3  345
4  GHY
5  GJK

Now merge the two dataframes with an inner join to keep only the rows common to both:

In [2021]: df_C = pd.merge(df_A, df_B, on=0, how='inner')

In [2022]: df_C
Out[2022]: 
     0
0  AAB
1  DDE

Then write the result to a file. Passing header=False keeps the column label 0 out of the output:

In [2023]: df_C.to_csv('out.csv', index=False, header=False)

This needs no explicit loops and no regular expressions, which keeps the code short and simple.

Let me know if this helps.

Try using a dictionary:

import sys


def find_common_lines(arg1, arg2, arg3):
    alllines_dict = {}
    with open(arg1, 'r') as f:
        while True:
            line = f.readline()
            if not line:
                break
            alllines_dict[line.strip()] = 1
    with open(arg3, 'w') as out:
        with open(arg2, 'r') as f:
            while True:
                line2 = f.readline()
                if not line2:
                    break
                line2 = line2.strip()
                ispresent = alllines_dict.get(line2, None)
                if ispresent is not None:
                    out.write(line2 + '\n')


number_of_arguments = len(sys.argv) - 1
if number_of_arguments < 3:
    print("ERROR:\tThe script is called with less than 3 arguments, but it needs 3!")
    print("Usage:\tfind_common_lines.py <file1> <file2> <output_filepath>")
else:
    arg1 = sys.argv[1]
    arg2 = sys.argv[2]
    arg3 = sys.argv[3]
    find_common_lines(arg1, arg2, arg3)
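
A more compact variant of the same idea, offered as a sketch: a set gives the same fast membership test as the dictionary, and iterating over the file object directly replaces the `readline()` loop. The function name `write_common_lines`, the decision to follow the order of the first file, and the decision to write a repeated line only once are all assumptions made here; the question does not specify any of them.

```python
def write_common_lines(path_a, path_b, out_path):
    """Write the lines found in both files to out_path, in path_a's order.

    Returns the list of lines written (without line endings).
    """
    # Collect the second file's lines without their line endings.
    with open(path_b) as fh_b:
        lines_in_b = {line.rstrip("\n") for line in fh_b}

    common = []
    already_written = set()
    with open(path_a) as fh_a, open(out_path, "w") as out:
        for line in fh_a:
            text = line.rstrip("\n")
            # Keep the line only if the other file has it and it is new.
            if text in lines_in_b and text not in already_written:
                out.write(text + "\n")
                already_written.add(text)
                common.append(text)
    return common
```

Called as `write_common_lines('A.txt', 'B.txt', 'C.txt')` on the sample files from the question, this writes AAB and DDE to C.txt. Unlike the dictionary version it only strips the line ending, so lines that differ in leading or trailing spaces are treated as different.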
