I have the following Python code snippet that I would like to speed up:
with open('file1.txt', 'r') as in_file1:
    for line1 in in_file1:
        col1_a, col2_a, col3_a, col4_a = line1.rstrip().split("\t")
        with open('file2.txt', 'r') as in_file2:
            for line2 in in_file2:
                col1_b, col2_b, col3_b, col4_b = line2.rstrip().split("\t")
                if col1_a == col1_b:
                    print(line2)
The file sizes are in the GB range. Can anyone suggest a way (or ways) to replace the nested for loops to speed up the code?
Thanks
You could build a set of the first-column values from file1 and then test each line of file2 for membership:
with open('file1.txt', 'r') as in_file1, open('file2.txt', 'r') as in_file2:
    cols_set = {line.split("\t", 1)[0] for line in in_file1}
    for line in in_file2:
        if line.split("\t", 1)[0] in cols_set:
            print(line)
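As a quick sanity check, here is the same set-based lookup run against hypothetical sample data written to temporary files (the file contents and paths below are made up for illustration):

```python
import os
import tempfile

# Hypothetical sample data: file1 supplies the keys, file2 is filtered.
tmp = tempfile.mkdtemp()
path1 = os.path.join(tmp, 'file1.txt')
path2 = os.path.join(tmp, 'file2.txt')

with open(path1, 'w') as f:
    f.write("a\t1\t1\t1\nb\t2\t2\t2\n")
with open(path2, 'w') as f:
    f.write("a\tx\tx\tx\nc\ty\ty\ty\nb\tz\tz\tz\n")

matches = []
with open(path1, 'r') as in_file1, open(path2, 'r') as in_file2:
    # One pass over file1 to build the lookup set, one pass over file2 to filter.
    cols_set = {line.split("\t", 1)[0] for line in in_file1}
    for line in in_file2:
        if line.split("\t", 1)[0] in cols_set:
            matches.append(line.rstrip())

print(matches)  # only the 'a' and 'b' rows of file2 survive
```

Each file is read exactly once, so the cost is O(n + m) lines instead of O(n × m) for the nested loops; only the set of first-column keys from file1 is held in memory.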
If both memory and performance are an issue, you could use a generator to break the files into manageable chunks that are read into memory, and compare each chunk of file2 against each chunk of file1. Note also that only the first column of file1 matters, so you do not have to keep the rest of those lines in memory.
def get_chunks(file_name, lines, operation=lambda x: x):
    with open(file_name, 'r') as file:
        chunk = []
        for line in file:
            chunk.append(operation(line))
            if len(chunk) == lines:
                yield chunk
                chunk = []
        if chunk:  # don't drop the final, partially filled chunk
            yield chunk

def get_first_column(line):
    return line.split('\t')[0]

for chunk1 in get_chunks('file1.txt', 10000, operation=get_first_column):
    keys = set(chunk1)  # set membership is O(1) vs O(n) for a list
    for chunk2 in get_chunks('file2.txt', 1000):
        for line in chunk2:
            if get_first_column(line) in keys:
                print(line)
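To illustrate how the chunking generator splits its input, here is a self-contained run on a hypothetical five-line temp file (the generator is repeated so the snippet runs on its own; note the final `if chunk` guard, without which a trailing partial chunk would be silently dropped):

```python
import os
import tempfile

def get_chunks(file_name, lines, operation=lambda x: x):
    # Yield lists of at most `lines` processed lines from the file.
    with open(file_name, 'r') as file:
        chunk = []
        for line in file:
            chunk.append(operation(line))
            if len(chunk) == lines:
                yield chunk
                chunk = []
        if chunk:  # emit the final partial chunk too
            yield chunk

# Hypothetical input: 5 tab-separated lines, chunked 2 at a time.
tmp = tempfile.mkstemp()[1]
with open(tmp, 'w') as f:
    for i in range(5):
        f.write("k%d\tv%d\n" % (i, i))

chunks = list(get_chunks(tmp, 2, operation=lambda l: l.split('\t')[0]))
print(chunks)  # [['k0', 'k1'], ['k2', 'k3'], ['k4']]
os.remove(tmp)
```

Because file2 is re-read once per chunk of file1, larger file1 chunks mean fewer passes over file2; the chunk sizes trade memory for I/O.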
def my_func(file_name_1, file_name_2):
    for line_1 in open(file_name_1):
        for line_2 in open(file_name_2):
            if line_1.split('\t')[0] == line_2.split('\t')[0]:
                yield line_2
Usage:
for line in my_func('file1.txt', 'file2.txt'):
    print(line)