I have the following Python code snippet that I would like to speed up:
with open('file1.txt', 'r') as in_file1:
    for line1 in in_file1:
        col1_a, col2_a, col3_a, col4_a = line1.rstrip().split("\t")
        with open('file2.txt', 'r') as in_file2:
            for line2 in in_file2:
                col1_b, col2_b, col3_b, col4_b = line2.rstrip().split("\t")
                if col1_a == col1_b:
                    print(line2)
The file sizes are in the GB range. Can anyone suggest a way (or ways) to replace the nested for loops to speed up the code?
Thanks
You could build a set of the first-column values from file1 and then test each line of file2 for membership:
with open('file1.txt', 'r') as in_file1, open('file2.txt', 'r') as in_file2:
    cols_set = {line.split("\t", 1)[0] for line in in_file1}
    for line in in_file2:
        if line.split("\t", 1)[0] in cols_set:
            print(line)
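As a quick sanity check, here is the same set-based lookup run against hypothetical sample data written to temporary files (the file contents and paths below are made up for illustration):

```python
import os
import tempfile

# Hypothetical sample data: file1 supplies the keys, file2 is filtered.
tmp = tempfile.mkdtemp()
path1 = os.path.join(tmp, 'file1.txt')
path2 = os.path.join(tmp, 'file2.txt')

with open(path1, 'w') as f:
    f.write("a\t1\t1\t1\nb\t2\t2\t2\n")
with open(path2, 'w') as f:
    f.write("a\tx\tx\tx\nc\ty\ty\ty\nb\tz\tz\tz\n")

matches = []
with open(path1, 'r') as in_file1, open(path2, 'r') as in_file2:
    # One pass over file1 to build the lookup set, one pass over file2 to filter.
    cols_set = {line.split("\t", 1)[0] for line in in_file1}
    for line in in_file2:
        if line.split("\t", 1)[0] in cols_set:
            matches.append(line.rstrip())

print(matches)  # only the 'a' and 'b' rows of file2 survive
```

Each file is read exactly once, so the cost is O(n + m) lines instead of O(n × m) for the nested loops; only the set of first-column keys from file1 is held in memory.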
If both memory and performance are an issue, you could use a generator to break the files into manageable chunks that are read into memory, and compare each chunk of file2 against each chunk of file1. Note also that only the first column of file1 matters, so you do not have to keep the rest of those lines in memory.
def get_chunks(file_name, lines, operation=lambda x: x):
    with open(file_name, 'r') as file:
        chunk = []
        for line in file:
            chunk.append(operation(line))
            if len(chunk) == lines:
                yield chunk
                chunk = []
        if chunk:  # don't drop the final, partially filled chunk
            yield chunk

def get_first_column(line):
    return line.split('\t')[0]

for chunk1 in get_chunks('file1.txt', 10000, operation=get_first_column):
    keys = set(chunk1)  # set membership is O(1) vs O(n) for a list
    for chunk2 in get_chunks('file2.txt', 1000):
        for line in chunk2:
            if get_first_column(line) in keys:
                print(line)
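To illustrate how the chunking generator splits its input, here is a self-contained run on a hypothetical five-line temp file (the generator is repeated so the snippet runs on its own; note the final `if chunk` guard, without which a trailing partial chunk would be silently dropped):

```python
import os
import tempfile

def get_chunks(file_name, lines, operation=lambda x: x):
    # Yield lists of at most `lines` processed lines from the file.
    with open(file_name, 'r') as file:
        chunk = []
        for line in file:
            chunk.append(operation(line))
            if len(chunk) == lines:
                yield chunk
                chunk = []
        if chunk:  # emit the final partial chunk too
            yield chunk

# Hypothetical input: 5 tab-separated lines, chunked 2 at a time.
tmp = tempfile.mkstemp()[1]
with open(tmp, 'w') as f:
    for i in range(5):
        f.write("k%d\tv%d\n" % (i, i))

chunks = list(get_chunks(tmp, 2, operation=lambda l: l.split('\t')[0]))
print(chunks)  # [['k0', 'k1'], ['k2', 'k3'], ['k4']]
os.remove(tmp)
```

Because file2 is re-read once per chunk of file1, larger file1 chunks mean fewer passes over file2; the chunk sizes trade memory for I/O.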
def my_func(file_name_1, file_name_2):
    for line_1 in open(file_name_1):
        for line_2 in open(file_name_2):
            if line_1.split('\t')[0] == line_2.split('\t')[0]:
                yield line_2
Usage:
for line in my_func('file1.txt', 'file2.txt'):
    print(line)