
Most efficient way to compare multiple files in Python

My problem is this: I have one file with 3000 lines and 8 columns (space-delimited). The important thing is that the first column is a number ranging from 1 to 22, so following the divide-and-conquer principle I split the original file into 22 subfiles based on the first-column value.

I also have result files from 15 sources, each source containing one result file. Since each result file is too big, I applied divide-and-conquer once more and split each of the 15 results into 22 subfiles.
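For reference, each set of splits can be produced in a single pass by keeping one output handle per first-column value. A minimal sketch, assuming the 1-22 value is the first space-delimited field (the function name and the out_dir parameter are illustrative, not from the original code):

import os

def split_by_first_column(path, out_dir="."):
    # Write each line of `path` to split_<first column>.txt; keeping one
    # open handle per value avoids re-reading the source 22 times.
    handles = {}
    try:
        with open(path, "r") as src:
            for line in src:
                chrom = line.split(" ", 1)[0]  # first-column value, 1..22
                if chrom not in handles:
                    out_path = os.path.join(out_dir, "split_" + chrom + ".txt")
                    handles[chrom] = open(out_path, "w")
                handles[chrom].write(line)
    finally:
        for fh in handles.values():
            fh.close()

Holding up to 22 handles open at once keeps the split to one sequential read of the source file.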

The file structure is as follows:

Original_file                Studies
    split_1                      study1
                                     split_1, split_2, ...
    split_2                      study2
                                     split_1, split_2, ...
    split_3                      ...
    ...                          study15
                                     split_1, split_2, ...
    split_22

By doing this we pay a slight overhead up front, but all of these split files are deleted at the end, so it doesn't really matter.

I need my final output to be the original file with some values from the studies appended to it.

So, my take so far is this:

Algorithm:
    for i in range(1, 23):
        for j in range(1, 16):
            compare (split_i of the original file) with the jth study's split_i
            if the value in a specific column matches:
                build a list with the needed columns from both files, join it with ' '.join(list), and write the result to the outfile

Is there a better way to approach this problem? The study files range from 300MB to 1.5GB in size.

Here's my Python code so far:

import os

folders = ["study" + str(j) for j in range(1, 16)]  # study1 .. study15

with open("Effects_final.txt", "w") as outfile:
    for chrom in range(1, 23):  # one pass per chromosome split
        small_file = "split_" + str(chrom) + ".txt"
        with open(small_file, "r") as sf:
            for sline in sf:  # lines of the original file's split
                sf_parts = sline.rstrip("\n").split(" ")
                for f in folders:
                    # each study's splits live inside its own folder
                    file_to_compare_with = os.path.join(f, "split_" + str(chrom) + ".txt")
                    with open(file_to_compare_with, "r") as cf:  # comparison file
                        for cline in cf:
                            cf_parts = cline.rstrip("\n").split(" ")
                            # the ID sits in column 0 of the study row
                            # and column 1 of the original row
                            if cf_parts[0] == sf_parts[1]:
                                outfile.write(" ".join(cf_parts + sf_parts) + "\n")

But this code uses 4 nested loops, which feels like overkill, yet it seems unavoidable since the lines of the two files being compared have to be read at the same time. This is my concern...

I found one solution that runs in a reasonable amount of time. The code is the following:

with open("output_file", 'w') as outfile:
    for i in range(1,23):
        dict1 = {}  # use a dictionary to map values from the inital file
        with open("split_i", 'r') as split:
            next(split) #skip the header
            line_list = line.split(delimiter)
            for line in split:
                dict1[line_list[whatever_key_u_use_as_id]] = line_list

            compare_dict = {}
            for f in folders:
                with open("each folder", 'r') as comp:
                    next(comp) #skip the header
                    for cline in comp:
                        cparts = cline.split('delimiter')
                        compare_dict[cparts[whatever_key_u_use_as_id]] = cparts
            for key in dict1:
                if key in compare_dict:
                    outfile.write("write your data")
outfile.close()

With this approach, I'm able to process this dataset in ~10 minutes. Surely there is room for improvement. One idea is to take the time to sort the datasets; that way later lookups will be quicker, and we might save time!
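If you go the sorting route, a merge join fits naturally: pre-sort each split once on its key column (e.g. lexicographically with Unix sort, after stripping the headers), and then each (study, chromosome) comparison becomes a single simultaneous pass over both files, with no dictionaries at all. A rough sketch under those assumptions, keeping the same key columns as the code above (the function name is illustrative):

def merge_join(small_path, study_path, outfile, delimiter=" "):
    # Assumes both files are sorted on their key columns (column 1 of
    # the original split, column 0 of the study split) and that keys
    # are unique within each file, as the dictionary approach assumes.
    with open(small_path, "r") as sf, open(study_path, "r") as cf:
        sline = sf.readline()
        cline = cf.readline()
        while sline and cline:
            s_parts = sline.rstrip("\n").split(delimiter)
            c_parts = cline.rstrip("\n").split(delimiter)
            if c_parts[0] < s_parts[1]:
                cline = cf.readline()  # study key is behind, advance study
            elif c_parts[0] > s_parts[1]:
                sline = sf.readline()  # original key is behind, advance original
            else:
                outfile.write(delimiter.join(c_parts + s_parts) + "\n")
                sline = sf.readline()
                cline = cf.readline()

Called once per (study, chromosome) pair inside the same two outer loops as before, this trades the dictionaries' memory for a one-off sorting cost, which may pay off with study files in the 300MB-1.5GB range.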
