
Optimize searching two text files and output based upon a third using Python

I'm having performance issues with a Python function that loads two 5+ GB tab-delimited text files, which share the same format but contain different values, and uses a third text file as a key to determine which values should be kept for output. I'd like some help with speed gains if possible.

Here is the code:

import csv

def rchfile():
    # there are 24752 text lines per stress period, 520 columns, 476 rows
    # there are 52 lines per MODFLOW model row
    lst = []
    out = []
    tcel = 0
    end_loop_break = False

    # key file that sets which file's values to use. If the cell address is not present
    # or the cell id == 1, use baseline.csv; otherwise use the test_p97 file.
    with open('input/nrd_cells.csv') as csvfile:
        reader = csv.reader(csvfile)
        for item in reader:
            lst.append([int(item[0]), int(item[1])])

    # two files that are used for data
    with open('input/test_baseline.rch', 'r') as b, open('input/test_p97.rch', 'r') as c:
        for x in range(3):  # skip the first 3 lines that are the file header
            b.readline()
            c.readline()

        while True:  # loop until end of file; this should loop here 1,025 times
            if end_loop_break:
                break
            for x in range(2):  # skip the first 2 lines that are the stress period header
                b.readline()
                c.readline()

            for rw in range(1, 477):
                if end_loop_break:
                    break

                for cl in range(52):
                    # read both files at the same time and split the 10 values in each row
                    b_row = b.readline().split()
                    c_row = c.readline().split()

                    if not b_row:
                        end_loop_break = True  # assignment, not comparison
                        break

                    for x in range(1, 11):
                        # search the key file for the cell address to find which file's data to keep
                        testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]

                        if not testval:  # cell address not in key file
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 1:  # cell address value == 1
                            out.append(b_row[x - 1])
                        elif lst[testval[0]][1] == 2:  # cell address value == 2
                            out.append(c_row[x - 1])

                        print(cl * 10 + x + tcel)  # test output for cell location

                tcel += 520

    print('success')

The key file looks like:

37794, 1
37795, 0
37796, 2

The data files are large (~5 GB each) and complex from a counting standpoint, but are standard in format and look like:

0    0    0    0    0    0    0    0    0    0
1.5  1.5  0    0    0    0    0    0    0    0

This process takes a very long time, and I was hoping someone could help speed it up.

I believe your speed problem is coming from this line:

testval = [i for i, xi in enumerate(lst) if xi[0] == cl * 10 + x + tcel]

You are iterating over the whole key list for every single value in the two huge data files. This is not good.

It looks like cl * 10 + x + tcel is the value you are searching for in lst[n][0].

May I suggest using a dict instead of a list for storing the data in lst.

lst = {}
for item in reader:
    lst[int(item[0])] = int(item[1])

Now, lst is a mapping, which means you can simply use the in operator to check for the presence of a key. This is a near instant lookup because the dict type is hash based and very efficient for key lookups.

something in lst
# for example
(cl * 10 + x + tcel) in lst

And you can grab the value by:

lst[something]
# or
lst[cl * 10 + x + tcel]

A little bit of refactoring and your code should speed up PROFOUNDLY.
