
How to check for reversed-order string tuples and eliminate them from a large (>60GB) text file in Python?

I have the following code to find reversed-order string tuples and eliminate them from a text file. But it takes an extremely long time on large text files (>60GB) and my system crashes.

with open("OUTPUT.txt.txt", "w") as output:
    for fileName in ["Large_INPUT.txt"]:
        found_combinations = set()
        with open(fileName, 'r') as file1:
            for line in file1:
                cols = [col.strip() for col in line.strip().split('\t')]
                new_combination = frozenset(cols)
                if new_combination not in found_combinations:
                    found_combinations.add(new_combination)
                    out = ', '.join(cols) + '\n'
                    output.write(out)

For example, if the input is:

352_0F, 352_1F,  0.913
352_1F, 352_0F,  0.913

The expected output is:

352_0F, 352_1F,  0.913

Is there a way to optimize this code for large files?

For the program crash, I suspect that your set is exceeding available memory. With 22-character lines, a 65GB file will generate close to 3 billion entries in the set (assuming you don't have a large proportion of duplicate lines). This simply will not fit on a 32-bit system, so make sure you're running a 64-bit Python with plenty of memory.
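One way to shrink that memory footprint (a sketch, not part of the answer above: it stores a fixed-size 8-byte digest of each canonicalized line instead of a full frozenset of strings; `line_key` and `is_new` are illustrative names):

```python
import hashlib

def line_key(line):
    """Return an 8-byte digest for a line, treating the first two
    tab-separated columns as an unordered pair."""
    cols = [col.strip() for col in line.strip().split('\t')]
    a, b = sorted(cols[:2])                     # canonical order for the pair
    raw = '\t'.join([a, b] + cols[2:])
    return hashlib.blake2b(raw.encode(), digest_size=8).digest()

seen = set()

def is_new(line):
    """True the first time a line (or its reversed twin) is seen."""
    key = line_key(line)
    if key in seen:
        return False
    seen.add(key)
    return True
```

Eight-byte digests cut per-entry storage substantially compared to frozensets of Python strings, at the cost of a vanishingly small chance of hash collisions silently dropping a distinct line.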

For the performance, you should check if the process is CPU bound or I/O bound. Reading large files one line at a time may take a long time even without doing any processing. Take a file that doesn't make the program crash (or cut one down for testing) and measure the time it takes to simply input one file and output the same file (without any filtering). That will be the minimum time you can get with line by line processing. If that is close to the time it takes when filtering, then you have an I/O problem. Making sure you are on SSD storage would be a first step. You could also try the solution proposed here (although I'm not sure it would apply to your environment): https://stackoverflow.com/a/60571361/5237560
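A minimal way to get that baseline measurement (a sketch; `copy_lines` is an illustrative name, and the demo writes a small throwaway file — point `src` at your real file for an actual measurement):

```python
import time, tempfile

def copy_lines(src, dst):
    """Baseline: read src line by line and write it back unchanged,
    to measure pure I/O cost with no filtering at all."""
    with open(src, 'r') as fin, open(dst, 'w') as fout:
        for line in fin:
            fout.write(line)

# Small demo input; replace with "Large_INPUT.txt" for a real test.
src = tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False)
src.write("352_0F\t352_1F\t0.913\n" * 1000)
src.close()
dst = src.name + '.copy'

start = time.perf_counter()
copy_lines(src.name, dst)
print(f"baseline copy took {time.perf_counter() - start:.3f} s")
```

If the filtered run is barely slower than this copy, the bottleneck is I/O rather than the set lookups.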

With only 3 values per line, there is an opportunity to use a dictionary to break the single set down into smaller ones. Assuming that the tuple reversal only occurs between the first two values, you could group pairs by the third value. This would limit the size of each individual set (assuming those third values are sufficiently varied).

For example:

from collections import defaultdict

...
seenTuples = defaultdict(set)  # third value -> set of (code1, code2) pairs
for line in file1:
    code1, code2, value = [col.strip() for col in line.strip().split('\t')]
    if code1 > code2: code1, code2 = code2, code1   # canonical order for the pair
    if (code1, code2) in seenTuples[value]: continue # duplicate, skip this line
    seenTuples[value].add((code1, code2))
    output.write(line)
