How to diff two files using a Python generator

I have one 100 GB file containing the numbers 1 to 1000000000000, one per line. Some lines are missing, for example 5, 11, 19919, etc. My RAM size is 8 GB.

How can I find the missing elements?

My idea is to create a second file with for i in range(1,1000000000000) and read the lines one by one using a generator. Can the yield statement be used for this?

Can someone help me write the code?

My code is below. It loads both files into sets. Can it be used in production?

def difference(a,b):
    with open(a,'r') as f:
        aunique=set(f.readlines())


    with open(b,'r') as f:
        bunique=set(f.readlines())

    with open('c','a+') as f:
        for line in list(bunique - aunique):
            f.write(line)

If the values are in sequential order, you can simply note the previous value and see if the difference equals one:

prev = 0
with open('numbers.txt', 'r') as f:
    for line in f:
        value = int(line.strip())
        # report every number skipped between the previous value and this one
        for i in range(prev, value - 1):
            print('missing:', i + 1)
        prev = value  # must be updated inside the loop, once per line
# output numbers that are missing at the end of the file (see comment by @blhsing)
for i in range(prev, 1000000000000):
    print('missing:', i + 1)

This should work fine in Python 3: iterating over the file object reads one line at a time, so the full file is never loaded or kept in memory. (Calling readlines() instead would build a list of all lines at once.)

You can iterate over all the numbers generated by range and keep comparing the number to the next number in the file. Output the number if it's missing, or read the next number for the next match:

with open('numbers') as f:
    next_number = 0
    for n in range(1000000000001):
        if n == next_number:
            next_number = int(next(f, 0))
        else:
            print(n)

Demo (assuming target numbers from 1 to 10 instead): https://repl.it/repls/WaterloggedUntimelyCoding
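
The same loop can also be tried locally without the external demo. This is only a minimal sketch, assuming targets from 1 to 10 and using io.StringIO as a stand-in for the real file; the function name missing_by_range and the sample data are made up for illustration and are not part of the answer:

```python
import io

def missing_by_range(f, upper):
    """Yield every n in 1..upper that does not appear in the sorted file f."""
    next_number = 0
    for n in range(upper + 1):
        if n == next_number:
            # n is present (or n == 0 on the first pass): fetch the next value.
            # At end of file next() returns the default 0, which never matches again,
            # so every remaining n is reported as missing.
            next_number = int(next(f, 0))
        else:
            yield n

# Hypothetical sample data: 3, 6, 7 and 9 are absent.
sample = io.StringIO("1\n2\n4\n5\n8\n10\n")
result = list(missing_by_range(sample, 10))
print(result)
```

Wrapping the loop in a generator keeps memory use constant, since only one line and one counter are held at a time.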

Assuming the numbers in the file are already sorted, this is an improved version of @ilmiacs's solution.

def find_missing(f, line_number_ub):
    missing = []
    next_expected = 1
    for i in map(int, f):
        # The logic is correct without if, but adding it can greatly boost the 
        # performance especially when the percentage of missing numbers is small
        if next_expected < i:
            missing += range(next_expected, i)
        next_expected = i + 1
    missing += range(next_expected, line_number_ub)
    return missing

with open(path, 'r') as f:
    # line_number_ub is exclusive, so pass 10**12 + 1 to include 10**12 itself
    print(*find_missing(f, 10**12 + 1), sep='\n')

If a generator is preferred over a list, you can do

def find_missing_gen(f, line_number_ub):
    next_expected = 1
    for i in map(int, f):
        if next_expected < i:
            yield from range(next_expected, i)
        next_expected = i + 1
    # line_number_ub is exclusive
    yield from range(next_expected, line_number_ub)

with open(path, 'r') as f:
    # loop instead of print(*...), which would unpack the whole generator into memory
    for n in find_missing_gen(f, 10**12 + 1):
        print(n)

The following performance measurements use a list of strings from 1 to 9999 with 100 randomly selected values missing:

(find_missing) 2.35 ms ± 38.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(find_missing w/o if) 4.67 ms ± 31.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(@blhsing's solution) 3.54 ms ± 39.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(find_missing_gen) 2.35 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
(find_missing_gen w/o if) 4.42 ms ± 14 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
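
To reproduce this kind of measurement, something like the following could be used. It is only a sketch: the helper make_sample, the fixed seed and the use of timeit instead of the %timeit magic are assumptions, not the setup that produced the numbers above, and the timings will differ per machine.

```python
import random
import timeit

def find_missing_gen(lines, upper_exclusive):
    """Yield numbers in 1..upper_exclusive-1 absent from the sorted lines."""
    next_expected = 1
    for i in map(int, lines):
        if next_expected < i:
            yield from range(next_expected, i)
        next_expected = i + 1
    yield from range(next_expected, upper_exclusive)

def make_sample(upper=9999, n_missing=100, seed=0):
    """Build a list of strings '1\\n'..'upper\\n' with n_missing values dropped at random."""
    rng = random.Random(seed)  # hypothetical fixed seed, for repeatability only
    dropped = set(rng.sample(range(1, upper + 1), n_missing))
    lines = [f"{n}\n" for n in range(1, upper + 1) if n not in dropped]
    return lines, sorted(dropped)

lines, dropped = make_sample()
found = list(find_missing_gen(lines, 10000))

# Time 100 full passes over the sample and report the mean per pass.
seconds = timeit.timeit(lambda: list(find_missing_gen(lines, 10000)), number=100)
print(f"{seconds / 100 * 1e3:.2f} ms per loop")
```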

You may want to run some preliminary tests on your machine with 1 GB files to estimate whether the performance on a 100 GB file would meet your requirements. If not, you could consider further optimizations such as reading the file in blocks and using more advanced algorithms to find the missing numbers.
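
As an illustration of the "reading the file in blocks" idea, here is one possible sketch. It assumes the file is opened in binary mode and holds one integer per line; the names iter_numbers_in_blocks and find_missing_blocks and the block size are hypothetical. Python's file iteration is already buffered, so whether this is actually faster would have to be measured.

```python
import io

def iter_numbers_in_blocks(f, block_size=1 << 20):
    """Yield integers from a binary file object, reading block_size bytes at a time."""
    tail = b""  # partial line carried over from the previous block
    while True:
        block = f.read(block_size)
        if not block:
            break
        lines = (tail + block).split(b"\n")
        tail = lines.pop()  # the last piece may be an incomplete line
        for line in lines:
            line = line.strip()
            if line:
                yield int(line)
    tail = tail.strip()
    if tail:
        yield int(tail)  # final line without a trailing newline

def find_missing_blocks(f, upper_exclusive, block_size=1 << 20):
    """Yield numbers in 1..upper_exclusive-1 that are absent from the sorted file."""
    next_expected = 1
    for value in iter_numbers_in_blocks(f, block_size):
        if next_expected < value:
            yield from range(next_expected, value)
        next_expected = value + 1
    yield from range(next_expected, upper_exclusive)

# Tiny in-memory stand-in for the real file; a 4-byte block forces lines to be split.
sample = io.BytesIO(b"1\n2\n4\n5\n8\n10\n")
result = list(find_missing_blocks(sample, 11, block_size=4))
print(result)
```

For a real file this would be called as find_missing_blocks(open(path, 'rb'), 10**12 + 1), ideally inside a with block.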

Here is my "solution-code":

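with open("yourPath.txt", "r") as file:
    distance = 0  # total count of numbers found missing so far

    # count positions from 1 so that index is the number expected on each line
    for index, element in enumerate(file, start=1):
        element = int(element)  # not needed if your numbers are already integers
        index += distance
        if index != element:
            # print every number from the expected one up to, but not including, element
            for missing_number in range(index, element):
                print(f"{missing_number} is missing!")
            distance += element - index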

(Short) Explanation
Example case:
Let's say you have this list: [1, 2, 3, 5, 6, 9]
The if-clause is triggered when the loop reaches 5.
At that moment element is 5 and index is 4, because index holds the number that should appear at that position. Since index is one lower than element, distance grows by one, so index is shifted up for every later line and stays in step with the file. When element is higher than index again (9 in this example, where index is 7), distance grows by the gap between them and the numbers between 6 and 9, namely 7 and 8, are printed.
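
The example case can be traced with a small sketch. It assumes positions are counted from 1 and collects the results in a list instead of printing them; the function name missing_by_distance is made up for this illustration:

```python
def missing_by_distance(values):
    """Return the numbers skipped in a sorted iterable that should start at 1."""
    missing = []
    distance = 0  # how many numbers have been skipped so far
    for index, element in enumerate(values, start=1):
        index += distance  # the number expected at this position
        if index != element:
            # everything from the expected number up to element - 1 is missing
            missing.extend(range(index, element))
            distance += element - index
    return missing

result = missing_by_distance([1, 2, 3, 5, 6, 9])
print(result)  # [4, 7, 8]
```

Note that numbers missing after the last line of the file are not detected by this approach, because the loop only compares against values that are actually present.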

Note:
My english isn't the best... so feel free to edit it :)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 