The problem:
I've written this function to track changes to my data each time I run the script.
def track_data_change_hash(self, data):
    try:
        import hashlib
        data_hash = hashlib.md5(str(data).encode('utf-8')).hexdigest()
        self.track("the_hash", data_hash[:12])
    except Exception:
        print('failed to create dataset hash')
My problem is that sometimes the input data can be huge (100 GB), and then this fails.
How can I deal with this? Any good ideas? (I'm thinking of taking the first X MB of the file / input data and just hashing that, as in the sketch below.)
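Something like the following is what I have in mind (a rough sketch only; the helper name, the path, and the 16 MB limit are made-up placeholders, assuming the data lives in a file):

import hashlib

def md5_first_mb(path, limit_mb=16):
    # hash only the first limit_mb megabytes of the file
    md5 = hashlib.md5()
    remaining = limit_mb * 1024 * 1024
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(65536, remaining))
            if not chunk:
                break
            md5.update(chunk)
            remaining -= len(chunk)
    return md5.hexdigest()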
You need to read the file in chunks of a suitable size:
import hashlib

def md5_for_file(your_data, block_size=2048):
    # read the data block_size (2048) bytes at a time, step by step;
    # change block_size to whatever suits your needs
    md5 = hashlib.md5()
    while True:
        data = your_data.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()
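For example, you would call it with an open file object (the filename here is only illustrative):

with open("my_dataset.bin", "rb") as f:
    file_hash = md5_for_file(f)     # raw bytes digest
    print(file_hash.hex()[:12])     # same 12-character prefix as your track() call uses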
You can also read just some of the bytes for better RAM usage with the seek() and read() functions, like this:
with open("1.txt", "rb") as raw_data:
raw_data.seek(0)
output_data = raw_data.read(12)
# it means you just read 12 bytes of file, then you can just hash this part
# of your own data and check it with your DB ...
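Putting the two together, a minimal sketch (the 1 MB size is an arbitrary choice, not something from your code):

import hashlib

with open("1.txt", "rb") as raw_data:
    raw_data.seek(0)
    partial = raw_data.read(1024 * 1024)                   # only the first 1 MB
    partial_hash = hashlib.md5(partial).hexdigest()[:12]
    # store or compare partial_hash with the value in your DB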
Good Luck.