
Track data changes in Python using hashlib.md5

The problem:

  • I'm running the same Python script multiple times.
  • Sometimes the input data changes (it's pulled from some files).
  • I want to log whether the data has changed.
  • To do this, I hash the data and save the hash.
  • If the hash is different, I know the data has changed.
  • Elsewhere, I save the mapping from file -> hash.

I've written this function to track changes in my data each time I run the script.

    def track_data_change_hash(self, data):
        try:
            import hashlib
            # hash the stringified data; keep a short 12-char prefix as the marker
            data_hash = hashlib.md5(str(data).encode('utf-8')).hexdigest()
            self.track("the_hash", data_hash[:12])
        except Exception:
            print('failed to create dataset hash')
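
For context, the change check itself is just a compare against the last stored hash. A minimal sketch of that step, assuming the previous hash is kept in a plain text file (state_path and its default name are made up for illustration):

    import hashlib
    import os

    def data_changed(data, state_path="last_hash.txt"):
        # same hashing as track_data_change_hash above
        new_hash = hashlib.md5(str(data).encode('utf-8')).hexdigest()[:12]
        old_hash = None
        if os.path.exists(state_path):
            with open(state_path) as f:
                old_hash = f.read().strip()
        # persist the new hash for the next run
        with open(state_path, "w") as f:
            f.write(new_hash)
        return new_hash != old_hash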

My problem is that sometimes the input data can be huge (100 GB), and then this fails.

How can I deal with this? Any good ideas? (I'm thinking of taking just the first X MB of the file / input data and hashing that.)

You need to read the file in chunks of a suitable size:

import hashlib


def md5_for_file(your_data, block_size=2048):
    # read the data block_size bytes at a time, so the whole file
    # never has to fit in memory; pick a block size that suits you
    md5 = hashlib.md5()
    while True:
        data = your_data.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()  # use .hexdigest() if you want the hex string form
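
For example, you would open the file in binary mode and pass the file object in (the file name here is just a placeholder):

with open("big_input.bin", "rb") as f:
    file_hash = md5_for_file(f, block_size=65536)  # bigger blocks mean fewer read() calls
print(file_hash)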

You can also read just a few bytes with seek() and read(), which keeps RAM usage low, like this:

with open("1.txt", "rb") as raw_data:
    raw_data.seek(0) 
    output_data = raw_data.read(12)

# it means you just read 12 bytes of file, then you can just hash this part 
# of your own data and check it with your DB ...
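
Putting the two ideas together: if hashing the whole file is too slow, you can hash only the first few megabytes, as you suggested. A sketch of that (sample_size is an arbitrary choice, and note the trade-off: a change beyond the sampled prefix will not be detected):

import hashlib

def md5_for_prefix(path, sample_size=16 * 1024 * 1024, block_size=65536):
    # hash only the first sample_size bytes of the file, in chunks
    md5 = hashlib.md5()
    remaining = sample_size
    with open(path, "rb") as f:
        while remaining > 0:
            data = f.read(min(block_size, remaining))
            if not data:
                break  # file is shorter than sample_size
            md5.update(data)
            remaining -= len(data)
    return md5.hexdigest()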

Good Luck.
