The problem:
I've written this function to track changes to my data each time I run the script.
def track_data_change_hash(self, data):
    try:
        import hashlib
        data_hash = hashlib.md5(str(data).encode('utf-8')).hexdigest()
        self.track("the_hash", data_hash[:12])
    except Exception:
        print('failed to create dataset hash')
My problem is that sometimes the input data can be huge (100 GB), and then this fails.
How can I deal with this? Any good ideas? (I'm thinking of taking the first X MB of the file / input data and just hashing that, as in the sketch below.)
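Something like the following is what I have in mind (a rough sketch only; the helper name, the path, and the 16 MB limit are made-up placeholders, assuming the data lives in a file):

import hashlib

def md5_first_mb(path, limit_mb=16):
    # hash only the first limit_mb megabytes of the file
    md5 = hashlib.md5()
    remaining = limit_mb * 1024 * 1024
    with open(path, "rb") as f:
        while remaining > 0:
            chunk = f.read(min(65536, remaining))
            if not chunk:
                break
            md5.update(chunk)
            remaining -= len(chunk)
    return md5.hexdigest()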
You need to read the file in chunks of a suitable size:
import hashlib

def md5_for_file(your_data, block_size=2048):
    # read the data block_size (2048) bytes at a time, step by step;
    # change block_size to whatever suits your needs
    md5 = hashlib.md5()
    while True:
        data = your_data.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.digest()
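For example, you would call it with an open file object (the filename here is only illustrative):

with open("my_dataset.bin", "rb") as f:
    file_hash = md5_for_file(f)     # raw bytes digest
    print(file_hash.hex()[:12])     # same 12-character prefix as your track() call uses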
You can also read just some of the bytes for better RAM usage with the seek() and read() functions, like this:
with open("1.txt", "rb") as raw_data:
raw_data.seek(0)
output_data = raw_data.read(12)
# it means you just read 12 bytes of file, then you can just hash this part
# of your own data and check it with your DB ...
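Putting the two together, a minimal sketch (the 1 MB size is an arbitrary choice, not something from your code):

import hashlib

with open("1.txt", "rb") as raw_data:
    raw_data.seek(0)
    partial = raw_data.read(1024 * 1024)                   # only the first 1 MB
    partial_hash = hashlib.md5(partial).hexdigest()[:12]
    # store or compare partial_hash with the value in your DB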
Good Luck.