Memory Error Python When Processing Files

I have a backup hard drive that I know has duplicate files scattered around, and I decided it would be a fun project to write a little Python script to find and remove them. I wrote the following code just to traverse the drive, calculate the md5 sum of each file, and compare it to what I am going to call my "first encounter" list. If the md5 sum does not yet exist, then add it to the list. If the sum does already exist, delete the current file.

import sys
import os
import hashlib

def checkFile(fileHashMap, file):
    fReader = open(file)
    fileData = fReader.read()
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

    if fileHash in fileHashMap:
        ### Duplicate file.
        fileHashMap[fileHash].append(file)
        return True
    else:
        fileHashMap[fileHash] = [file]
        return False


def main(argv):
    fileHashMap = {}
    fileCount = 0
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            fileCount += 1
            print("------------: " + str(fileCount))
            print(curDir + file)
            checkFile(fileHashMap, curDir + file)

if __name__ == "__main__":
    main(sys.argv)

The script processes about 10 GB worth of files and then throws a MemoryError on the line 'fileData = fReader.read()'. I thought that since I am closing fReader and deleting fileData after I have calculated the md5 sum, I wouldn't run into this. How can I calculate the md5 sums without running into this memory error?

Edit: I was asked to remove the dictionary and look at the memory usage to see whether there might be a leak in hashlib. Here is the code I ran.

import sys
import os
import hashlib

def checkFile(file):
    fReader = open(file)
    fileData = fReader.read()
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

def main(argv):
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            print("------: " + str(curDir + file))
            checkFile(curDir + file)

if __name__ == "__main__":
    main(sys.argv)

and I still get the memory crash.

Your problem is that you are reading each file in its entirety; the files are too big for your system to load into memory all at once, so it throws the error.

As you can see in the Official Python Documentation, the MemoryError is:

Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C's malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.

For your purpose, you can use hashlib.md5() incrementally.

In that case, you read chunks of 4096 bytes sequentially and feed them to the md5 object's update() method:

def md5(fname):
    md5_hash = hashlib.md5()
    # Open in binary mode and read 4096-byte chunks so the whole file
    # never has to be held in memory at once.
    with open(fname, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            md5_hash.update(chunk)
    return md5_hash.hexdigest()
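
You can then drop this helper into your checkFile so the file contents are never read in one go (a sketch reusing the fileHashMap dict from your question):

def checkFile(fileHashMap, file):
    fileHash = md5(file)  # chunked md5 from above; never loads the whole file
    if fileHash in fileHashMap:
        ### Duplicate file.
        fileHashMap[fileHash].append(file)
        return True
    else:
        fileHashMap[fileHash] = [file]
        return False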

Not a solution to your memory problem, but an optimization that might avoid it:

  • small files: calculate md5 sum, remove duplicates

  • big files: remember size and path

  • at the end, only calculate md5 sums for files of the same size when there is more than one file of that size

Python's collections.defaultdict might be useful for this, as sketched below.
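
A minimal sketch of that size-first pass, assuming the chunked md5() helper from the answer above (the find_duplicates name is just for illustration):

import os
from collections import defaultdict

def find_duplicates(root):
    # First pass: group every file path by its size; only same-size
    # files can possibly be duplicates.
    by_size = defaultdict(list)
    for curDir, subDirs, files in os.walk(root):
        for name in files:
            path = os.path.join(curDir, name)
            by_size[os.path.getsize(path)].append(path)

    # Second pass: hash only the groups with more than one file.
    by_hash = defaultdict(list)
    for size, paths in by_size.items():
        if len(paths) > 1:
            for path in paths:
                by_hash[md5(path)].append(path)  # chunked md5() from above

    # Any hash shared by more than one path is a set of duplicates.
    return [paths for paths in by_hash.values() if len(paths) > 1]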

How about calling the openssl command from Python? It works on both Windows and Linux:

$ openssl md5 "file"
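
A minimal sketch of that idea with subprocess, assuming the openssl binary is on the PATH and that its output follows the usual "MD5(file)= <hex>" format:

import subprocess

def openssl_md5(path):
    # Run `openssl md5 <path>` and return just the hex digest.
    output = subprocess.check_output(["openssl", "md5", path])
    # Typical output: b'MD5(path)= d41d8cd98f00b204e9800998ecf8427e\n'
    return output.decode().strip().split("= ")[-1]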
