Memory Error Python When Processing Files

I have a backup hard drive that I know has duplicate files scattered around and I decided it would be a fun project to write a little python script to find them and remove them. I wrote the following code just to traverse the drive and calculate the md5 sum of each file and compare it to what I am going to call my "first encounter" list. If the md5 sum does not yet exist, then add it to the list. If the sum does already exist, delete the current file.

import sys
import os
import hashlib

def checkFile(fileHashMap, file):
    fReader = open(file)
    fileData = fReader.read();
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

    if fileHash in fileHashMap:
        ### Duplicate file.
        fileHashMap[fileHash].append(file)
        return True
    else:
        fileHashMap[fileHash] = [file]
        return False


def main(argv):
    fileHashMap = {}
    fileCount = 0
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            fileCount += 1
            print("------------: " + str(fileCount))
            print(curDir + file)
            checkFile(fileHashMap, curDir + file)

if __name__ == "__main__":
    main(sys.argv)

The script processes about 10Gb worth of files and then throws MemoryError on the line 'fileData = fReader.read()'. I thought that since I am closing the fReader and marking the fileData for deletion after I have calculated the md5 sum I wouldn't run into this. How can I calculate the md5 sums without running into this memory error?

Edit: I was requested to remove the dictionary and look at the memory usage to see if there may be a leak in hashlib. Here was the code I ran.

import sys
import os
import hashlib

def checkFile(file):
    fReader = open(file)
    fileData = fReader.read();
    fReader.close()
    fileHash = hashlib.md5(fileData).hexdigest()
    del fileData

def main(argv):
    for curDir, subDirs, files in os.walk(argv[1]):
        print(curDir)
        for file in files:
            print("------: " + str(curDir + file))
            checkFile(curDir + file)

if __name__ == "__main__":
    main(sys.argv)

and I still get the memory crash.

Your problem is in reading the entire files: they're too big for your system to load into memory all at once, so the read throws the error.

As you can see in the official Python documentation, MemoryError is:

Raised when an operation runs out of memory but the situation may still be rescued (by deleting some objects). The associated value is a string indicating what kind of (internal) operation ran out of memory. Note that because of the underlying memory management architecture (C's malloc() function), the interpreter may not always be able to completely recover from this situation; it nevertheless raises an exception so that a stack traceback can be printed, in case a run-away program was the cause.

For your purpose, you can still use hashlib.md5(); just read the file sequentially in 4096-byte chunks and feed each chunk to the hash object instead of reading everything at once:

def md5(fname):
    md5Hash = hashlib.md5()
    # Open in binary mode so raw bytes are hashed, not decoded text.
    with open(fname, "rb") as f:
        # Read 4096-byte chunks until read() returns the empty bytes object (EOF).
        for chunk in iter(lambda: f.read(4096), b""):
            md5Hash.update(chunk)
    return md5Hash.hexdigest()
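
A minimal sketch of how the original checkFile could use this helper, assuming the md5() function above is in scope; building the path with os.path.join is my addition, not part of the question:

def checkFile(fileHashMap, filePath):
    # Hash the file without ever holding its whole contents in memory.
    fileHash = md5(filePath)
    if fileHash in fileHashMap:
        fileHashMap[fileHash].append(filePath)  # duplicate seen before
        return True
    fileHashMap[fileHash] = [filePath]          # first encounter
    return False

# In main(), build the path so files in subdirectories resolve correctly:
#     checkFile(fileHashMap, os.path.join(curDir, file))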

Not a solution to your memory problem, but an optimization that might avoid it:

  • small files: calculate md5 sum, remove duplicates

  • big files: remember size and path

  • at the end, only calculate md5 sums of files of the same size, and only when there is more than one file of that size

Python's collections.defaultdict might be useful for this; a rough sketch of the idea follows.
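
A minimal sketch of that strategy, assuming the chunked md5() helper from the answer above is available and an arbitrary SMALL_FILE_LIMIT threshold (both names are illustrative, not from the original answer):

import os
import collections

SMALL_FILE_LIMIT = 4 * 1024 * 1024   # assumed threshold: hash anything under 4 MB immediately

def find_duplicates(root):
    seen_hashes = {}                          # md5 -> first path (small files)
    by_size = collections.defaultdict(list)   # size -> paths (big files, hashed later)
    duplicates = []

    for curDir, subDirs, files in os.walk(root):
        for name in files:
            path = os.path.join(curDir, name)
            size = os.path.getsize(path)
            if size < SMALL_FILE_LIMIT:
                fileHash = md5(path)           # chunked helper defined earlier
                if fileHash in seen_hashes:
                    duplicates.append(path)
                else:
                    seen_hashes[fileHash] = path
            else:
                by_size[size].append(path)

    # Only hash big files whose size collides with at least one other file.
    for size, paths in by_size.items():
        if len(paths) > 1:
            first_seen = {}
            for path in paths:
                fileHash = md5(path)
                if fileHash in first_seen:
                    duplicates.append(path)
                else:
                    first_seen[fileHash] = path
    return duplicates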

How about calling the openssl command from Python? It works on both Windows and Linux:

$ openssl md5 "file"
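
One way to invoke that from Python is the standard subprocess module; the sketch below assumes the openssl binary is on PATH and that its output has the usual "MD5(path)= digest" form:

import subprocess

def md5_via_openssl(path):
    # Runs: openssl md5 <path>; typical output looks like "MD5(<path>)= <hexdigest>"
    result = subprocess.run(["openssl", "md5", path],
                            capture_output=True, text=True, check=True)
    # Take everything after the last "= " as the hex digest.
    return result.stdout.strip().rsplit("= ", 1)[-1]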
