How to handle memory error during parsing in Python?

I'm getting a memory error and I'm trying to find the best solution for this problem. Basically, I'm downloading a lot of XML files via multiple threads of the same class. My class uses the following command to download the files:

urlretrieve(link, filePath)

I'm saving the path of the downloaded files into a Queue that is synced between the threads.

downloadedFilesQ.put(filePath)

In another class (also multiple threads) I try to parse those XML files and turn them into Python objects that I will later save in the db. I'm using the following command to parse a file:

    xmldoc = minidom.parse(downloadedFilesQ.get())

The download and parsing flows run simultaneously. The download flow finishes after about 2 minutes, while the parsing flow takes about 15 minutes. After about 15 minutes I get a MemoryError on the following line:

Exception in thread XMLConverterToObj-21:
Traceback (most recent call last):
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 926, in _bootstrap_inner
    self.run()
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\myuser\PycharmProjects\weat\Parsers\ParseXML.py", line 77, in parseXML
    xmldoc = minidom.parse(xml_file)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\minidom.py", line 1958, in parse
    return expatbuilder.parse(file)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\expatbuilder.py", line 911, in parse
    result = builder.parseFile(fp)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
  File "c:\_work\16\s\modules\pyexpat.c", line 417, in StartElement
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\expatbuilder.py", line 746, in start_element_handler
    _append_child(self.curNode, node)
  File "C:\Users\myuser\AppData\Local\Programs\Python\Python37-32\lib\xml\dom\minidom.py", line 291, in _append_child
    childNodes.append(node)
MemoryError

The download flow fetches about 1700 files, ~1.2GB in total. Each XML file is between 200 bytes and 9MB (max). Until the memory error, my code succeeds in creating about 500K Python objects of the same class:

from sqlalchemy import Column, BIGINT, Integer, TEXT
from base import Base

class Business(Base):

    __tablename__ = 'business'
    id = Column(BIGINT, primary_key=True)
    BName = Column('business_name', TEXT)
    owner = Column('owner_id', Integer)
    city = Column('city', TEXT)
    address = Column('address', TEXT)

    def __init__(self, BName, owner, city=None, address=None, workingHours=None):
        # workingHours is accepted but not stored on the model
        self.BName = BName
        self.owner = owner
        self.city = city
        self.address = address
The option I considered is: once I reach 100K Python objects, save them to the db and then continue parsing. The problem is that the same business can appear in multiple files, so I wanted to parse all the files first and then insert each business into a set (in order to ignore the repeated businesses).
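For reference, a minimal sketch of that batching idea; the session handling, the uniqueness key, and BATCH_SIZE below are my assumptions, not code from the project:

    BATCH_SIZE = 100_000
    seen_keys = set()   # uniqueness keys of businesses already queued or saved
    pending = []

    def add_business(session, business):
        key = (business.BName, business.owner)    # hypothetical uniqueness key
        if key in seen_keys:
            return                                # repeated business: skip it
        seen_keys.add(key)
        pending.append(business)
        if len(pending) >= BATCH_SIZE:
            session.bulk_save_objects(pending)    # flush the batch to the db
            session.commit()
            pending.clear()                       # release the Python objects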

Are there other things I can try?

You appear to keep everything in memory at the same time. RAM, the memory a computer works with, is much more limited than storage (hard disks). So you can easily keep a lot of XML documents on your storage but cannot hold everything in RAM at the same time.

In your case this means that you should change your program fundamentally.

Your program should work in a streaming fashion, meaning it should load one XML document, parse it, process it somehow, store its results in a database, and then forget about this document again. The last point is vital to free the RAM the document occupied.
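A minimal sketch of such a loop in Python; extract_businesses and save_to_db are hypothetical placeholders for the extraction and persistence steps, and the None sentinel marking the end of the queue is an assumption:

    import xml.etree.ElementTree as ET

    def parse_worker(downloadedFilesQ):
        while True:
            path = downloadedFilesQ.get()
            if path is None:          # assumed sentinel: downloads finished
                break
            tree = ET.parse(path)     # ElementTree builds a leaner tree than minidom
            businesses = extract_businesses(tree.getroot())   # hypothetical helper
            save_to_db(businesses)                            # hypothetical helper
            del tree, businesses      # drop all references so the document can be freed

Because only one parsed document is referenced at any time, memory use stays roughly constant no matter how many files arrive.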

Now you write that you need to figure out which documents are repeated.

To achieve this, I propose not storing the whole documents in memory but just a hash value for each. For this you need a decent hash function which, for practical purposes, creates a unique hash value for a given document. Then you store just the hash value for each processed document in a set, and each time you encounter a new document whose hash value is already in the set, you will know that it is a repeated document and can handle it accordingly (e.g. ignore it).
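A sketch of such a check using Python's hashlib, hashing the raw file bytes (the names are illustrative; if several parser threads share the set, guard it with a threading.Lock):

    import hashlib

    seen_hashes = set()   # one small digest per processed document

    def is_repeated(path):
        with open(path, 'rb') as f:
            digest = hashlib.sha256(f.read()).hexdigest()
        if digest in seen_hashes:
            return True               # same bytes seen before: repeated document
        seen_hashes.add(digest)
        return False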

While it might be impossible to keep 7000 documents of 9MB size in memory at the same time, it is easily possible to keep 7000 hash values in memory at the same time.
