Parsing a large XML file with the 'xmltodict' module results in OverflowError

Question

I have a fairly large XML file, about 3 GB, that I want to parse in streaming mode using the 'xmltodict' utility. My code iterates through each item, builds a dictionary entry, and appends it to an in-memory list that is eventually dumped as JSON to a file.

The following works perfectly on a small XML data set:

    import xmltodict, json
    import io

    output = []

    def handle(path, item):
       #do stuff
       return

    doc_file = open("affiliate_partner_feeds.xml","r")
    doc = doc_file.read()        
    xmltodict.parse(doc, item_depth=2, item_callback=handle)

    f = open('jbtest.json', 'w')
    json.dump(output,f)

On a large file, I get the following:

Traceback (most recent call last):
  File "jbparser.py", line 125, in <module>
    xmltodict.parse(doc, item_depth=2, item_callback=handle)
  File "/usr/lib/python2.7/site-packages/xmltodict.py", line 248, in parse
    parser.Parse(xml_input, True)
OverflowError: size does not fit in an int

The exact location of the exception inside xmltodict.py is:

def parse(xml_input, encoding=None, expat=expat, process_namespaces=False,
          namespace_separator=':', **kwargs):

        handler = _DictSAXHandler(namespace_separator=namespace_separator,
                                  **kwargs)
        if isinstance(xml_input, _unicode):
            if not encoding:
                encoding = 'utf-8'
            xml_input = xml_input.encode(encoding)
        if not process_namespaces:
            namespace_separator = None
        parser = expat.ParserCreate(
            encoding,
            namespace_separator
        )
        try:
            parser.ordered_attributes = True
        except AttributeError:
            # Jython's expat does not support ordered_attributes
            pass
        parser.StartElementHandler = handler.startElement
        parser.EndElementHandler = handler.endElement
        parser.CharacterDataHandler = handler.characters
        parser.buffer_text = True
        try:
            parser.ParseFile(xml_input)
        except (TypeError, AttributeError):
            parser.Parse(xml_input, True)  # the OverflowError is raised here
        return handler.item

Is there any way to get around this? AFAIK, the xmlparser object is not exposed for me to play around with and change 'int' to 'long'. More importantly, what is really going on here? I would really appreciate any leads on this. Thanks!

Answer 1

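Pass an open file object (or a stream such as sys.stdin) to xmltodict.parse instead of reading the whole file into memory and parsing it as a single string. As the source quoted in the question shows, parse first tries parser.ParseFile(xml_input) and only falls back to parser.Parse(xml_input, True) when the input is not a file-like object; that fallback hands the entire 3 GB string to the parser in one call, which is where the OverflowError is raised. xmltodict can also be run as a command-line filter that emits each item in marshal format, which a second script can read back with marshal.load(sys.stdin).

Here is an example: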

>>> import xmltodict
>>> from gzip import GzipFile
>>> def handle_artist(_, artist):
...     print(artist['name'])
...     return True
>>> 
>>> xmltodict.parse(GzipFile('discogs_artists.xml.gz'),
...     item_depth=2, item_callback=handle_artist)
A Perfect Circle
Fantômas
King Crimson
Chris Potter
...

Reading marshalled items from STDIN:

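import sys, marshal
# Use the binary stream on Python 3; Python 2's sys.stdin has no .buffer.
stream = getattr(sys.stdin, 'buffer', sys.stdin)
while True:
    try:
        _, article = marshal.load(stream)
    except EOFError:  # no more items on the stream
        break
    print(article['title'])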
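
As for what is really going on: the traceback points at parser.Parse(xml_input, True), which receives the whole document in a single call, and the error message suggests that the size of that one buffer has to fit in a C int. Feeding the parser in small pieces avoids the limit. The sketch below shows the idea using only the standard library's expat bindings; parse_in_chunks, the chunk size, and the sample element names are made up for illustration and are not part of xmltodict.

```python
import io
import xml.parsers.expat


def parse_in_chunks(xml_file, on_start, chunk_size=64 * 1024):
    """Feed a binary file-like object to expat one chunk at a time."""
    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    while True:
        chunk = xml_file.read(chunk_size)
        if not chunk:
            break
        # Each call passes at most chunk_size bytes, far below the int limit.
        parser.Parse(chunk, False)
    # An empty final call tells expat that the document is complete.
    parser.Parse(b"", True)


# Tiny in-memory stand-in for the real feed; the element names are hypothetical.
sample = io.BytesIO(b"<feeds><feed id='1'/><feed id='2'/></feeds>")
seen = []
parse_in_chunks(sample, lambda name, attrs: seen.append(name), chunk_size=8)
print(seen)
```

With a real file you would open it in binary mode (open(path, "rb")) and pass the file object in. Handing that same file object to xmltodict.parse should have a similar effect, since the ParseFile branch shown in the question reads from the file rather than from one giant string.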

Solution 1
0 2016-02-13 10:39:34
