
Parsing large XML file with lxml

I am trying to parse the dblp.xml file (3.2 GB) using lxml. My code is below.

from lxml import etree
from io import StringIO, BytesIO
tree = etree.parse("dblp.xml")

However, I get the following error:

OSError                                   Traceback (most recent call last)
<ipython-input-5-6a342013a160> in <module>
      1 from lxml import etree
      2 from io import StringIO, BytesIO
----> 3 tree = etree.parse("dblp.xml")

src/lxml/etree.pyx in lxml.etree.parse()

src/lxml/parser.pxi in lxml.etree._parseDocument()

src/lxml/parser.pxi in lxml.etree._parseDocumentFromURL()

src/lxml/parser.pxi in lxml.etree._parseDocFromFile()

src/lxml/parser.pxi in lxml.etree._BaseParser._parseDocFromFile()

src/lxml/parser.pxi in lxml.etree._ParserContext._handleParseResultDoc()

src/lxml/parser.pxi in lxml.etree._handleParseResult()

src/lxml/parser.pxi in lxml.etree._raiseParseError()

OSError: Error reading file 'dblp.xml': failed to load external entity "dblp.xml"

Both dblp.xml and dblp.dtd are already in the root folder.

Please help!

You can use etree.iterparse to avoid loading the whole file in memory:

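from lxml import etree

events = ("start", "end")
# Open in binary mode: lxml's iterparse expects bytes and detects the encoding itself
with open("dblp.xml", "rb") as fo:
    context = etree.iterparse(fo, events=events)
    for action, elem in context:
        # Do something with elem; action is "start" or "end"
        pass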

This lets you extract only the elements you need and ignore the rest.
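
As a rough illustration of that idea, here is a minimal sketch that counts the elements with a given tag and clears each one once it has been handled. The helper name count_records and the tiny in-memory sample are made up for this example, and the fallback to the standard library's ElementTree (whose iterparse has the same basic interface) is an assumption added only so the sketch also runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library used in this thread
except ImportError:
    # Assumption: stdlib fallback so the sketch runs without lxml
    import xml.etree.ElementTree as etree


def count_records(source, tag):
    """Stream through `source` and count the closed elements named `tag`.

    `count_records` is a hypothetical helper, not part of lxml.
    """
    count = 0
    # Only "end" events are requested, so each element is complete when seen.
    for _event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1
            # Drop the element's text and children once it has been handled.
            elem.clear()
    return count


# Tiny stand-in document; for the real file pass open("dblp.xml", "rb").
sample = io.BytesIO(
    b"<dblp>"
    b"<article><title>A</title></article>"
    b"<article><title>B</title></article>"
    b"<book><title>C</title></book>"
    b"</dblp>"
)
n_articles = count_records(sample, "article")
print(n_articles)
```

With lxml, the emptied elements stay attached to the root, so for a file the size of dblp.xml it may also be worth deleting already-processed siblings from their parent. That step is left out here to keep the sketch short.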

As Jan Jaap Meijerink suggested, you can try iterparse. You may also need to disable the lxml safety limits that prevent parsing very large documents by passing huge_tree=True (see the documentation at https://lxml.de/api/lxml.etree.XMLParser-class.html ):

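# The path is left empty, as in the original answer; binary mode because iterparse expects bytes
with open('', 'rb') as fobj:
    for event, elem in etree.iterparse(
        fobj,
        huge_tree=True,
    ):
        # do something with element or event
        pass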

Finally, if you prefer to keep using parse, you can define an XML parser with huge_tree enabled and set it as the default for subsequent calls to etree.parse:

xml_parser_settings = dict(
    huge_tree=True, # resolve_entities=False, remove_pis=True, no_network=True
)

# Unpack the dict into keyword arguments; XMLParser does not accept a settings dict
XMLPARSER = etree.XMLParser(**xml_parser_settings)
etree.set_default_parser(XMLPARSER)

After these statements, etree.parse will use the configured XMLPARSER. Beware of multithreading, though ( https://lxml.de/1.3/api/lxml.etree-module.html#set_default_parser ).

Adding the resolve_entities, remove_pis and no_network keywords may reduce (at least a bit) the risk of loading huge external files when the input comes from an untrusted source.
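
If changing the default parser is a concern, one alternative is to pass the configured parser directly to etree.parse through its parser argument. The following is only a sketch under that assumption: it requires lxml to be installed, and the small in-memory document stands in for dblp.xml.

```python
import io

from lxml import etree

# Parser configured once and passed explicitly; the default parser is untouched.
parser = etree.XMLParser(huge_tree=True)

# Small stand-in; for the real file: etree.parse("dblp.xml", parser=parser)
sample = io.BytesIO(b"<dblp><article><title>A</title></article></dblp>")
tree = etree.parse(sample, parser=parser)

root_tag = tree.getroot().tag
print(root_tag)
```

Note that etree.parse still builds the whole tree in memory, so for a multi-gigabyte file the iterparse approach is likely to remain the safer choice.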


 