
Python lxml cannot handle XML element with over 1GB of text

The lxml library for Python can handle extremely large XML files, up to (and over?) 100 GB in size. But can it handle extremely large (1 billion characters or more) XML elements?

I'm tasked with converting binary files to text via base64 encoding, then inserting the encoded text into an element in an XML file. This produces extremely long strings.

A base64-encoded 252 MB file becomes a string about 345 million characters long (I'm rounding off here). A 1.5 GB file encodes to a string just over 2 billion characters long.
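
For context, the encoding side works out to something like the minimal sketch below. The input name "payload.bin" and the root tag "container" are placeholders, not the actual script; the <file> element name matches the element described later, and "bigfile.xml" matches the file named in the traceback.

import base64
from lxml import etree as ET

# Read the binary file and base64-encode it (note: f.read() pulls the
# whole binary into memory, which is fine for a sketch).
with open("payload.bin", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("ascii")

# Insert the encoded text into a <file> element and write the XML out.
root = ET.Element("container")
ET.SubElement(root, "file").text = encoded
ET.ElementTree(root).write("bigfile.xml")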

The encoding Python script can put a string of any size into the XML file. However, the decoding script cannot load the XML file if the element is longer than about 1.4 billion characters.

The code where the script fails is:

from lxml import etree as ET

huge_parser = ET.XMLParser(huge_tree=True)  # huge_tree sets libxml2's XML_PARSE_HUGE option
inxml = ET.parse(xmlfile, huge_parser)

If the file element has about 1.3 billion characters, it runs properly. If the file element has about 1.6 billion characters, it fails with this error:

Traceback (most recent call last):
  File "C:\Users\user1\Desktop\xml_parser.py", line 40, in <module>
    inxml = ET.parse(xmlfile, huge_parser)
  File "src\lxml\etree.pyx", line 3536, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1875, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1901, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
  File "bigfile.xml", line 14
lxml.etree.XMLSyntaxError: xmlSAX2Characters overflow prevented, line 14, column 1519383602

I don't think an iterator would help here, as the iterator would still attempt to load the entire massive element.
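
For what it's worth, an iterparse version would look something like the sketch below (the tag name matches the <file> element above; the rest is a guess at the script). Since libxml2 still coalesces the element's text into a single text node before the "end" event fires, I'd expect it to hit the same limit.

from lxml import etree as ET

# Even with iterparse, the full text of each <file> element is built in
# memory before the "end" event is delivered, so the same libxml2
# overflow check should still trigger.
for event, elem in ET.iterparse("bigfile.xml", events=("end",),
                                tag="file", huge_tree=True):
    data = elem.text   # the entire multi-gigabyte string
    elem.clear()       # release the element once processed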

Should I try Pandas? Can it handle a dataframe of that size?

According to the libxml2 source, the error occurs while libxml2 is coalescing text chunks into a single text node. The code in question (in SAX2.c, around line 2539, or just search for "overflow prevented") is:

if ((size_t)ctxt->nodelen > SIZE_T_MAX - (size_t)len ||
    (size_t)ctxt->nodemem + (size_t)len > SIZE_T_MAX / 2) {
    xmlSAX2ErrMemory(ctxt, "xmlSAX2Characters overflow prevented");
    return;
}

I believe SIZE_T_MAX is equal to 0xffffffff, or 4,294,967,295 bytes. Since Python internally uses UCS-2 (unless compiled in wide mode), that comes to just over 2 billion characters.
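
A quick sanity check on those numbers, as plain Python arithmetic (assuming the 32-bit SIZE_T_MAX above):

SIZE_T_MAX = 0xFFFFFFFF                  # 4,294,967,295, per the assumption above

print(SIZE_T_MAX // 2)                   # 2147483647 -> cutoff for the nodemem check
print(1_519_383_602 < SIZE_T_MAX // 2)   # True: the failing column alone is under
                                         # the cutoff, so buffer overhead has to
                                         # make up the difference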

As for the observation that "the decoding script cannot load the XML file if the element is over about 1.4 billion characters long":

The code above checks both the string length and total node length, so it's quite possible, with other overhead, that the limit is exceeded.
