
Python lxml cannot handle XML element with over 1GB of text

The lxml library for Python can handle extremely large XML files, up to (and over?) 100 GB in size. But can it handle extremely large (1 billion characters or more) XML elements?

I'm tasked with converting binary files to text via base64 encoding, then inserting the encoded text into an element in an XML file. This produces extremely long strings.

A base64-encoded 252 MB file becomes a string about 345 million characters long (I'm rounding off here). A 1.5 GB file encodes to a string just over 2 billion characters long.
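For context, the encoding side can be sketched like this (file names here are hypothetical); base64 output is pure ASCII, which is why the text grows to roughly 4/3 of the binary size:

import base64
from lxml import etree as ET

# Hypothetical encoder sketch: read a binary file, base64-encode it, and
# store the result as the text of a single XML element.
with open("payload.bin", "rb") as f:             # hypothetical input file
    encoded = base64.b64encode(f.read()).decode("ascii")

root = ET.Element("root")
ET.SubElement(root, "file").text = encoded       # one enormous text node
ET.ElementTree(root).write("bigfile.xml")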

The encoding Python script can put a string of any size into the XML file. However, the decoding script cannot load the XML file if the element is over about 1.4 billion characters long.

The code where the script fails is:

from lxml import etree as ET

huge_parser = ET.XMLParser(huge_tree=True)  # huge_tree lifts libxml2's default text-size limits
inxml = ET.parse(xmlfile, huge_parser)      # fails here once the element is big enough

If the file element has about 1.3 billion characters, it runs properly. If the file element has about 1.6 billion characters, it fails with this error:

Traceback (most recent call last):
  File "C:\Users\user1\Desktop\xml_parser.py", line 40, in <module>
    inxml = ET.parse(xmlfile, huge_parser)
  File "src\lxml\etree.pyx", line 3536, in lxml.etree.parse
  File "src\lxml\parser.pxi", line 1875, in lxml.etree._parseDocument
  File "src\lxml\parser.pxi", line 1901, in lxml.etree._parseDocumentFromURL
  File "src\lxml\parser.pxi", line 1805, in lxml.etree._parseDocFromFile
  File "src\lxml\parser.pxi", line 1177, in lxml.etree._BaseParser._parseDocFromFile
  File "src\lxml\parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src\lxml\parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src\lxml\parser.pxi", line 654, in lxml.etree._raiseParseError
  File "bigfile.xml", line 14
lxml.etree.XMLSyntaxError: xmlSAX2Characters overflow prevented, line 14, column 1519383602
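For reference, the threshold can be reproduced with synthetic data instead of a real base64 payload; here's a minimal sketch (the chunked writing just avoids building the whole string in Python memory):

from lxml import etree as ET

def make_test_file(path, n_chars):
    # Write one element whose text is n_chars ASCII characters long.
    chunk = "A" * (64 * 1024 * 1024)
    with open(path, "w", encoding="ascii") as f:
        f.write("<root><file>")
        remaining = n_chars
        while remaining > 0:
            step = min(len(chunk), remaining)
            f.write(chunk[:step])
            remaining -= step
        f.write("</file></root>")

make_test_file("bigfile.xml", 1_600_000_000)   # ~1.6 billion characters
huge_parser = ET.XMLParser(huge_tree=True)
inxml = ET.parse("bigfile.xml", huge_parser)   # fails with the error above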

I don't think an iterator would help here, as the iterator would still attempt to load the entire massive element.
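To illustrate, this is roughly what the iterator approach would look like; iterparse only fires its "end" event after libxml2 has finished building the element, text included, so it should run into the same limit:

from lxml import etree as ET

# iterparse materializes each matched element, text node included, before
# the "end" event fires, so the huge text node still gets built.
for event, elem in ET.iterparse("bigfile.xml", tag="file", huge_tree=True):
    data = elem.text    # would be the full, multi-billion character string
    elem.clear()        # release the element once processed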

Should I try Pandas? Can it handle a dataframe of that size?

According to the libxml2 source, the error occurs while libxml2 is coalescing text chunks into a text node. The code in question (line 2539, or just search for "overflow prevented") is:

if ((size_t)ctxt->nodelen > SIZE_T_MAX - (size_t)len ||
    (size_t)ctxt->nodemem + (size_t)len > SIZE_T_MAX / 2) {
    xmlSAX2ErrMemory(ctxt, "xmlSAX2Characters overflow prevented");
    return;
}

I believe SIZE_T_MAX is equal to 0xffffffff, or 4,294,967,295 bytes. Since Python internally uses UCS2 (unless compiled in wide mode), that works out to just over 2 billion characters.

"the decoding script cannot load the XML file if the element is over about 1.4 billion characters long"

The check above looks at both the length of the incoming text chunk and the total memory already allocated for the node, so with allocation overhead it's quite possible to exceed the limit well before the text itself reaches 2 billion characters.
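As a back-of-the-envelope check under those assumptions, using the column number from the traceback as the text length at the point of failure:

SIZE_T_MAX = 0xFFFFFFFF              # 4,294,967,295, assuming a 32-bit size_t
cap = SIZE_T_MAX // 2                # 2,147,483,647: ceiling for nodemem + len
nodelen_at_failure = 1_519_383_602   # column reported in the traceback
headroom = cap - nodelen_at_failure  # ~628 million left before the cap,
                                     # easily consumed by nodemem over-allocation
print(cap, nodelen_at_failure, headroom)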
