How to make lxml's iterparse ignore invalid XML characters?
I have an XML file that contains invalid characters. lxml's XMLParser raises an exception on these characters, but when I create the XMLParser with the recover=True option, it ignores the bad characters and parses the file without error.
My question is: how can I set a similar flag for lxml's iterparse function?
Reproduction:
The broken XML (/tmp/z.xml):
<?xml version="1.0" encoding="utf-8"?>
<items>
<item>
<B>Bad characters:</B>
</item>
</items>
NOTE: There are two ASCII control characters #31 (0x1F) after the "Bad characters:" string, which I could not copy-paste here.
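Since the two 0x1F bytes cannot be pasted, a short script can recreate the file instead. This is only a sketch: the tab indentation is inferred from the tostring output and the "line 4, column 21" position reported in the errors, and the path is built from the system temp directory (which is /tmp on most Linux systems).

```python
import os
import tempfile

# The XML from the question, with the two 0x1F bytes written explicitly.
# Tab indentation is an assumption inferred from the reported output.
BROKEN_XML = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<items>\n'
    b'\t<item>\n'
    b'\t\t<B>Bad characters:\x1f\x1f</B>\n'
    b'\t</item>\n'
    b'</items>\n'
)

# Usually /tmp/z.xml on Linux; adjust the path if needed.
path = os.path.join(tempfile.gettempdir(), "z.xml")

# Write in binary mode so the control characters are stored unchanged.
with open(path, "wb") as fd:
    fd.write(BROKEN_XML)

# Read the file back to confirm the bad bytes are present.
with open(path, "rb") as fd:
    data = fd.read()
print(data.count(b"\x1f"))
```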
The parsing error of XMLParser:
import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser()
tree = etree.parse(fd, parser)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "lxml.etree.pyx", line 2576, in lxml.etree.parse (src/lxml/lxml.etree.c:22796)
File "parser.pxi", line 1488, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:60390)
File "parser.pxi", line 1518, in lxml.etree._parseFilelikeDocument (src/lxml/lxml.etree.c:60687)
File "parser.pxi", line 1401, in lxml.etree._parseDocFromFilelike (src/lxml/lxml.etree.c:59658)
File "parser.pxi", line 991, in lxml.etree._BaseParser._parseDocFromFilelike (src/lxml/lxml.etree.c:57303)
File "parser.pxi", line 538, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:53512)
File "parser.pxi", line 624, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:54372)
File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21
To ignore the bad characters I set recover=True and it works OK:
import lxml.etree as etree
fd = open('/tmp/z.xml')
parser = etree.XMLParser(recover=True)
tree = etree.parse(fd, parser)
etree.tostring(tree)
# OUTPUT:
'<items>\n\t<item>\n\t\t<B>Bad characters:</B>\n\t</item>\n</items>'
With iterparse I get the same error again. How can I make it ignore the bad characters?
fd = open('/tmp/z.xml')
it = etree.iterparse(fd, events=("start", "end"))
for e in it: print(e)
...
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "iterparse.pxi", line 498, in lxml.etree.iterparse.__next__ (src/lxml/lxml.etree.c:73245)
File "parser.pxi", line 564, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:53770)
lxml.etree.XMLSyntaxError: PCDATA invalid Char value 31, line 4, column 21
iterparse also accepts the recover argument:
it = etree.iterparse(fd, events=("start", "end"), recover=True)
(Documentation: lxml iterparse)
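For a self-contained check, here is a minimal sketch that feeds the broken document to iterparse from memory instead of /tmp/z.xml. The helper name iter_events and the in-memory BROKEN_XML bytes are illustrative assumptions, not part of the original post. The exact recovery behaviour may vary with the installed lxml and libxml2 versions.

```python
import io

import lxml.etree as etree

# In-memory copy of the broken document; \x1f is the invalid character.
BROKEN_XML = (
    b'<?xml version="1.0" encoding="utf-8"?>\n'
    b'<items>\n'
    b'\t<item>\n'
    b'\t\t<B>Bad characters:\x1f\x1f</B>\n'
    b'\t</item>\n'
    b'</items>\n'
)


def iter_events(xml_bytes):
    """Collect (event, tag) pairs, letting the parser recover from errors."""
    source = io.BytesIO(xml_bytes)
    collected = []
    # recover=True asks the parser to continue past recoverable errors
    # instead of raising XMLSyntaxError.
    for event, elem in etree.iterparse(
        source, events=("start", "end"), recover=True
    ):
        collected.append((event, elem.tag))
    return collected


events = iter_events(BROKEN_XML)
for event, tag in events:
    print(event, tag)
```

Note that recover=True makes the parser skip whatever it cannot handle, so other kinds of damage in the input can also pass silently. If that matters, compare the parsed result against what you expect.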