Python lxml: Ignore XML declaration (errors)
I am trying to parse the custom actions file of the Thunar file browser (~/.config/Thunar/uca.xml) with the lxml Python module.
For some reason, Thunar apparently writes a malformed XML declaration into this file:
<?xml encoding="UTF-8" version="1.0"?>
The version is expected to appear as the first "attribute" in the declaration, so lxml raises an XMLSyntaxError when I try to parse the file.
And no, I cannot simply correct the declaration, because Thunar keeps overwriting it with the bogus one.
This is very likely a bug in Thunar.
Nevertheless, I would like to know how to ignore the XML declaration with lxml.
I know that I could pre-process the XML document to filter out the XML declaration, but this doesn't seem very elegant. Since XML seems to default to version 1.0 and UTF-8 encoding, there is surely a way to make lxml ignore the declaration and assume those defaults. I didn't find anything in the documentation or on Google, but I might have overlooked something.
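For reference, the pre-processing route mentioned above could look roughly like the following. This is only a minimal sketch: the helper names strip_xml_declaration and parse_without_declaration are made up for illustration, and it assumes the file is UTF-8 (or another ASCII-compatible encoding), so the declaration can be matched on raw bytes.

```python
import re
from lxml import etree

# Matches an XML declaration at the very start of the data.
# "\s" after "xml" avoids matching other processing instructions
# such as <?xml-stylesheet ...?>.
_XML_DECL = re.compile(rb'\A\s*<\?xml\s.*?\?>', re.DOTALL)


def strip_xml_declaration(data: bytes) -> bytes:
    """Return data with a leading XML declaration removed, if present."""
    return _XML_DECL.sub(b'', data, count=1)


def parse_without_declaration(path):
    """Read the file, drop its declaration, and return the root element."""
    with open(path, 'rb') as f:
        data = f.read()
    # Without a declaration the parser falls back to the XML defaults.
    return etree.fromstring(strip_xml_declaration(data))
```

Calling parse_without_declaration on the expanded path of ~/.config/Thunar/uca.xml would then return the root element, at the cost of reading the whole file into memory first.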
I know very little about Thunar, but if it produces the XML declaration shown in the question, then that is a bug. An incorrect XML declaration makes the document ill-formed.
The XML grammar specifies one correct order for the items in the XML declaration: version must come first and encoding second. See http://w3.org/TR/xml/#NT-XMLDecl.
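To make the ordering rule concrete, the XMLDecl production can be approximated with a regular expression. The sketch below is an illustration only, not what lxml or libxml2 does internally, and the name is_well_formed_declaration is hypothetical. It follows the XML 1.0 productions for VersionInfo, EncodingDecl and SDDecl.

```python
import re

S = r'[ \t\r\n]+'                   # mandatory white space
EQ = r'[ \t\r\n]*=[ \t\r\n]*'       # "=" with optional white space


def quoted(body):
    """Allow the value in either double or single quotes."""
    return '(?:"' + body + '"|' + "'" + body + "')"


# '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
XML_DECL = re.compile(
    r'<\?xml'
    + S + 'version' + EQ + quoted(r'1\.[0-9]+')
    + '(?:' + S + 'encoding' + EQ + quoted(r'[A-Za-z][A-Za-z0-9._-]*') + ')?'
    + '(?:' + S + 'standalone' + EQ + quoted(r'(?:yes|no)') + ')?'
    + r'[ \t\r\n]*\?>'
)


def is_well_formed_declaration(decl: str) -> bool:
    """Check a declaration string against the approximate grammar."""
    return XML_DECL.fullmatch(decl) is not None


print(is_well_formed_declaration('<?xml version="1.0" encoding="UTF-8"?>'))
print(is_well_formed_declaration('<?xml encoding="UTF-8" version="1.0"?>'))
```

The first call accepts the usual declaration. The second rejects the one Thunar writes, because the pattern requires version directly after the opening "<?xml".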
However, with lxml you can parse using a parser instance that has the recover option set to True. It works in this case: the bad XML declaration is ignored.
from lxml import etree
parser = etree.XMLParser(recover=True)
tree = etree.parse('uca.xml', parser)
See http://lxml.de/api/lxml.etree.XMLParser-class.html
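Because recover=True makes the parser continue past any problem it finds, not just a bad declaration, it can be worth looking at what was skipped. The sketch below parses an in-memory sample and collects the parser's error_log afterwards. BAD_XML is made-up sample data that mimics the declaration from the question, and parse_leniently is a hypothetical helper name.

```python
from lxml import etree

# Hypothetical sample data with the same out-of-order declaration.
BAD_XML = (
    b'<?xml encoding="UTF-8" version="1.0"?>\n'
    b'<actions><action><name>Example</name></action></actions>'
)


def parse_leniently(data):
    """Parse with recover=True; return the root and the logged problems."""
    parser = etree.XMLParser(recover=True)
    root = etree.fromstring(data, parser)
    # error_log holds the messages from this parser's last run.
    return root, list(parser.error_log)


root, problems = parse_leniently(BAD_XML)
for entry in problems:
    print(f'line {entry.line}: {entry.message}')
```

If the log contains anything other than complaints about the declaration, the file probably has further damage that the recovering parser has silently worked around.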