Why are non-ASCII characters escaped in attribute values after writing an XML file with lxml?
I'm trying to incrementally build an XML file in Python with etree.xmlfile from lxml.
My input is an XML file with umlauts in its attribute values. I read it in with lxml, make some changes to the names of the attributes, and then write the result to a new file.
This is my code, stripped down:
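from lxml import etree

# Stream the transformed elements into a new file under a <corpus> root.
with etree.xmlfile(path_to_new_file, encoding="utf8") as xf:
    with xf.element("corpus"):
        for _, element in etree.iterparse(path_to_original_file, tag="comment"):
            new_element = transform_element(element)
            xf.write(new_element)
            # Drop the references once the element has been written.
            del element
            del new_element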
In the original file, I might have an element like this:
<comment title="Kübel">Some text with umlauts like this üä</comment>
But after processing, the same comment in the new file looks like this:
<comment title="K&#xFC;bel">Some text with umlauts like this üä</comment>
Do you have any idea what might cause this?
ü does not have to be escaped in an XML attribute value (or in a text node child of an element).
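As a small illustration of why the escaped output is still correct XML, the sketch below parses both spellings with the standard library's xml.etree.ElementTree and compares the resulting attribute values. It assumes the escaped form is a numeric character reference such as &#xFC;; the sample elements are made up for the demonstration and are not taken from the real files.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same element: one with a numeric character
# reference in the attribute value, one with the literal umlaut.
escaped = ET.fromstring('<comment title="K&#xFC;bel">text</comment>')
literal = ET.fromstring('<comment title="Kübel">text</comment>')

# After parsing, both attribute values are the same Python string.
print(escaped.get("title") == literal.get("title"))
```

Any conforming XML parser should treat the two forms the same way, so the escaping changes how the file looks but not what it means.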
The developer of the library was probably being overly cautious and called a generic string-escaping function, possibly to leverage its escaping of <, which always has to be escaped, and of ' or ", which have to be escaped when they match the quotation mark that delimits the attribute value.
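For comparison, here is a minimal sketch of attribute escaping that touches only the characters that need it, built on xml.sax.saxutils.escape from the standard library. The helper name escape_attr is hypothetical, it assumes the value will be wrapped in double quotes, and it is not how lxml does it internally. Besides < and the delimiter, it also escapes &, which must always be escaped.

```python
from xml.sax.saxutils import escape


def escape_attr(value: str) -> str:
    """Escape a value for use inside a double-quoted XML attribute.

    Hypothetical helper for illustration only.
    """
    # escape() replaces &, < and >; the extra mapping covers the
    # double quote, which is the assumed attribute delimiter.
    return escape(value, {'"': "&quot;"})


# Non-ASCII characters such as the umlaut pass through unchanged.
print(escape_attr('Kübel <"big"> & Co'))
```

The umlaut is left alone because, in a file encoded as UTF-8, it is an ordinary character with no special meaning to the parser.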
For a concise presentation of the precise escaping requirements, see Simplified XML Escaping.