xmltodict.unparse is not handling CDATA properly

I am trying to use xmltodict to manipulate XML content as a Python object, but I am having trouble handling CDATA properly. I think I am missing something somewhere. This is my code:

import xmltodict

data = """<node1>
    <node2 id='test'><![CDATA[test]]></node2>
    <node3 id='test'>test</node3>
</node1>"""

data = xmltodict.parse(data,force_cdata=True, encoding='utf-8')
print data

print xmltodict.unparse(data, pretty=True)  

And this is the output:

OrderedDict([(u'node1', OrderedDict([(u'node2', OrderedDict([(u'@id', u'test'), ('#text', u'test')])), (u'node3', OrderedDict([(u'@id', u'test'), ('#text', u'test')]))]))])
<?xml version="1.0" encoding="utf-8"?>
<node1>
        <node2 id="test">test</node2>
        <node3 id="test">test</node3>
</node1>

We can see here that the CDATA section is missing from the generated node2, so node2 comes out identical to node3. In the input, however, the two nodes are different.
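One possible explanation, sketched here with the standard library only (xml.sax, not xmltodict itself): a SAX-style parser hands CDATA content and ordinary text to the same characters() callback, so a handler that does not track CDATA boundaries cannot tell the two apart afterwards. The TextCollector class and the whitespace-free sample document are made up for this illustration.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Collect the character data seen inside each element."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the currently open elements
        self.texts = {}   # element name -> concatenated character data

    def startElement(self, name, attrs):
        self.stack.append(name)
        self.texts[name] = ""

    def characters(self, content):
        # Called for plain text and for CDATA content alike.
        if self.stack:
            self.texts[self.stack[-1]] += content

    def endElement(self, name):
        self.stack.pop()


doc = (b"<node1><node2 id='test'><![CDATA[test]]></node2>"
       b"<node3 id='test'>test</node3></node1>")

collector = TextCollector()
xml.sax.parseString(doc, collector)
print(collector.texts)
```

Both node2 and node3 end up with the same text, which matches the identical '#text' entries in the parsed dictionary above.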

Regards

I finally managed to get it working with the monkey-patch below. I am still not very happy with it: it really is a hack, and this feature should be supported properly somewhere:

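import xml.sax.saxutils
import xmltodict

# Keep a reference to the original escape() so the patched version can delegate to it.
escape_orig = xml.sax.saxutils.escape

def escape_hacked(data, entities={}):
    # Text that starts with '<' and ends with '>' is wrapped in CDATA instead of being escaped.
    if data[0] == '<' and data.strip()[-1] == '>':
        return '<![CDATA[%s]]>' % data

    return escape_orig(data, entities)


xml.sax.saxutils.escape = escape_hacked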

Then run your Python code as usual:

data = """<node1>
    <node2 id='test'><![CDATA[test]]></node2>
    <node3 id='test'>test</node3>
</node1>"""

data = xmltodict.parse(data,force_cdata=True, encoding='utf-8')
print data

print xmltodict.unparse(data, pretty=True) 

To explain: the following lines check whether the text looks like markup (it starts with '<' and ends with '>'), and if so wrap it in a CDATA section instead of escaping it. This is only a heuristic, not a real XML validity check:

    if data[0] == '<' and  data.strip()[-1] == '>':
        return '<![CDATA[%s]]>' % data
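To make the heuristic concrete, here is a small standalone sketch that applies the same test outside of any monkey-patch. The function name escape_or_cdata is made up for this illustration, and startswith/endswith are used as an equivalent of the index checks so that an empty string does not raise.

```python
from xml.sax.saxutils import escape


def escape_or_cdata(data):
    # Same idea as the patch: text that looks like markup goes into a
    # CDATA section, everything else is escaped as usual.
    if data.startswith('<') and data.rstrip().endswith('>'):
        return '<![CDATA[%s]]>' % data
    return escape(data)


print(escape_or_cdata("test"))
print(escape_or_cdata("<b>bold</b>"))
print(escape_or_cdata("a < b"))
```

Note that plain text such as "test" is left alone, so only values that look like markup are written as CDATA.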

Regards

I want to clarify that there is no officially supported way to keep the CDATA section.

You can check the issue here.

Given that, you have to do it yourself. There are two approaches:

  1. Subclassing XMLGenerator
import xmltodict

class XMLGenerator(xmltodict.XMLGenerator):
    def characters(self, content):
        if content:
            self._finish_pending_start_element()
            if not isinstance(content, str):
                content = str(content, self._encoding)
            self._write('<![CDATA[' + content + ']]>')

xmltodict.XMLGenerator = XMLGenerator

This is not a hack in the sense that it changes no xmltodict behaviour other than unparse(). More importantly, it does not pollute the standard-library xml package.

out_xml = xmltodict.unparse(data)

From now on, every node that has character data is written as a CDATA section, including nodes whose text was not CDATA in the input.
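One caveat with wrapping arbitrary text: the sequence "]]>" cannot appear inside a CDATA section. The sketch below is a variation on the subclass above that splits such text across two sections. It uses the standard-library XMLGenerator directly rather than going through xmltodict, the names to_cdata and CDataGenerator are made up for this example, and it relies on the same private helpers (_finish_pending_start_element, _write) as the code above, which are CPython implementation details.

```python
import io
from xml.sax.saxutils import XMLGenerator


def to_cdata(text):
    # "]]>" would end the section early, so close the section after "]]"
    # and reopen a new one before ">".
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


class CDataGenerator(XMLGenerator):
    def characters(self, content):
        if content:
            self._finish_pending_start_element()
            if not isinstance(content, str):
                content = str(content, self._encoding)
            self._write(to_cdata(content))


out = io.StringIO()
gen = CDataGenerator(out, "utf-8")
gen.startDocument()
gen.startElement("node2", {"id": "test"})
gen.characters("a ]]> b")
gen.endElement("node2")
gen.endDocument()
print(out.getvalue())
```

Parsing the result back should give the original text "a ]]> b", spread over two adjacent CDATA sections.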

  2. Unescaping the escaped XML
from xml.sax.saxutils import unescape

START_CDATA = '<![CDATA['
END_CDATA = ']]>'

def preprocessor(key, value):
    if key in KEEP_CDATA_SECTION:
        if isinstance(value, dict) and '#text' in value:
            value['#text'] = START_CDATA + value['#text'] + END_CDATA
        else:
            value = START_CDATA + value + END_CDATA
    return key, value

Now you can specify which nodes should be wrapped in a CDATA section:

KEEP_CDATA_SECTION = ['node2']

out_xml = xmltodict.unparse(data, preprocessor=preprocessor)
out_xml = unescape(out_xml)

However, do not use this on untrusted data: unescape() runs over the whole document, so it unescapes not only the character data you wrapped but also the nodes' attribute values.
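A short standard-library sketch of what can go wrong: once an escaped attribute value is unescaped, the document may no longer be well-formed. The sample attribute value and the helper is_well_formed are made up for this illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError
from xml.sax.saxutils import quoteattr, unescape


def is_well_formed(document):
    # True if the parser accepts the document, False on a syntax error.
    try:
        minidom.parseString(document)
        return True
    except ExpatError:
        return False


# An attribute value that must stay escaped inside the document.
attr_value = "a<b&c"
escaped_doc = "<node1 id=%s>text</node1>" % quoteattr(attr_value)
unescaped_doc = unescape(escaped_doc)

print(escaped_doc, is_well_formed(escaped_doc))
print(unescaped_doc, is_well_formed(unescaped_doc))
```

The escaped document parses, while the unescaped one contains a raw '<' inside the attribute value and is rejected by the parser.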
