
CDATA getting stripped in lxml even after using strip_cdata=False

I need to read an XML file and replace a string with a certain value. The XML contains a CDATA section that I need to preserve. I have tried creating a parser with strip_cdata set to False, but the CDATA section is still lost in the output. How can I keep it?

import lxml.etree as ET

# Keep CDATA sections in the parsed tree instead of converting them to text
parser1 = ET.XMLParser(strip_cdata=False)

with open('testxml.xml', encoding="utf8") as f:
    tree = ET.parse(f, parser=parser1)

root = tree.getroot()
# iter() replaces the deprecated getiterator()
for elem in root.iter():
    try:
        elem.text = elem.text.replace('Bundled Manager 2.2(8b)', '123456')
    except AttributeError:
        # elem.text is None for elements without text
        pass

tree.write('output_new8.xml', xml_declaration=True, method='xml',  encoding="utf8")

Below is the sample XML:


     <?xml version="1.0" encoding="UTF-8" standalone="no"?><!-- Copyright   (c) 2015 Moto Company, LLC. All rights reserved. Moto Confidential/Proprietary Information -->
<Benchmark>
       <status date="2013-03-11">draft</status>
    <title>Logitech TMM block(TM) System 300 Release Certification Matrix</title>
    <description>Random discription</description>
    <version time="2013-03-05T15:20:20.995-04:00" update="">3.0.0-2017.03.00</version>
        <model system="urn:xccdf:scoring:default"/>
    <Profile id="xccdf_com.Moto_profile_release_4.0.21">
        <status date="2016-03-30">draft</status>
        <title>RCM 4.0.21</title>
        <description><![CDATA[<p>Moto Vblock System 300 Release 4.0.21</p>
<ul><li> TMM VNX OE for File was updated to 7.1.79.8.</li>
</ul>]]>
</description>
        <set-value idref="xccdf_com.Moto_value_vision_content_version">3.0.0-2015.07.00</set-value>
        <set-value idref="xccdf_com.Moto_value_vision_version">3.0.0</set-value>
        <set-value idref="xccdf_com.Moto_value_vplex_version">5.3.0.03.00.04</set-value>
        <set-value idref="xccdf_com.Moto_value_powerpath_version">Bundled Manager 2.2(8b)</set-value>       
        <select idref="xccdf_com.Moto_rule_vnx_version" selected="true"/>
        <select idref="xccdf_com.Moto_rule_vplex_version" selected="true"/>
    </Profile>
</Benchmark>

The output of the code is shown below:

<?xml version='1.0' encoding='UTF8'?>
<!-- Copyright (c) 2015 Moto Company, LLC. All rights reserved. Moto Confidential/Proprietary Information --><Benchmark>
    <status date="2013-03-11">draft</status>
    <title>Logitech TMM block(TM) System 300 Release Certification Matrix</title>
    <description>Random discription</description>
    <version time="2013-03-05T15:20:20.995-04:00" update="">3.0.0-2017.03.00</version>
        <model system="urn:xccdf:scoring:default"/>
    <Profile id="xccdf_com.Moto_profile_release_4.0.21">
        <status date="2016-03-30">draft</status>
        <title>RCM 4.0.21</title>
        <description>&lt;p&gt;Moto Vblock System 300 Release 4.0.21&lt;/p&gt;
&lt;ul&gt;&lt;li&gt; TMM VNX OE for File was updated to 7.1.79.8.&lt;/li&gt;
&lt;/ul&gt;
</description>
        <set-value idref="xccdf_com.Moto_value_vision_content_version">3.0.0-2015.07.00</set-value>
        <set-value idref="xccdf_com.Moto_value_vision_version">3.0.0</set-value>
        <set-value idref="xccdf_com.Moto_value_vplex_version">5.3.0.03.00.04</set-value>
        <set-value idref="xccdf_com.Moto_value_powerpath_version">123456</set-value>        
        <select idref="xccdf_com.Moto_rule_vnx_version" selected="true"/>
        <select idref="xccdf_com.Moto_rule_vplex_version" selected="true"/>
    </Profile>
</Benchmark>

As you can see, the CDATA section has been stripped and its content escaped. Any help would be appreciated.

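This happens because your loop executes

elem.text = elem.text.replace('Bundled Manager 2.2(8b)', '123456')

for every element, including those where nothing is replaced. Assigning a plain string to .text replaces the CDATA section with a normal text node.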

The lxml documentation states:

Note how the .text property does not give any indication that the text content is wrapped by a CDATA section. If you want to make sure your data is wrapped by a CDATA block, you can use the CDATA() text wrapper.
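To make the documentation's point concrete, here is a minimal sketch. The one-element document is made up for illustration and is not the asker's file. It shows that .text returns a plain str, that a plain assignment drops the CDATA section, and that the CDATA() wrapper restores it:

```python
import lxml.etree as ET

# Hypothetical one-element document, used only for illustration
parser = ET.XMLParser(strip_cdata=False)
root = ET.fromstring("<root><![CDATA[<b>bold</b>]]></root>", parser)

text_value = root.text                        # plain str, no hint of CDATA
kept = ET.tostring(root, encoding="unicode")  # untouched: CDATA preserved

root.text = root.text                         # plain assignment: CDATA is lost
escaped = ET.tostring(root, encoding="unicode")

root.text = ET.CDATA(text_value)              # explicit wrapper: CDATA again
wrapped = ET.tostring(root, encoding="unicode")

print(kept)
print(escaped)
print(wrapped)
```

Even assigning an element's own text back to itself is enough to lose the CDATA section, which is what happens in the question's loop.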

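Therefore, to keep the CDATA section, assign to elem.text only for elements you actually modify, so that untouched elements keep their CDATA. If the modified text should itself be written as CDATA, wrap it with ET.CDATA(). Also check that elem.text is not None before searching it:

if elem.text and 'Bundled Manager 2.2(8b)' in elem.text:
    elem.text = ET.CDATA(elem.text.replace('Bundled Manager 2.2(8b)', '123456'))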

Because of how the ElementTree API works (all text and CDATA content is concatenated and exposed as a single str in the .text property), it is not really possible to tell whether CDATA was originally used. (See "Figuring out where CDATA is in lxml element?" and the lxml source code.)
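Putting the pieces together, here is a sketch of the whole round trip on a trimmed-down version of the sample. The SAMPLE string and the replace_text helper are illustrative names, not part of lxml. It assigns a plain string only where the target appears, so the untouched description keeps its CDATA and the replaced value stays ordinary text:

```python
import lxml.etree as ET

# Shortened stand-in for the sample file (an assumption for this sketch)
SAMPLE = b"""<Benchmark>
  <description><![CDATA[<p>Release notes</p>]]></description>
  <set-value idref="powerpath_version">Bundled Manager 2.2(8b)</set-value>
</Benchmark>"""


def replace_text(root, old, new):
    """Replace old with new only in elements whose text contains old."""
    for elem in root.iter():
        # elem.text is None for empty elements; leave those and any
        # element without a match completely untouched
        if elem.text and old in elem.text:
            elem.text = elem.text.replace(old, new)


parser = ET.XMLParser(strip_cdata=False)
root = ET.fromstring(SAMPLE, parser)
replace_text(root, "Bundled Manager 2.2(8b)", "123456")

result = ET.tostring(root, encoding="unicode")
print(result)
```

If the target string can occur inside a CDATA section, that element still needs the ET.CDATA() wrapper on assignment, since the original wrapping cannot be detected from .text.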
