简体   繁体   中英

LXML kills my CDATA sections

I'm batch-converting a lot of XML files, changing their character encodings to UTF-8:

with open(source_filename, "rb") as source:
    tree = etree.parse(source)

    with open(destination_filename, "wb") as destination:
        tree.write(destination, encoding="UTF-8", xml_declaration=True)

Unfortunately, it is destroying my CDATA sections and just escaping them instead.

Source :

<d><![CDATA[áÌÀøÅàùÑÄéú ëÌÄé áÈàÅùÑ éäå''ä ðÄùÑÀôÌÈè <small><small>(ùí ëå èæ)</small></small>

Destination :

<d>בְּרֵאשִׁית כִּי בָאֵשׁ יהו''ה נִשְׁפָּט &lt;small&gt;&lt;small&gt;(שם כו טז)&lt;/small&gt;&lt;/small&gt;

Is there a setting which I can set which will tell it to leave my CDATA sections alone? I'm mainly using LXML to change the character encoding and to write the XML header properly.

Use the strip_cdata=False option :

import lxml.etree as etree
parser = etree.XMLParser(strip_cdata=False)
with open(source_filename, "rb") as source:
    tree = etree.parse(source, parser=parser)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM