
Remove CDATA from XML

I am working with a SOAP API using python-suds.

The API returns a result, which suds parses according to the WSDL. The result has a field containing XML data:

(MyServiceResult){
    errorMsg = "Error Message here..."
    sessionId = "..."
    outputDataXML = "<![CDATA[<Results>.....<Details>....</Details></Results>]]>"
    errorCode = "00"
 }

So I planned to use xml.etree.ElementTree to parse the XML in outputDataXML. But because the returned data starts with <![CDATA[, the XML parser fails with:

ParseError: syntax error: line 1, column 0

What is the best approach for such a situation, other than using a regex?

Wrap the string in a root element, call ET.fromstring once to extract the text from the CDATA section, then call ET.fromstring a second time to parse that text as XML:

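import xml.etree.ElementTree as ET

d = '<![CDATA[<Results>.....<Details>....</Details></Results>]]>'
# A CDATA section is only valid inside an element, so wrap it in one.
fix = '<root>{}</root>'.format(d)

# First parse: the CDATA content comes back as the text of <root>.
content = ET.fromstring(fix).text
print(repr(content))
# '<Results>.....<Details>....</Details></Results>'

# Second parse: treat that text as an XML document in its own right.
results = ET.fromstring(content)
# encoding='unicode' returns str rather than bytes on Python 3.
print(ET.tostring(results, encoding='unicode'))
# <Results>.....<Details>....</Details></Results>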
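For the suds case, the two parsing steps can be folded into one small helper. This is only a sketch: the name parse_cdata_payload and the shortened sample payload are made up for illustration, and it assumes the field holds exactly one CDATA section and nothing else.

```python
import xml.etree.ElementTree as ET


def parse_cdata_payload(payload):
    """Parse an XML document that arrives wrapped in a CDATA section.

    Hypothetical helper, not part of suds or ElementTree.
    """
    # A CDATA section is only legal inside an element, so give it a parent.
    wrapper = ET.fromstring('<root>{}</root>'.format(payload))
    # The parser returns the CDATA content as the element's character data.
    inner_xml = wrapper.text
    if inner_xml is None:
        raise ValueError('payload contained no character data')
    # Parse the recovered text as its own XML document.
    return ET.fromstring(inner_xml)


# Shortened stand-in for the outputDataXML field of the suds result.
payload = '<![CDATA[<Results><Details>ok</Details></Results>]]>'
results = parse_cdata_payload(payload)
print(results.tag)                   # Results
print(results.find('Details').text)  # ok
```

With the real response you would pass the outputDataXML attribute of the result object instead of the sample string.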

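When reading oddly formatted XML-like data, you can try BeautifulSoup. How it treats a CDATA section depends on the parser it uses, so check that str(soup) no longer contains the CDATA markers before passing it on: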

>>> from xml.etree import ElementTree
>>> from bs4 import BeautifulSoup
>>> d = "<![CDATA[<Results>.....<Details>....</Details></Results>]]>"
>>> soup = BeautifulSoup(d)
>>> tree = ElementTree.fromstring(str(soup))

Otherwise, a quick hack is to strip the CDATA markers with str.replace:

tree = ElementTree.fromstring(outputDataXML.replace("<![CDATA[", "").replace("]]>", ""))
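A slightly more cautious variant of the same idea removes the markers only when they enclose the whole string, so input that is not wrapped passes through unchanged. This is a sketch, not a library function: strip_cdata_wrapper, CDATA_START and CDATA_END are hypothetical names, and the sample payload is shortened for illustration.

```python
import xml.etree.ElementTree as ET

CDATA_START = '<![CDATA['
CDATA_END = ']]>'


def strip_cdata_wrapper(text):
    """Remove one enclosing CDATA wrapper; return other input unchanged."""
    stripped = text.strip()
    if stripped.startswith(CDATA_START) and stripped.endswith(CDATA_END):
        # Slice off the markers at both ends, keeping everything in between.
        return stripped[len(CDATA_START):-len(CDATA_END)]
    return text


raw = '<![CDATA[<Results><Details>ok</Details></Results>]]>'
inner = strip_cdata_wrapper(raw)
tree = ET.fromstring(inner)
print(tree.find('Details').text)  # ok
```

Unlike chained replace calls, this touches only the start and end of the string, and it uses no regex.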
