Remove CDATA from XML

Question

I am working with a SOAP API using python-suds.

The API returns a result, which suds parses according to the WSDL. The result contains an XML data field:

(MyServiceResult){
    errorMsg = "Error Message here..."
    sessionId = "..."
    outputDataXML = "<![CDATA[<Results>.....<Details>....</Details></Results>]]>"
    errorCode = "00"
 }

So I planned to use xml.etree.ElementTree to parse the XML in outputDataXML. But because the returned data starts with <![CDATA[, the XML parser fails with:

ParseError: syntax error: line 1, column 0

What is the best approach for such a situation, other than using a regex?

Answer 1

Wrap the data in a dummy root element so that the CDATA section becomes legal XML, and call ET.fromstring once to extract the text from the CDATA. Then call ET.fromstring a second time to parse that text as XML:

import xml.etree.ElementTree as ET

d = '<![CDATA[<Results>.....<Details>....</Details></Results>]]>'
fix = '<root>{}</root>'.format(d)

content = ET.fromstring(fix).text
print(repr(content))
# '<Results>.....<Details>....</Details></Results>'

results = ET.fromstring(content)
print(ET.tostring(results, encoding='unicode'))  # on Python 3, returns str instead of bytes
# <Results>.....<Details>....</Details></Results>
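The two-step approach can be wrapped in a small helper so that it also tolerates payloads that arrive without the CDATA wrapper. This is only a sketch: the function name parse_wrapped_xml is hypothetical, and it assumes the payload holds a single XML document with no XML declaration (a declaration inside the dummy root would not parse).

```python
import xml.etree.ElementTree as ET


def parse_wrapped_xml(payload):
    """Parse an XML string that may or may not be wrapped in a CDATA section.

    Hypothetical helper; assumes `payload` holds a single XML document
    without an XML declaration.
    """
    # A dummy root element makes a bare CDATA section legal XML.
    wrapper = ET.fromstring('<root>{}</root>'.format(payload))

    if len(wrapper):
        # The payload was plain XML, so it was already parsed as a child.
        return wrapper[0]

    # The payload was CDATA: its content is the wrapper's text, parse it again.
    return ET.fromstring(wrapper.text)


results = parse_wrapped_xml('<![CDATA[<Results><Details>x</Details></Results>]]>')
print(results.tag)
```

Calling it on the outputDataXML field from the suds result should then give an Element for Results either way.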

Answer 2

When reading all kinds of oddly formatted XML-like data, you can always use BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> d="<![CDATA[<Results>.....<Details>....</Details></Results>]]>"
>>> soup=BeautifulSoup(d)
>>> from xml.etree import ElementTree
>>> tree=ElementTree.fromstring(str(soup))

Otherwise, you can make a quick hack like this:

tree = ElementTree.fromstring(outputDataXML.replace("<![CDATA[", "").replace("]]>", ""))
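One caveat with the replace-based hack is that it also removes CDATA markers inside the document, which can leave unescaped characters such as < in the text. A slightly more careful variant, sketched below, strips the markers only when they wrap the whole string. The name strip_cdata_wrapper and the two constants are hypothetical, not part of any library.

```python
import xml.etree.ElementTree as ET

CDATA_START = "<![CDATA["
CDATA_END = "]]>"


def strip_cdata_wrapper(text):
    """Remove one outer CDATA wrapper, leaving any inner CDATA untouched.

    Hypothetical helper; returns `text` unchanged if it is not wrapped.
    """
    stripped = text.strip()
    if stripped.startswith(CDATA_START) and stripped.endswith(CDATA_END):
        # Slice off only the leading and trailing markers.
        return stripped[len(CDATA_START):-len(CDATA_END)]
    return text


wrapped = "<![CDATA[<Results><Details>x</Details></Results>]]>"
tree = ET.fromstring(strip_cdata_wrapper(wrapped))
print(tree.find("Details").text)
```

A payload that is not wrapped, but uses CDATA internally, passes through unchanged and still parses, whereas the blanket replace would break it.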

2 answers

solution1
3 ACCEPTED 2014-08-12 12:23:06

solution2
1 2014-08-12 11:46:20