lxml and CDATA and &

I have an XML document that contains a CDATA section, and inside that section there are tags whose URLs contain an ampersand. I am supposed to use lxml to read those tags, but I am getting an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src\lxml\lxml.etree.pyx", line 3228, in lxml.etree.fromstring (src\lxml\lxml.etree.c:79593)
  File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:119112)
  File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:117670)
  File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:111657)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105880)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:107588)
  File "src\lxml\parser.pxi", line 635, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:106442)
  File "<string>", line 9
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 9, column 98

How can I get past this error? Am I doing it right? Do we need to replace & with something?

The code is below:

from lxml import etree
# Must match the xmlns:ns URI declared in the response, or the XPath finds nothing.
ns0_NAMESPACE = "http://webservices.online.webapp.cl"
ns0 = "{%s}" % ns0_NAMESPACE
NSMAP = {'ns0':ns0_NAMESPACE}

response="""
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
    <soapenv:Body>
    <ns:OnlineGeneration2Response xmlns:ns="http://webservices.online.webapp.cl">
        <ns:return>
            <![CDATA[<EstadoDoc>
            <Estado>Ok<Estado>
            <RutEmisor>81201000-K</RutEmisor>
            <TipoDte>52</TipoDte>
            <FolioM>117620901</FolioM>
            <Folio>25022</Folio>
            <Glosa>NO INFORMADO</Glosa>
            <UrlDte>http://G500603svGLH:8080/Facturacion/XMLServlet?docId=&uR1v4VhQHvkPrUZDtY6hMg==</UrlDte>
            </EstadoDoc>
            <EstadoLote>
                <UrlPdf>http://G500603svGLH:8080/Facturacion/PDFServlet?docId=uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47</UrlPdf>
                <UrlCaratula>http://G500603svGLH:8080/Facturacion/XMLServlet?docId=&uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47</UrlCaratula>
            </EstadoLote>]]>
        </ns:return>
    </ns:OnlineGeneration2Response>
    </soapenv:Body>
</soapenv:Envelope>"""
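root = etree.fromstring(response)
# xpath() returns a list, so take the first match before reading .text
sub_element = root.xpath('//ns0:return', namespaces=NSMAP)
if sub_element:
    sub_element = sub_element[0]
    print(sub_element.text)
    # This second parse is the call that raises XMLSyntaxError
    EstadoDoc_root = etree.fromstring(sub_element.text)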

The problem is that the contents of the <ns:return> element's text (the CDATA section) are not legal XML: a bare & must be written as &amp; in XML text. If you replace & with &amp; in that text before passing it to etree.fromstring, the parse should succeed.
In general, hiding XML in a CDATA section is not a good idea; this is only one example of the problems it can cause. If you have any influence over the party generating this XML, I'd recommend trying to get them to change it.
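As a rough sketch of that replacement, the snippet below escapes only the ampersands that do not already begin an entity or character reference, then parses the result. The helper name escape_bare_ampersands, the regex, and the shortened payload with its example.invalid URL are assumptions made for illustration; they are not part of the original question. The import falls back to the standard library's ElementTree so the sketch still runs where lxml is not installed.

```python
import re

try:
    from lxml import etree  # library used in the question
except ImportError:
    # Fallback so the sketch still runs where lxml is not installed.
    import xml.etree.ElementTree as etree

# Matches "&" unless it already starts an entity or character reference
# such as &amp; &#38; or &#x26;.
BARE_AMPERSAND = re.compile(
    r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#x[0-9A-Fa-f]+);)"
)


def escape_bare_ampersands(text):
    """Return text with every bare '&' replaced by '&amp;' (hypothetical helper)."""
    return BARE_AMPERSAND.sub("&amp;", text)


# Hypothetical, shortened stand-in for the CDATA payload.
payload = (
    "<EstadoDoc>"
    "<UrlDte>http://example.invalid/XMLServlet?docId=&abc==</UrlDte>"
    "</EstadoDoc>"
)

doc = etree.fromstring(escape_bare_ampersands(payload))
url = doc.findtext("UrlDte")
print(url)  # the parser turns &amp; back into a plain &
```

One limitation of this approach: a query parameter that happens to look like an entity reference (for example &copy; in a URL) is left untouched by the regex and would still be rejected by the XML parser as an undefined entity.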
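To illustrate what the producer could do instead, here is a minimal sketch that emits the payload as real child elements rather than as text inside CDATA. It uses the standard library's xml.etree.ElementTree, omits the SOAP namespaces, and uses a made-up example.invalid URL; all of that is simplification for the example, not the actual service's format.

```python
import xml.etree.ElementTree as ET

original_url = "http://example.invalid/XMLServlet?docId=&abc=="

# Producer side: build the payload as real elements. The serializer
# escapes the ampersand, so no hand-written escaping is needed.
outer = ET.Element("return")
estado = ET.SubElement(outer, "EstadoDoc")
ET.SubElement(estado, "UrlDte").text = original_url
wire = ET.tostring(outer, encoding="unicode")

# Consumer side: a single parse is enough, and the URL comes back intact.
received = ET.fromstring(wire)
url = received.findtext("EstadoDoc/UrlDte")
print(url)
```

With this layout the consumer never has to run a second parse on a string, so a malformed inner document cannot slip past the outer parser.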

Use the XML parser's recover option:

parser = etree.XMLParser(recover=True)

EstadoDoc_root = etree.fromstring(sub_element.text, parser=parser)

Then to grab the URLs (or change this to whatever you need):

print([x.text for x in EstadoDoc_root.xpath('//UrlCaratula|//UrlPdf')])

['http://G500603svGLH:8080/Facturacion/PDFServlet?docId=uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47',
 'http://G500603svGLH:8080/Facturacion/XMLServlet?docId=']

The second URL is missing the portion of the URL that comes after & ... Is there a way to avoid this?

Use the HTML parser, which tolerates the offending characters. Note that it lowercases tag names, so the XPath must use lowercase too:

from lxml import html
# Pass the element's text (a string), not the element itself
EstadoDoc_root = html.fromstring(sub_element.text)

print([x.text for x in EstadoDoc_root.xpath('//urlcaratula|//urlpdf')])

['http://G500603svGLH:8080/Facturacion/PDFServlet?docId=uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47',
 'http://G500603svGLH:8080/Facturacion/XMLServlet?docId=&uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47']
