lxml and CDATA and &

I have an XML document that contains a CDATA section, and inside that section there are tags whose URLs contain an ampersand. I am supposed to use lxml to read those tags, but I am getting an error.

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "src\lxml\lxml.etree.pyx", line 3228, in lxml.etree.fromstring (src\lxml\lxml.etree.c:79593)
  File "src\lxml\parser.pxi", line 1848, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:119112)
  File "src\lxml\parser.pxi", line 1729, in lxml.etree._parseDoc (src\lxml\lxml.etree.c:117670)
  File "src\lxml\parser.pxi", line 1063, in lxml.etree._BaseParser._parseUnicodeDoc (src\lxml\lxml.etree.c:111657)
  File "src\lxml\parser.pxi", line 595, in lxml.etree._ParserContext._handleParseResultDoc (src\lxml\lxml.etree.c:105880)
  File "src\lxml\parser.pxi", line 706, in lxml.etree._handleParseResult (src\lxml\lxml.etree.c:107588)
  File "src\lxml\parser.pxi", line 635, in lxml.etree._raiseParseError (src\lxml\lxml.etree.c:106442)
  File "<string>", line 9
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 9, column 98

How can I get past this error? Am I doing this right? Do we need to replace & with something?

The code is below:

from lxml import etree
ns0_NAMESPACE = "http://webservices.online.webapp.cl"
ns0 = "{%s}" % ns0_NAMESPACE
NSMAP = {'ns0':ns0_NAMESPACE}

response="""
<soapenv:Envelope xmlns:soapenv="http://schemas.xmlsoap.org/soap/envelope/">
    <soapenv:Body>
    <ns:OnlineGeneration2Response xmlns:ns="http://webservices.online.webapp.cl">
        <ns:return>
            <![CDATA[<EstadoDoc>
            <Estado>Ok<Estado>
            <RutEmisor>81201000-K</RutEmisor>
            <TipoDte>52</TipoDte>
            <FolioM>117620901</FolioM>
            <Folio>25022</Folio>
            <Glosa>NO INFORMADO</Glosa>
            <UrlDte>http://G500603svGLH:8080/Facturacion/XMLServlet?docId=&uR1v4VhQHvkPrUZDtY6hMg==</UrlDte>
            </EstadoDoc>
            <EstadoLote>
                <UrlPdf>http://G500603svGLH:8080/Facturacion/PDFServlet?docId=uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47</UrlPdf>
                <UrlCaratula>http://G500603svGLH:8080/Facturacion/XMLServlet?docId=&uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47</UrlCaratula>
            </EstadoLote>]]>
        </ns:return>
    </ns:OnlineGeneration2Response>
    </soapenv:Body>
</soapenv:Envelope>"""
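root = etree.fromstring(response)
# xpath() returns a list, so pick the first match before reading .text
sub_elements = root.xpath('//ns0:return', namespaces=NSMAP)
if sub_elements:
    sub_element = sub_elements[0]
    print(sub_element.text)
    # This second parse is the call that raises XMLSyntaxError
    EstadoDoc_root = etree.fromstring(sub_element.text)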

The problem is that the text of the <ns:return> element (the contents of the CDATA section) is not legal XML, because it contains a bare &. If you replace & with &amp; in that text before passing it to etree.fromstring, this error should go away.
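
The replacement described above could be sketched roughly as follows. The helper name escape_bare_ampersands, the regular expression, and the shortened payload are illustrative assumptions, not part of the original post. The pattern only touches an & that does not already start an entity or character reference, so text that is already escaped is left alone. The import falls back to the standard library only so that the sketch runs where lxml is not installed.

```python
import re

try:
    from lxml import etree  # the parser used in the question
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs
    import xml.etree.ElementTree as etree

# Matches an '&' that does NOT begin a reference such as &amp;, &#38; or &#x26;
BARE_AMPERSAND = re.compile(r'&(?!(?:[A-Za-z_][\w.-]*|#[0-9]+|#x[0-9A-Fa-f]+);)')


def escape_bare_ampersands(text):
    """Replace each bare '&' with '&amp;'; existing references are untouched."""
    return BARE_AMPERSAND.sub('&amp;', text)


# Shortened, hypothetical stand-in for the text inside the CDATA section
payload = (
    "<EstadoLote>"
    "<UrlCaratula>http://example.invalid/XMLServlet?docId=&abc==</UrlCaratula>"
    "<Glosa>A &amp; B</Glosa>"
    "</EstadoLote>"
)

fixed = escape_bare_ampersands(payload)
lote = etree.fromstring(fixed)

# The parser turns '&amp;' back into '&', so the URL comes out intact
url = lote.find('UrlCaratula').text
glosa = lote.find('Glosa').text
print(url)
print(glosa)
```

In the question's code, the same helper would be applied to sub_element.text before the second etree.fromstring call. One caveat of this heuristic: a URL fragment that happens to look like a reference (for example &b;) would be left unescaped.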
In general, hiding XML in a CDATA section is not a good idea; this is only one example of the problems it can cause. If you have any influence over the party generating this XML, I'd recommend trying to get them to change it.
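
As a rough illustration of why a change on the generating side helps, here is a minimal sketch that assumes the producer can build the inner document with an XML API instead of concatenating strings. The element names mirror the question, while the URL is a made-up placeholder. The serializer escapes the ampersand on its own, so the consumer can parse the result without any workaround.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs
    import xml.etree.ElementTree as etree

# Hypothetical producer side: build the inner document through the API
lote = etree.Element('EstadoLote')
caratula = etree.SubElement(lote, 'UrlCaratula')
caratula.text = 'http://example.invalid/XMLServlet?docId=&abc=='

# Serializing escapes '&' as '&amp;' automatically
inner_xml = etree.tostring(lote).decode('ascii')
print(inner_xml)

# Consumer side: the serialized text is well-formed and round-trips cleanly
parsed = etree.fromstring(inner_xml)
round_tripped = parsed.find('UrlCaratula').text
print(round_tripped)
```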

Use the XML parser's recover option:

parser = etree.XMLParser(recover=True)

EstadoDoc_root = etree.fromstring(sub_element.text, parser=parser)

Then, to grab the URLs (or change the XPath to whatever you need):

print([x.text for x in EstadoDoc_root.xpath('//UrlCaratula|//UrlPdf')])

['http://G500603svGLH:8080/Facturacion/PDFServlet?docId=uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47',
 'http://G500603svGLH:8080/Facturacion/XMLServlet?docId=']

The second URL is missing the part that comes after the &. Is there a way to avoid this?

Use the HTML parser to normalize and handle the offending characters (note the lowercase tag names in the XPath):

from lxml import html
# Pass the element's text (the CDATA contents), not the element itself
EstadoDoc_root = html.fromstring(sub_element.text)

print([x.text for x in EstadoDoc_root.xpath('//urlcaratula|//urlpdf')])

['http://G500603svGLH:8080/Facturacion/PDFServlet?docId=uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47',
 'http://G500603svGLH:8080/Facturacion/XMLServlet?docId=&uR1v4VhQHvmQJLl22c1DFOLW3c4qbQ47']
