
Converting xml to html using Python

I have a page like this:

<?xml version="1.0" encoding="utf-8"?>\r\n<HTMLReturn xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://gccwebapps/PROWWS/">\r\n  <Result>OK</Result>\r\n  <ErrorMessageNewLine>\n</ErrorMessageNewLine>\r\n  <ErrorMessage />\r\n  <ID />\r\n  <HTML>&lt;div id=\'DivPROWContainer\' class=\'PROWContainer\'&gt;\n&lt;div id=\'DivTableGCCDocsHolder\' class=\'TableGCCDocsHolder\'&gt;\n&lt;table id=\'TableDisplayTable\' class=\'DisplayTable DisplayGCCDocsTable HtmlDataTable\'&gt;\n&lt;tbody&gt;\n&lt;tr class=\'DisplayTableHeaderRow HtmlDataTableHeaderRow DisplayTableTopRow\'&gt;\n&lt;th colspan=\'5\'&gt;Documents available for the planning Application&lt;/th&gt;\n&lt;/tr&gt;\n&lt;tr class=\'DisplayTableHeaderRow HtmlDataTableHeaderRow\'&gt;\n&lt;th&gt;Application Number&lt;/th&gt;\n&lt;th&gt;Plan number&lt;/th&gt;\n&lt;th&gt;Document type&lt;/th&gt;\n&lt;th&gt;Description&lt;/th&gt;\n&lt;th&gt;Date Entered&lt;/th&gt;\n&lt;/tr&gt;\n&lt;tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'&gt;\n&lt;td&gt;&lt;a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_DEC_LET.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'&gt;22/0001/NONMAT\n&lt;/a&gt;&lt;/td&gt;\n&lt;td&gt;&lt;/td&gt;\n&lt;td&gt;Text&lt;/td&gt;\n&lt;td&gt;&lt;a id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_DEC_LET.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'&gt;Decision Letter\n&lt;/a&gt;&lt;/td&gt;\n&lt;td&gt;26/01/2022&lt;/td&gt;\n&lt;/tr&gt;\n&lt;tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'&gt;\n&lt;td&gt;&lt;a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' 
href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_APP_FORM_RED.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'&gt;22/0001/NONMAT\n&lt;/a&gt;&lt;/td&gt;\n&lt;td&gt;&lt;/td&gt;\n&lt;td&gt;Plan&lt;/td&gt;\n&lt;td&gt;&lt;a id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_APP_FORM_RED.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'&gt;Application Form 9Redacted)\n&lt;/a&gt;&lt;/td&gt;\n&lt;td&gt;10/01/2022&lt;/td&gt;\n&lt;/tr&gt;\n&lt;tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'&gt;\n&lt;td&gt;&lt;a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_LAND_PLAN_P20_2956_05D.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'&gt;22/0001/NONMAT\n&lt;/a&gt;&lt;/td&gt;\n&lt;td&gt;P20_2956_05D&lt;/td&gt;\n&lt;td&gt;Text&lt;/td&gt;\n&lt;td&gt;&lt;a id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_LAND_PLAN_P20_2956_05D.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'&gt;Landscape MasterPlan 04.01.22\n&lt;/a&gt;&lt;/td&gt;\n&lt;td&gt;10/01/2022&lt;/td&gt;\n&lt;/tr&gt;\n&lt;tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'&gt;\n&lt;td&gt;&lt;a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_ELEC_SERV_190123_SC_XX_XX_DR_E_600.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener 
noreferrer\'&gt;22/0001/NONMAT\n&lt;/a&gt;&lt;/td&gt;\n&lt;td&gt;190123_SC_XX_XX_DR_E_600&lt;/td&gt;\n&lt;td&gt;Plan&lt;/td&gt;\n&lt;td&gt;&lt;a id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_ELEC_SERV_190123_SC_XX_XX_DR_E_600.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'&gt;Electrical Services Site Wide\n&lt;/a&gt;&lt;/td&gt;\n&lt;td&gt;10/01/2022&lt;/td&gt;\n&lt;/tr&gt;\n&lt;/tbody&gt;\n\n&lt;/table&gt;\n&lt;/div&gt;\n&lt;div class=\'PROWDefaultFooter\'&gt;\n&lt;div class=\'PROWFooter1\'&gt;© 2014-21 Gloucestershire County Council, Shire Hall, Westgate Street, Gloucester GL1 2TG.\n&lt;/div&gt;\n&lt;div class=\'PROWFooter2\'&gt;&lt;STRONG&gt;Telephone:&lt;/STRONG&gt;+44(0)1452 425000 - &lt;STRONG&gt; Out of hours:&lt;/STRONG&gt; +44(0)845 6677788\n&lt;/div&gt;\n&lt;div class=\'PROWFooter2\'&gt;\n&lt;a id=\'AGCCLink\' class=\'GCCFooterLink\' href=\'http://www.gloucestershire.gov.uk\' data-DisableMeWhenSomethingChanged=\'1\'&gt;www.gloucestershire.gov.uk\n&lt;/a&gt;\n&lt;/div&gt;\n&lt;/div&gt;\n&lt;/div&gt;\n</HTML>\r\n  <Script>gcc_docs_startScreenSetup();</Script>\r\n</HTMLReturn>

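I need to find elements in it using XPath (without namespaces). I tried several variants, but each one gives me only a very short result (len() of 5 or 6) and empty XPath matches.

These are the variants I tried. As you can see, none of them works.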

import lxml.html as html
res = html.fromstring(sec_response.body)
len(res)
5
res.xpath('//div')
[]

import xml.etree.ElementTree as ET
xhtml = ET.fromstring(sec_response.text)
len(xhtml)
6
xhtml.xpath('//div')
*** AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'xpath'

from lxml import etree
xslt_root = etree.XML(sec_response.body)
len(xslt_root)
6
xslt_root.xpath('//div')
[]

sec_response.selector.remove_namespaces()
sec_response.xpath('//td')
[]
sec_response.xpath('//tr')
[]

Please show a way to convert it so that XPath works on it (I need to look up the //tr, //td, or //a elements).

scrapy shell file:///....../temp.xml # your page's code

In [1]: response.xpath('//div')
Out[1]: []

In [2]: import html

In [3]: from scrapy.selector import Selector

In [4]: response.selector.remove_namespaces()

In [5]: text = html.unescape(response.text)

In [6]: sel = Selector(text=text)

In [7]: sel.xpath('//div')
Out[7]:
[<Selector xpath='//div' data='<div id="\\\'DivPROWContainer\\\'" class=...'>,
 <Selector xpath='//div' data='<div id="\\\'DivTableGCCDocsHolder\\\'" c...'>,
 <Selector xpath='//div' data='<div class="\\\'PROWDefaultFooter\\\'">\\n...'>,
 <Selector xpath='//div' data='<div class="\\\'PROWFooter1\\\'">© 2014-2...'>,
 <Selector xpath='//div' data='<div class="\\\'PROWFooter2\\\'"><strong>...'>,
 <Selector xpath='//div' data='<div class="\\\'PROWFooter2\\\'">\\n<a id=...'>]
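The attempts in the question return nothing because the table is not part of the XML tree: it is escaped text stored inside the <HTML> element, so it has to be parsed a second time. As a minimal sketch of the same idea using only the standard library, the snippet below reads that element's text (the XML parser has already turned &lt; and &gt; back into angle brackets) and parses it again. SAMPLE is a shortened, made-up stand-in for the real response, and the example.invalid URL is a placeholder. The sketch assumes the embedded markup happens to be well-formed XML; if it is not, an HTML parser such as lxml.html.fromstring would be needed for the second step instead.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the response shown in the question.
SAMPLE = """<HTMLReturn xmlns="http://gccwebapps/PROWWS/">
  <Result>OK</Result>
  <HTML>&lt;div id='DivPROWContainer'&gt;&lt;table&gt;&lt;tr&gt;&lt;td&gt;&lt;a href='https://example.invalid/a.pdf'&gt;Decision Letter&lt;/a&gt;&lt;/td&gt;&lt;td&gt;26/01/2022&lt;/td&gt;&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;</HTML>
</HTMLReturn>"""

# The outer document declares a default namespace, so bind it to a prefix.
NS = {"p": "http://gccwebapps/PROWWS/"}

root = ET.fromstring(SAMPLE)

# Step 1: the <HTML> element holds the markup as plain, already-unescaped text.
html_text = root.find("p:HTML", NS).text

# Step 2: parse that text as its own document (works only if it is well-formed).
inner = ET.fromstring(html_text)

# ElementTree has no .xpath(); findall() accepts a limited XPath subset.
cells = inner.findall(".//td")
hrefs = [a.get("href") for a in inner.findall(".//a")]

print(inner.tag, len(cells), hrefs)
```

Compared with unescaping the whole response, this only re-parses the payload of the <HTML> element, so the XML wrapper (Result, ErrorMessage, Script) never ends up mixed into the HTML tree.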
