Converting XML to HTML using Python
I have a page like this:
<?xml version="1.0" encoding="utf-8"?>\r\n<HTMLReturn xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://gccwebapps/PROWWS/">\r\n <Result>OK</Result>\r\n <ErrorMessageNewLine>\n</ErrorMessageNewLine>\r\n <ErrorMessage />\r\n <ID />\r\n <HTML><div id=\'DivPROWContainer\' class=\'PROWContainer\'>\n<div id=\'DivTableGCCDocsHolder\' class=\'TableGCCDocsHolder\'>\n<table id=\'TableDisplayTable\' class=\'DisplayTable DisplayGCCDocsTable HtmlDataTable\'>\n<tbody>\n<tr class=\'DisplayTableHeaderRow HtmlDataTableHeaderRow DisplayTableTopRow\'>\n<th colspan=\'5\'>Documents available for the planning Application</th>\n</tr>\n<tr class=\'DisplayTableHeaderRow HtmlDataTableHeaderRow\'>\n<th>Application Number</th>\n<th>Plan number</th>\n<th>Document type</th>\n<th>Description</th>\n<th>Date Entered</th>\n</tr>\n<tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'>\n<td><a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_DEC_LET.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>22/0001/NONMAT\n</a></td>\n<td></td>\n<td>Text</td>\n<td><a id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_DEC_LET.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>Decision Letter\n</a></td>\n<td>26/01/2022</td>\n</tr>\n<tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'>\n<td><a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_APP_FORM_RED.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>22/0001/NONMAT\n</a></td>\n<td></td>\n<td>Plan</td>\n<td><a 
id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_APP_FORM_RED.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>Application Form 9Redacted)\n</a></td>\n<td>10/01/2022</td>\n</tr>\n<tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'>\n<td><a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_LAND_PLAN_P20_2956_05D.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>22/0001/NONMAT\n</a></td>\n<td>P20_2956_05D</td>\n<td>Text</td>\n<td><a id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_LAND_PLAN_P20_2956_05D.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>Landscape MasterPlan 04.01.22\n</a></td>\n<td>10/01/2022</td>\n</tr>\n<tr class=\'DisplayTableDataRow HtmlDataTableRow ResultRowAlternative\'>\n<td><a id=\'AFormLink_APP_NO\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_ELEC_SERV_190123_SC_XX_XX_DR_E_600.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>22/0001/NONMAT\n</a></td>\n<td>190123_SC_XX_XX_DR_E_600</td>\n<td>Plan</td>\n<td><a id=\'AFormLink_DESCRIPTION\' class=\'FormHyperLink\' href=\'https://ww3.gloucestershire.gov.uk/PROW/PROWWS.asmx/GetFileGCCContents?Filename=images%2f22_0001_NONMAT_ELEC_SERV_190123_SC_XX_XX_DR_E_600.PDF\' data-DisableMeWhenSomethingChanged=\'1\' target=\'_blank\' rel=\'noopener noreferrer\'>Electrical Services Site Wide\n</a></td>\n<td>10/01/2022</td>\n</tr>\n</tbody>\n\n</table>\n</div>\n<div class=\'PROWDefaultFooter\'>\n<div 
class=\'PROWFooter1\'>© 2014-21 Gloucestershire County Council, Shire Hall, Westgate Street, Gloucester GL1 2TG.\n</div>\n<div class=\'PROWFooter2\'><STRONG>Telephone:</STRONG>+44(0)1452 425000 - <STRONG> Out of hours:</STRONG> +44(0)845 6677788\n</div>\n<div class=\'PROWFooter2\'>\n<a id=\'AGCCLink\' class=\'GCCFooterLink\' href=\'http://www.gloucestershire.gov.uk\' data-DisableMeWhenSomethingChanged=\'1\'>www.gloucestershire.gov.uk\n</a>\n</div>\n</div>\n</div>\n</HTML>\r\n <Script>gcc_docs_startScreenSetup();</Script>\r\n</HTMLReturn>
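I need to find elements in it with XPath (ignoring namespaces). I tried several variants, but each one gives me a very short, empty result (a length of 5-6).
These are the variants I tried. As you can see, none of them works: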
# Attempt 1: lxml.html
import lxml.html as html
res = html.fromstring(sec_response.body)
len(res)
5
res.xpath('//div')
[]

# Attempt 2: xml.etree.ElementTree
import xml.etree.ElementTree as ET
xhtml = ET.fromstring(sec_response.text)
len(xhtml)
6
xhtml.xpath('//div')
*** AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'xpath'

# Attempt 3: lxml.etree
from lxml import etree
xslt_root = etree.XML(sec_response.body)
len(xslt_root)
6
xslt_root.xpath('//div')
[]

# Attempt 4: Scrapy selector with namespaces removed
sec_response.selector.remove_namespaces()
sec_response.xpath('//td')
[]
sec_response.xpath('//tr')
[]
Please show a way to convert it so that XPath works on it (I need to find //tr, //td, or //a elements).
scrapy shell file:///....../temp.xml # your page's code
In [1]: response.xpath('//div')
Out[1]: []
In [2]: import html
In [3]: from scrapy.selector import Selector
In [4]: response.selector.remove_namespaces()
In [5]: text = html.unescape(response.text)
In [6]: sel = Selector(text=text)
In [7]: sel.xpath('//div')
Out[7]:
[<Selector xpath='//div' data='<div id="\\\'DivPROWContainer\\\'" class=...'>,
<Selector xpath='//div' data='<div id="\\\'DivTableGCCDocsHolder\\\'" c...'>,
<Selector xpath='//div' data='<div class="\\\'PROWDefaultFooter\\\'">\\n...'>,
<Selector xpath='//div' data='<div class="\\\'PROWFooter1\\\'">© 2014-2...'>,
<Selector xpath='//div' data='<div class="\\\'PROWFooter2\\\'"><strong>...'>,
<Selector xpath='//div' data='<div class="\\\'PROWFooter2\\\'">\\n<a id=...'>]
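
The need for html.unescape in the session above suggests that the markup inside <HTML> arrives as entity-escaped text rather than as real child elements, so it has to be parsed a second time. Below is a minimal standard-library sketch of that idea, not a definitive implementation. It assumes the embedded fragment is well-formed enough for an XML parser, which is not guaranteed for HTML in general. SAMPLE, NS and extract_fragment are hypothetical names, and SAMPLE is a shortened stand-in with a placeholder URL, not the real response.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened stand-in for the response body (placeholder URL).
# Assumption: the markup inside <HTML> is entity-escaped (&lt;div ...&gt;).
SAMPLE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<HTMLReturn xmlns="http://gccwebapps/PROWWS/">\n'
    "  <Result>OK</Result>\n"
    "  <HTML>&lt;div id='DivPROWContainer'&gt;&lt;table&gt;&lt;tr&gt;"
    "&lt;td&gt;&lt;a href='https://example.invalid/a.pdf'&gt;Decision Letter"
    "&lt;/a&gt;&lt;/td&gt;&lt;td&gt;26/01/2022&lt;/td&gt;"
    "&lt;/tr&gt;&lt;/table&gt;&lt;/div&gt;</HTML>\n"
    "</HTMLReturn>\n"
).encode("utf-8")

# The outer document declares a default namespace, so map it to a prefix.
NS = {"p": "http://gccwebapps/PROWWS/"}


def extract_fragment(xml_bytes):
    """Return the text of the <HTML> element, or '' if it is missing."""
    root = ET.fromstring(xml_bytes)
    node = root.find("p:HTML", NS)
    if node is None:
        return ""
    # The XML parser has already decoded &lt; and &gt;, so .text is markup.
    return node.text or ""


fragment = extract_fragment(SAMPLE)

# Second pass: parse the recovered fragment itself (assumed well-formed).
doc = ET.fromstring(fragment)

rows = doc.findall(".//tr")
links = [(a.text, a.get("href")) for a in doc.findall(".//a")]
print(len(rows), links)
```

ElementTree supports only a subset of XPath through find and findall. If the fragment is not well-formed, passing it to lxml.html.fromstring and calling .xpath('//tr') on the result is a more forgiving alternative.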