Namespace argument in lxml parsing
I have an HTML page that I am trying to parse. Here is what I'm doing with lxml:
>>> from lxml import etree
>>> node = etree.fromstring(html)
>>> node
<Element {http://www.w3.org/1999/xhtml}html at 0x110676a70>
>>> node.xpath('//body')
[]
>>> node.xpath('body')
[]
Unfortunately, all my xpath calls are now returning an empty list. Why is this occurring, and how would I fix this call?
You can add a namespace here, as follows:
>>> node.xpath('//xmlns:tr', namespaces={'xmlns':'http://www.w3.org/1999/xhtml'})
[<Element {http://www.w3.org/1999/xhtml}tr at 0x11067b6c8>, <Element {http://www.w3.org/1999/xhtml}tr at 0x11067b710>]
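To illustrate why the unqualified queries come back empty, here is a minimal sketch. The sample markup and the prefix `x` are assumptions made up for the example; only the namespace URI is taken from the question. Every element in the parsed tree lives in the XHTML namespace, and an XPath name without a prefix only matches elements in no namespace.

```python
from lxml import etree

# Hypothetical stand-in for the page in the question: the root element
# declares the XHTML namespace as its default namespace.
XHTML = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<body><table><tr><td>a</td></tr><tr><td>b</td></tr></table></body>'
    '</html>'
)

node = etree.fromstring(XHTML)

# The prefix name is arbitrary; it only has to map to the right URI.
ns = {'x': 'http://www.w3.org/1999/xhtml'}

# An unprefixed name matches only elements that are in no namespace.
unqualified = node.xpath('//tr')

# A prefixed name matches elements in the namespace bound to that prefix.
qualified = node.xpath('//x:tr', namespaces=ns)

print(len(unqualified), len(qualified))
```

The prefix used in the query does not need to match anything in the document itself; what matters is the URI it is bound to in the `namespaces` dictionary.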
A better way to do it would be to use lxml's HTML parser:
>>> import lxml.html
>>> node=lxml.html.fromstring(html)
>>> node.findall('body')
[<Element body at 0x1106b8f18>]
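As a rough, self-contained version of the same idea (the sample markup below is an assumption, not the asker's page): the HTML parser treats `xmlns` as an ordinary attribute, so the resulting elements carry no namespace and plain names work in both `findall` and `xpath`.

```python
import lxml.html

# Hypothetical sample page; the xmlns declaration mirrors the question.
PAGE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>t</title></head>'
    '<body><p>hi</p></body>'
    '</html>'
)

node = lxml.html.fromstring(PAGE)

# Tags are plain names here, with no {namespace} part in front.
root_tag = node.tag

# Unprefixed queries now find the element.
bodies = node.xpath('//body')
children = node.findall('body')

print(root_tag, len(bodies), len(children))
```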
You need to use the namespace prefix while querying, like this:
node.xpath('//html:body', namespaces={'html': 'http://...'})
Or you can use .nsmap:
node.xpath('//html:body', namespaces=node.nsmap)
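This assumes all the namespaces are declared on the tag that node points to, which is usually true for most XML documents. It also assumes the namespace is bound to the prefix html in the document: a default namespace (one declared without a prefix) is stored in .nsmap under the key None, which cannot be used as an XPath prefix.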
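A minimal sketch of the .nsmap approach for a document like the one in the question, where the namespace is declared as the default. The sample markup and the choice of `html` as the explicit prefix are assumptions for illustration. The mapping is rebuilt so that the `None` entry gets a usable prefix before it is passed to `xpath`.

```python
from lxml import etree

# Hypothetical sample document with a default (unprefixed) namespace.
XHTML = '<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>'
node = etree.fromstring(XHTML)

# The default namespace is stored under the key None.
print(node.nsmap)

# Keep every prefixed entry and drop the None key.
ns = {prefix: uri for prefix, uri in node.nsmap.items() if prefix is not None}

# Bind the default namespace to an explicit prefix of our choosing.
if None in node.nsmap:
    ns['html'] = node.nsmap[None]

bodies = node.xpath('//html:body', namespaces=ns)
print(bodies)
```

If the document already binds the namespace to a prefix (for example `xmlns:html="..."` on the root element), passing `node.nsmap` directly should be enough.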