How to get the full contents of a node using XPath & lxml?

I am using lxml's xpath function to retrieve parts of a webpage. I am trying to get the contents of a <font> tag, which includes HTML tags of its own. If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]

I get the right number of nodes, but they are returned as lxml objects ( <Element font at 0x101fe5eb0> ).

If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/text()

I get exactly what I want, except that I don't get any of the HTML code contained within the <font> nodes.

If I use

//td[@valign="top"]/p[1]/font[@face="verdana" and @color="#ffffff" and @size="2"]/node()

I get a mixture of text and lxml elements (e.g. something something <Element a at 0x102ac2140> something )!

Is there any way to use a pure XPath query to get the contents of the <font> nodes, or even to force lxml to return a string of the contents from the .xpath() method, rather than an lxml object?

Note that I'm returning a list of many nodes from the XPath query, so the solution needs to support that.

Just to clarify: I want to return something something <a href="url">inside</a> something from something like...

<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>

Short answer: No.

XPath doesn't work on "tags" but on nodes.

The selected nodes are represented as instances of specific objects in the language that is hosting XPath.

If you need the string representation of a particular node's markup, such objects typically support an outerXml property; check the documentation of the hosting environment (lxml in this case).

As @Robert-Rossney pointed out in his comment: lxml's tostring() method is equivalent to other environments' outerXml property.

I'm not sure I understand. Is this close to what you are looking for?

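import io
import lxml.etree as le

content = '''\
<font face="verdana" color="#ffffff" size="2"><a href="url">inside</a> something</font>
'''
# Python 3: io.StringIO replaces the Python 2-only cStringIO module
doc = le.parse(io.StringIO(content))

# child::* selects only the element children of each matching <font>
xpath = '//font[@face="verdana" and @color="#ffffff" and @size="2"]/child::*'
x = doc.xpath(xpath)
# tostring() serializes each element together with its trailing (tail) text;
# encoding="unicode" returns str rather than bytes
print([le.tostring(el, encoding="unicode") for el in x])
# ['<a href="url">inside</a> something']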
