
How to select only certain tags and text using XPath?

For example, given this HTML block:

<p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p>

I need to select all "a" tags, and everything else should come back as plain text, just as it appears in a browser:

result = ["text1", " (", <tag_a>, "text2", ")"]

or something like that.

Tried:

hxs.select('.//a|text()')

In this case it finds all "a" tags, but text is returned only from direct children.

At the same time:

hxs.select('.//text()|a')

gets all text nodes, but "a" tags only from direct children.

UPDATE

    elements = []
    for i in hxs.select('.//node()'):
        try:
            tag_name = i.select('name()').extract()[0]
        except TypeError:
            tag_name = '_text'

        if tag_name == 'a':
            elements.append(i)
        elif tag_name == '_text':
            elements.append(i.extract())

Is there a better way?

It looks to me as if you are stepping beyond XPath territory. XPath is good at selecting things from the input but not at constructing output. It was designed, of course, for use with XSLT, where XSLT instructions handle the output side. I'm not sure what the Python equivalent would be.
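One possible Python equivalent is to let ordinary code build the output instead of XPath. The sketch below is only an illustration, not something from the thread: it uses the standard library's xml.etree.ElementTree, assumes the block is well-formed XML, and the helper name flatten and its keep_tag parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def flatten(element, keep_tag="a"):
    """Return text strings and kept elements in document order.

    Hypothetical helper: elements named keep_tag are kept whole,
    every other element is replaced by its contents.
    """
    items = []
    if element.text:
        items.append(element.text)
    for child in element:
        if child.tag == keep_tag:
            items.append(child)  # keep the element itself, do not descend
        else:
            items.extend(flatten(child, keep_tag))  # unwrap other tags
        if child.tail:
            items.append(child.tail)  # text that follows the child
    return items


# Assumes well-formed markup; real-world HTML would need an HTML parser.
block = ET.fromstring('<p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p>')
result = flatten(block)
```

Here result mixes plain strings with Element objects, so callers can tell them apart with isinstance(item, str).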

Is this the kind of thing you're looking for?

You can remove the descendant tags from the block using etree.strip_tags:

from lxml import etree
d = etree.HTML('<html><body><p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p></body></html>')
block = d.xpath('/html/body/p')[0]
# etree.strip_tags takes tag names as separate positional arguments, not a list,
# so unpack the set of every descendant tag other than 'a' with *
etree.strip_tags(block, *{x.tag for x in block.iterdescendants() if x.tag != 'a'})

block.xpath('./text()|a')

Yields:

['text1', ' (', <Element a at fa4a48>, 'text2', ')']

These relative XPath expressions:

.//text()|.//a

Or:

.//node()[self::text()|self::a]

Meaning: all descendant text nodes or "a" elements of the context node.

Note: It's up to the host language or the XPath engine whether this node-set result is ordered by document order or not. By definition, node sets are unordered.
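As a rough illustration, these expressions can be tried with lxml, the library used in the answer above. This is a sketch rather than a definitive recipe. The third expression, with the [not(ancestor::a)] predicate, is an added assumption about what the question wants: it drops the text inside each "a" element so that "asdf" is not reported twice.

```python
from lxml import etree

html = '<p><b>text1</b> (<span><a href="#1">asdf</a>text2</span>)</p>'
block = etree.HTML(html).xpath('//p')[0]

# Both expressions select every descendant text node plus every <a> element.
nodes = block.xpath('.//text()|.//a')
same_nodes = block.xpath('.//node()[self::text()|self::a]')

# Assumption: to skip the text that lives inside <a>, filter it out explicitly.
filtered = block.xpath('.//text()[not(ancestor::a)]|.//a')

# Text results behave like str; elements are lxml element objects.
labels = [n if isinstance(n, str) else '<%s>' % n.tag for n in filtered]
```

Whether labels comes back in document order depends on the engine, as noted above, so code that relies on the order should check it rather than assume it.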

