Strip only tags with a specific attribute/value using Python and lxml

Question

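I'm familiar with etree's strip_tags and strip_elements methods, but I'm looking for a straightforward way to strip tags (and keep their contents) only when they carry a particular attribute/value.

For instance: I'd like to strip all span or div tags (or other elements) that have a class='myclass' attribute/value from a tree (XHTML), preserving the element's contents as strip_tags would. Meanwhile, the same elements that don't have class='myclass' should remain untouched.

Conversely: I'd like a way to strip all "naked" spans or divs from a tree, meaning only those spans/divs (or any other elements) that have absolutely no attributes, while leaving the same elements that have attributes (any) untouched.

I feel like I'm missing something obvious, but I've been searching for quite some time without any luck.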

Answer 1

HTML

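lxml's HTML elements have a method drop_tag(), which you can call on any element in a tree parsed by lxml.html.

It acts like strip_tags in that it removes the element but keeps its content, and it can be called on an individual element. That means you can easily select the elements you're not interested in with an XPath expression, then loop over them and remove them: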

doc.html

<html>
    <body>
        <div>This is some <span attr="foo">Text</span>.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get <span attr="foo">removed</span> as well.</div>
        <div>Nested elements <span attr="foo">will <b>be</b> left</span> alone.</div>
        <div>Unless <span attr="foo">they <span attr="foo">also</span> match</span>.</div>
    </body>
</html>

strip.py

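from lxml import etree
from lxml import html

# html.parse() accepts a filename and returns an ElementTree
doc = html.parse('doc.html')
spans_with_attrs = doc.xpath("//span[@attr='foo']")

# drop_tag() removes the tag but keeps its text and children
for span in spans_with_attrs:
    span.drop_tag()

# encoding="unicode" returns str instead of bytes (Python 3)
print(etree.tostring(doc, encoding="unicode"))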

Output:

<html>
    <body>
        <div>This is some Text.</div>
        <div>Some <span>more</span> text.</div>
        <div>Yet another line <span attr="bar">of</span> text.</div>
        <div>This span will get removed as well.</div>
        <div>Nested elements will <b>be</b> left alone.</div>
        <div>Unless they also match.</div>
    </body>
</html>

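In this case, the XPath expression //span[@attr='foo'] selects all span elements that have an attribute attr with the value foo. See this XPath tutorial for more details on how to construct XPath expressions.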
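For the two cases in the question, the same loop should work with different predicates. This is only a sketch under assumptions: the sample markup is made up, [@class='myclass'] matches the class value exactly (so class="myclass other" would not match), and not(@*) is the standard XPath test for "has no attributes at all".

```python
from lxml import html

# Hypothetical sample markup, not taken from the question.
markup = (
    '<div id="root">'
    '<span class="myclass">A</span> '
    '<span>B</span> '
    '<span id="keep">C</span> '
    '<div class="myclass">D <b>E</b></div>'
    '</div>'
)
root = html.fromstring(markup)

# Case 1: span or div elements whose class attribute is exactly "myclass".
for el in root.xpath("//*[self::span or self::div][@class='myclass']"):
    el.drop_tag()

# Case 2: "naked" spans, i.e. spans without any attribute.
for el in root.xpath("//span[not(@*)]"):
    el.drop_tag()

result = html.tostring(root, encoding="unicode")
print(result)
```

The span with id="keep" is left alone in both passes because it has an attribute and no matching class.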

XML / XHTML

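Edit: I just noticed that you specifically mention XHTML in your question, which according to the docs is better parsed as XML. Unfortunately, the drop_tag() method really is only available for elements in an HTML document.

So for XML it's a bit more complicated: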

doc.xml

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>

strip.py

from lxml import etree


def strip_nodes(nodes):
    for node in nodes:
        # string() yields the concatenated text of the node and its descendants
        text_content = node.xpath('string()')

        # Include tail in full_text because it will be removed with the node
        full_text = text_content + (node.tail or '')

        parent = node.getparent()
        prev = node.getprevious()
        # Compare with None: an element without children is falsy
        if prev is not None:
            # There is a previous node, append text to its tail
            prev.tail = (prev.tail or '') + full_text
        else:
            # It's the first node in <parent/>, append to parent's text
            parent.text = (parent.text or '') + full_text
        parent.remove(node)


doc = etree.parse('doc.xml')
nodes = doc.xpath("//span[@attr='foo']")
strip_nodes(nodes)

print(etree.tostring(doc, encoding="unicode"))

Output:

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this first span should <span>be</span> removed.</node>
</document>

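As you can see, this replaces the node and all of its children with its recursive text content. I really hope that's what you want, otherwise things get even more complicated ;-)

Note: the last edit changed the code in question.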
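If flattening nested markup to plain text is not what you want, a rough sketch of an "unwrap" that keeps the child elements could look like the following. The helper name unwrap is my own and not part of lxml; it does the same text/tail bookkeeping as above and then splices the children into the parent.

```python
from lxml import etree


def unwrap(node):
    """Remove `node` but keep its text, children and tail in place.

    Hypothetical helper for illustration, not an lxml API.
    """
    parent = node.getparent()
    if parent is None:
        raise ValueError("cannot unwrap the root element")
    prev = node.getprevious()

    # The node's leading text joins whatever precedes the node.
    if node.text:
        if prev is None:
            parent.text = (parent.text or "") + node.text
        else:
            prev.tail = (prev.tail or "") + node.text

    # The node's tail goes after its last child, or after the preceding content.
    if node.tail:
        if len(node):
            node[-1].tail = (node[-1].tail or "") + node.tail
        elif prev is None:
            parent.text = (parent.text or "") + node.tail
        else:
            prev.tail = (prev.tail or "") + node.tail

    # Replace the node with its children at the same position.
    index = parent.index(node)
    parent[index:index + 1] = list(node)


doc = etree.fromstring(
    '<document><node>Only this <span attr="foo">first <b>span</b></span>'
    ' should <span>be</span> removed.</node></document>'
)
for span in doc.xpath("//span[@attr='foo']"):
    unwrap(span)

result = etree.tostring(doc, encoding="unicode")
print(result)
```

Here the <b> element stays in the tree instead of being reduced to its text.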

Answer 2

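I just ran into the same problem, and after some consideration I had this rather hacky idea, borrowed from regex tricks in Perl one-liners: how about first catching all unwanted elements with all the power element.iterfind brings, renaming those elements to something unlikely, and then stripping all of those elements?

Yes, this isn't absolutely clean and robust, since there could always be a document that actually uses the "unlikely" tag name you've chosen, but the resulting code is rather clean and easy to maintain. If you really need to be sure that the "unlikely" name doesn't already exist in the document, you can always check for it first and only do the renaming if you find no pre-existing tags with that name.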

doc.xml

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this <span attr="foo">first <b>span</b></span> should <span>be</span> removed.</node>
</document>

strip.py

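from lxml import etree

xml = etree.parse("doc.xml")
deltag = "xxyyzzdelme"
# Use a relative path (".//"); list() snapshots the matches before renaming
for el in list(xml.iterfind(".//span[@attr='foo']")):
    el.tag = deltag
# strip_tags() removes the renamed tags but keeps their text and children
etree.strip_tags(xml, deltag)
print(etree.tostring(xml, encoding="unicode", pretty_print=True))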

Output:

<document>
    <node>This is <span>some</span> text.</node>
    <node>Only this first <b>span</b> should <span>be</span> removed.</node>
</document>
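The pre-check mentioned above could be sketched like this. The helper names unused_tag and strip_matching are made up for illustration; the idea is simply to collect the tag names already in use and append a counter until the placeholder is free.

```python
from lxml import etree


def unused_tag(tree, base="xxyyzzdelme"):
    """Return a tag name that no element in `tree` currently uses."""
    # Comments and processing instructions have a non-string tag; skip them.
    existing = {el.tag for el in tree.iter() if isinstance(el.tag, str)}
    candidate = base
    counter = 0
    while candidate in existing:
        counter += 1
        candidate = f"{base}{counter}"
    return candidate


def strip_matching(tree, path):
    """Rename every element matching `path`, then strip the renamed tags."""
    deltag = unused_tag(tree)
    for el in list(tree.iterfind(path)):
        el.tag = deltag
    etree.strip_tags(tree, deltag)


# Hypothetical document that already uses the "unlikely" name.
root = etree.fromstring(
    '<document><xxyyzzdelme>keep</xxyyzzdelme>'
    '<node>Only this <span attr="foo">first <b>span</b></span> text.</node>'
    '</document>'
)
strip_matching(root, ".//span[@attr='foo']")

result = etree.tostring(root, encoding="unicode")
print(result)
```

The pre-existing xxyyzzdelme element survives because the placeholder falls back to a name with a numeric suffix.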

Answer 3

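I had the same problem. But in my case the scenario is a little easier, because I have another option: don't remove the tag, just clear it. Our users see the rendered HTML, so if I have, for example,

<div>Hello <strong>awesome</strong> World!</div>

I want to clear the strong tag matched by the CSS selector div > strong and keep its tail. In lxml you can't use strip_tags with a selector while keeping the tail; you can only strip by tag name, which drove me crazy. And if you just remove the <strong>awesome</strong> node, you also remove its tail, "World!", the text that follows the strong tag. The output would look like this:

<div>Hello</div>

For me this is OK:

<div>Hello <strong></strong> World!</div>

No more "awesome" for the user.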

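import lxml.html
import lxml.cssselect  # needs the separate cssselect package

# Sample markup from the example above
markup = '<div>Hello <strong>awesome</strong> World!</div>'

doc = lxml.html.fromstring(markup)
selector = lxml.cssselect.CSSSelector('div > strong')
for el in list(selector(doc)):
    if el.tail:
        # clear() also resets the tail, so save and restore it
        tail = el.tail
        el.clear()
        el.tail = tail
    else:
        # If there is no tail, we can safely just remove the node
        el.getparent().remove(el)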

You can adjust the code to physically delete the strong tag by calling element.remove(child) and attaching its tail to the parent, but for my case that was overkill.
