How can one replace an element with text in lxml?

It's easy to completely remove a given element from an XML document with lxml's implementation of the ElementTree API, but I can't see an easy way of consistently replacing an element with some text. For example, given the following input:

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

... you could easily remove every <r> element with:

from lxml import etree

f = etree.fromstring(data)
for r in f.xpath('//r'):
    # Note: lxml's remove() also discards the element's tail text.
    r.getparent().remove(r)
print(etree.tostring(f, pretty_print=True).decode())

However, how would you go about replacing each <r> element with text, to get the following output?

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>

It seems to me that because the ElementTree API deals with text via the .text and .tail attributes of each element rather than as nodes in the tree, you have to handle a lot of different cases depending on whether the element has sibling elements, whether the existing element had a .tail attribute, and so on. Have I missed some easy way of doing this?
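To make the .text/.tail model concrete, here is a minimal sketch using one line of the sample input. The variable names `m` and `r` are chosen only for illustration. It also shows that lxml's remove() takes the tail text away together with the element, which is why a plain removal loses " and after".

```python
from lxml import etree

m = etree.fromstring('<m>Text before <r/> and after</m>')
r = m[0]

# Text before the first child is stored on the parent.
print(repr(m.text))   # 'Text before '
# Text following <r/> is stored on <r/> itself, as its tail.
print(repr(r.tail))   # ' and after'
# <r/> is empty, so it has no text of its own.
print(repr(r.text))   # None

# Removing the element also removes its tail from the tree.
m.remove(r)
after_remove = etree.tostring(m).decode()
print(after_remove)   # <m>Text before </m>
```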

I think that unutbu's XSLT solution is probably the correct way to achieve your goal.

However, here's a somewhat hacky way to achieve it: modify the tails of the <r/> tags and then use etree.strip_elements.

from lxml import etree

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

f = etree.fromstring(data)
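for r in f.xpath('//r'):
    # Prepend the replacement to the tail so it survives the strip below.
    r.tail = 'DELETED' + (r.tail or '')

# Remove every <r> but keep its (modified) tail text in place.
etree.strip_elements(f, 'r', with_tail=False)

print(etree.tostring(f, pretty_print=True).decode())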

This gives you:

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
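The with_tail=False argument is what makes this trick work. As a small sketch on a single line of the sample input (the names `keep` and `drop` are only illustrative), compare it with the default, where the tail is stripped along with the element:

```python
from lxml import etree

xml = '<m>Text before <r/> and after</m>'

# with_tail=False: the element goes, its tail text stays in place.
keep = etree.fromstring(xml)
etree.strip_elements(keep, 'r', with_tail=False)
kept = etree.tostring(keep).decode()
print(kept)      # <m>Text before  and after</m>

# Default (with_tail=True): the tail text is removed as well.
drop = etree.fromstring(xml)
etree.strip_elements(drop, 'r')
dropped = etree.tostring(drop).decode()
print(dropped)   # <m>Text before </m>
```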

Using strip_elements has the disadvantage that you cannot keep some of the <r> elements while replacing others. It also requires an ElementTree instance, which you may not have. Finally, you cannot use it to replace XML comments or processing instructions. The following should do the job:

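for r in f.xpath('//r'):
    # r.tail is None when no text follows the element.
    text = 'DELETED' + (r.tail or '')
    parent = r.getparent()
    if parent is not None:
        previous = r.getprevious()
        if previous is not None:
            # Append to the tail of the preceding sibling.
            previous.tail = (previous.tail or '') + text
        else:
            # No preceding sibling: append to the parent's text.
            parent.text = (parent.text or '') + text
        parent.remove(r)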
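As a sketch of how this approach handles the two cases strip_elements is said not to cover, the loop can be wrapped in a helper and applied selectively. The function name `replace_with_text`, the `keep` attribute, and the sample document are hypothetical and exist only for this illustration.

```python
from lxml import etree

def replace_with_text(element, text):
    """Replace element (and its subtree) with text, preserving its tail."""
    parent = element.getparent()
    if parent is None:
        raise ValueError('cannot replace the root element with text')
    text += element.tail or ''
    previous = element.getprevious()
    if previous is not None:
        previous.tail = (previous.tail or '') + text
    else:
        parent.text = (parent.text or '') + text
    parent.remove(element)

root = etree.fromstring(
    '<m><b/> one <r keep="yes"/> two <!-- note --> three <r/> four</m>'
)

# Replace only the <r> elements that are not marked with keep="...".
for r in root.xpath('//r[not(@keep)]'):
    replace_with_text(r, 'DELETED')

# Comments are nodes too, so the same helper applies to them.
for c in root.xpath('//comment()'):
    replace_with_text(c, 'COMMENT')

result = etree.tostring(root).decode()
print(result)
# <m><b/> one <r keep="yes"/> two COMMENT three DELETED four</m>
```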

Using ET.XSLT:

import io
import lxml.etree as ET

data = '''<everything>
<m>Some text before <r/></m>
<m><r/> and some text after.</m>
<m><r/></m>
<m>Text before <r/> and after</m>
<m><b/> Text after a sibling <r/> Text before a sibling<b/></m>
</everything>
'''

f=ET.fromstring(data)
xslt='''\
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform">    

    <!-- Replace r nodes with DELETED
         http://www.w3schools.com/xsl/el_template.asp -->
    <xsl:template match="r">DELETED</xsl:template>

    <!-- How to copy XML without changes
         http://mrhaki.blogspot.com/2008/07/copy-xml-as-is-with-xslt.html -->    
    <xsl:template match="*">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>
    <xsl:template match="@*|text()|comment()|processing-instruction()">
        <xsl:copy-of select="."/>
    </xsl:template>
    </xsl:stylesheet>
'''

# BytesIO needs bytes, so encode the stylesheet string first.
xslt_doc = ET.parse(io.BytesIO(xslt.encode('utf-8')))
transform = ET.XSLT(xslt_doc)
f = transform(f)

print(ET.tostring(f).decode())

yields

<everything>
<m>Some text before DELETED</m>
<m>DELETED and some text after.</m>
<m>DELETED</m>
<m>Text before DELETED and after</m>
<m><b/> Text after a sibling DELETED Text before a sibling<b/></m>
</everything>
