How do I select and update text nodes in mixed content using lxml?

Question

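I need to check every word in every text() node of an XML file. I'm using the XPath //text() to select the text nodes and a regular expression to select the words. If a word is in a set of keywords, I need to replace it with something and update the XML.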

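Normally an element's text is set with .text, but .text on an _Element only changes the first child text node. In a mixed-content element, the other text nodes are actually the .tail of their preceding sibling.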
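To make the .text / .tail split concrete, here is a minimal sketch. The tiny `<para>` document and the variable names are made up for illustration and are not part of my real data.

```python
from lxml import etree

# Hypothetical mixed-content element: text, a child element, then more text.
para = etree.fromstring("<para>one <b>two</b> three</para>")
b = para[0]

print(repr(para.text))  # 'one '   -> text before the first child
print(repr(b.text))     # 'two'    -> text inside <b>
print(repr(b.tail))     # ' three' -> text after </b>, stored on <b>, not on <para>

# Setting .text on <para> only touches the first text node.
para.text = "ONE "
print(etree.tostring(para, encoding="unicode"))
# <para>ONE <b>two</b> three</para>
```

The trailing " three" is untouched because it lives on the `<b>` element's .tail.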

How do I update all of the text nodes?

In the simplified example below, I'm just trying to wrap the matching keywords in square brackets...

Input XML

<doc>
    <para>I think the only card she has <gotcha>is the</gotcha> Lorem card. We have so many things that we have to do
        better... and certainly ipsum is one of them. When other <gotcha>websites</gotcha> give you text, they're not
        sending the best. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of problems
        and they're <gotcha>bringing</gotcha> those problems with us. They're bringing mistakes. They're bringing
        misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>

Desired output

<doc>
    <para>I think [the] only card she has <gotcha>[is] [the]</gotcha> Lorem card. We have so many things that we have to do
        better... and certainly [ipsum] [is] one of them. When other <gotcha>websites</gotcha> give you text, they're not
        sending [the] [best]. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of [problems]
        and they're <gotcha>bringing</gotcha> those [problems] with us. They're bringing [mistakes]. They're bringing
        misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>

Answer 1

I found the key to this solution in the lxml documentation: Using XPath to find text

Specifically, the is_text and is_tail properties of _ElementUnicodeResult.

Using these properties, I can tell whether I need to update the .text or the .tail property of the parent _Element.

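This was a little tricky to understand at first, because when you call getparent() on a text node (an _ElementUnicodeResult) that is the tail of its preceding sibling (.is_tail == True), the preceding sibling is what gets returned as the parent, not the actual parent element.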
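A small sketch of that behaviour, using a made-up three-node document (the `<para>` and `<b>` names are only for illustration). It lists each text node together with the element that getparent() returns and the two flags:

```python
from lxml import etree

# Hypothetical mixed-content element with one text node of each kind.
para = etree.fromstring("<para>one <b>two</b> three</para>")

# For each text node, record its value, the tag of the element returned by
# getparent(), and whether the node is that element's .text or .tail.
rows = [
    (str(node), node.getparent().tag, bool(node.is_text), bool(node.is_tail))
    for node in para.xpath("//text()")
]

for row in rows:
    print(row)
# ('one ', 'para', True, False)
# ('two', 'b', True, False)
# (' three', 'b', False, True)
```

Note the last row: " three" sits inside `<para>` in the document, yet getparent() returns `<b>`, because the string is stored as the tail of `<b>`.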

Example...

Python

import re
from lxml import etree

xml = """<doc>
    <para>I think the only card she has <gotcha>is the</gotcha> Lorem card. We have so many things that we have to do
        better... and certainly ipsum is one of them. When other <gotcha>websites</gotcha> give you text, they're not
        sending the best. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of problems
        and they're <gotcha>bringing</gotcha> those problems with us. They're bringing mistakes. They're bringing
        misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
"""


def update_text(match, word_list):
    """Wrap the matched word in square brackets if it is a keyword."""
    if match in word_list:
        return f"[{match}]"
    return match


root = etree.fromstring(xml)

keywords = {"ipsum", "is", "the", "best", "problems", "mistakes"}

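for text in root.xpath("//text()"):
    # For a tail node, getparent() returns the preceding sibling, not the real parent.
    parent = text.getparent()
    updated_text = re.sub(r"\w+", lambda match: update_text(match.group(), keywords), text)
    # Write the result back to whichever property this text node came from.
    if text.is_text:
        parent.text = updated_text
    elif text.is_tail:
        parent.tail = updated_text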

etree.dump(root)

Output (dumped to the console)

<doc>
    <para>I think [the] only card she has <gotcha>[is] [the]</gotcha> Lorem card. We have so many things that we have to do
        better... and certainly [ipsum] [is] one of them. When other <gotcha>websites</gotcha> give you text, they're not
        sending [the] [best]. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of [problems]
        and they're <gotcha>bringing</gotcha> those [problems] with us. They're bringing [mistakes]. They're bringing
        misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
