How can I select and update text nodes in mixed content using lxml?
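I need to check every word in all the text() nodes of an XML file. I'm using the XPath //text() to select the text nodes and a regular expression to select the words. If a word exists in a set of keywords, I need to replace it with something and update the XML.
Usually the text of an element is set with .text, but .text on an _Element only changes the first child text node. In a mixed content element, the other text nodes are actually the .tail of their preceding sibling.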
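As a minimal sketch of that .text / .tail split (the two-element snippet here is a made-up example, not the document from this question), assigning to .text on the parent leaves the text after the child element untouched, because lxml stores that text on the child's .tail:

```python
from lxml import etree

# Hypothetical mixed-content snippet used only for illustration.
para = etree.fromstring("<para>one <b>two</b> three</para>")
b = para[0]

print(repr(para.text))  # text before the first child element
print(repr(b.text))     # text inside <b>
print(repr(b.tail))     # text after </b>, stored on <b>, not on <para>

# Assigning .text on the parent only replaces the first text node.
para.text = "ONE "
result = etree.tostring(para, encoding="unicode")
print(result)
```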
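How do I update all of the text nodes?
In the following simplified example, I'm just trying to wrap the matching keywords in square brackets...
Input XML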
<doc>
<para>I think the only card she has <gotcha>is the</gotcha> Lorem card. We have so many things that we have to do
better... and certainly ipsum is one of them. When other <gotcha>websites</gotcha> give you text, they're not
sending the best. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of problems
and they're <gotcha>bringing</gotcha> those problems with us. They're bringing mistakes. They're bringing
misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
Desired output
<doc>
<para>I think [the] only card she has <gotcha>[is] [the]</gotcha> Lorem card. We have so many things that we have to do
better... and certainly [ipsum] [is] one of them. When other <gotcha>websites</gotcha> give you text, they're not
sending [the] [best]. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of [problems]
and they're <gotcha>bringing</gotcha> those [problems] with us. They're bringing [mistakes]. They're bringing
misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
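I found the key to this solution in the lxml documentation: Using XPath to find text
Specifically, the is_text and is_tail properties of _ElementUnicodeResult.
Using these properties, I can tell whether I need to update the .text or the .tail property of the parent _Element.
This is a little tricky to understand at first, because when you call getparent() on a text node (_ElementUnicodeResult) that is the tail of its preceding sibling (.is_tail == True), it is the preceding sibling that is returned as the parent, not the element that actually encloses the text.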
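To make the getparent() behaviour concrete, here is a small sketch on a made-up snippet (the element names are hypothetical). It records, for each text node, whether it is a text or a tail node and which element getparent() returns:

```python
from lxml import etree

# Hypothetical mixed-content snippet used only for illustration.
para = etree.fromstring("<para>one <b>two</b> three</para>")

rows = []
for node in para.xpath("//text()"):
    owner = node.getparent()  # the element that holds this string as .text or .tail
    kind = "text" if node.is_text else "tail"
    rows.append((str(node), kind, owner.tag))

for row in rows:
    print(row)
```

The last string, " three", sits inside para in the serialized XML, yet getparent() reports b, because the string is stored as the .tail of b.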
Example...
Python
import re
from lxml import etree
xml = """<doc>
<para>I think the only card she has <gotcha>is the</gotcha> Lorem card. We have so many things that we have to do
better... and certainly ipsum is one of them. When other <gotcha>websites</gotcha> give you text, they're not
sending the best. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of problems
and they're <gotcha>bringing</gotcha> those problems with us. They're bringing mistakes. They're bringing
misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
"""
def update_text(match, word_list):
    # Wrap the word in square brackets if it is a keyword; otherwise return it unchanged.
    if match in word_list:
        return f"[{match}]"
    else:
        return match

root = etree.fromstring(xml)
keywords = {"ipsum", "is", "the", "best", "problems", "mistakes"}

for text in root.xpath("//text()"):
    # For a tail node, getparent() returns the preceding sibling, not the enclosing element.
    parent = text.getparent()
    updated_text = re.sub(r"[\w]+", lambda match: update_text(match.group(), keywords), text)
    if text.is_text:
        parent.text = updated_text
    elif text.is_tail:
        parent.tail = updated_text

etree.dump(root)
Output (dumped to console)
<doc>
<para>I think [the] only card she has <gotcha>[is] [the]</gotcha> Lorem card. We have so many things that we have to do
better... and certainly [ipsum] [is] one of them. When other <gotcha>websites</gotcha> give you text, they're not
sending [the] [best]. They're not sending you, they're <gotcha>sending words</gotcha> that have lots of [problems]
and they're <gotcha>bringing</gotcha> those [problems] with us. They're bringing [mistakes]. They're bringing
misspellings. They're typists… And some, <gotcha>I assume</gotcha>, are good words.</para>
</doc>
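For comparison, the same update can be sketched without XPath smart strings by walking the elements and rewriting each .text and .tail directly. This is a different technique from the one above, not part of the original answer, and the function name bracket_keywords is made up for illustration:

```python
import re
from lxml import etree

def bracket_keywords(root, keywords):
    """Wrap every keyword in [] in each .text and .tail under root (in place)."""
    def mark(s):
        return re.sub(
            r"\w+",
            lambda m: f"[{m.group()}]" if m.group() in keywords else m.group(),
            s,
        )

    for el in root.iter():
        # Skip the content of comments and processing instructions,
        # which //text() would not select either.
        if isinstance(el.tag, str) and el.text is not None:
            el.text = mark(el.text)
        # The tail is ordinary document text whatever kind of node owns it.
        if el.tail is not None:
            el.tail = mark(el.tail)
    return root

# Hypothetical input used only for illustration.
sample = etree.fromstring("<p>the cat <b>is</b> on the mat</p>")
bracket_keywords(sample, {"the", "is"})
result = etree.tostring(sample, encoding="unicode")
print(result)
```

One difference to keep in mind: this version also rewrites the .tail of the element passed in as root, so it is best called on the document root.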