Python: removing non-tag content from XML

Question

I want to remove (clean out) everything that is not inside XML tags, and optionally put it in a list. I have some XML like this:

<tag>some text</tag> unwanted text <tag>some text</tag>

Using Python (regular expressions), I want to get:

('<tag>some text</tag>','<tag>some text</tag>')

I tried:

cleanup = re.findall(r"^<.>.*</.>$",  input)

But I think the whole input also matches the regex. How can I fix this?

UPDATE 1:

I tried loading it:

import xml.etree.ElementTree as ET
root = ET.fromstring(str(cleanup))

Answer 1

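Just expanding on what has already been answered here, because I think the right approach is not to use regular expressions on XML-like content. You should use an XML parser. The unwanted content is called the tail, and it can be cleaned up while parsing. Here is one way to do it: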

import xml.etree.ElementTree as ET

s = '''<root><tag>some text</tag> unwanted text <tag>some text</tag></root>'''

tree = ET.fromstring(s)

cleaned_tree = []

for node in tree:
    # tail is the text that follows an element's closing tag
    node.tail = ''
    # encoding='unicode' makes tostring() return str instead of bytes on Python 3
    cleaned_tree.append(ET.tostring(node, encoding='unicode'))

print(cleaned_tree)
# ['<tag>some text</tag>', '<tag>some text</tag>']
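
The question also asked to optionally keep the removed text in a list. A minimal sketch of that, assuming Python 3 and the same sample input wrapped in a root element; the helper name strip_tails is hypothetical and not part of ElementTree:

```python
import xml.etree.ElementTree as ET


def strip_tails(xml_text):
    """Return (serialized child elements, removed tail texts)."""
    root = ET.fromstring(xml_text)
    kept, removed = [], []
    for node in root:
        # Record non-blank text that follows the element's closing tag.
        if node.tail and node.tail.strip():
            removed.append(node.tail.strip())
        node.tail = None  # drop the tail before serializing
        kept.append(ET.tostring(node, encoding="unicode"))
    return kept, removed


kept, removed = strip_tails(
    "<root><tag>some text</tag> unwanted text <tag>some text</tag></root>"
)
print(kept)
print(removed)
```

Note that this only looks at the tails of the root's direct children. Text placed before the first child lives in root.text and would need to be handled separately.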

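Side note: if you look at str(cleanup), you will see that, as in your example, it lacks an enclosing tag such as root. The fact that fromstring() fails on it may indicate a problem with your XML source.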
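
To illustrate the missing-root point, here is a small sketch assuming the input is the bare fragment from the question. Wrapping it in a synthetic root element (the name root is arbitrary) is one possible workaround, and it assumes the fragment is otherwise well-formed and carries no XML declaration:

```python
import xml.etree.ElementTree as ET

fragment = "<tag>some text</tag> unwanted text <tag>some text</tag>"

# A document must have exactly one top-level element, so the bare
# fragment is rejected by the parser.
try:
    ET.fromstring(fragment)
    parsed_directly = True
except ET.ParseError:
    parsed_directly = False

# Wrapping the fragment in a synthetic root makes it parseable.
wrapped = ET.fromstring("<root>" + fragment + "</root>")
child_tags = [child.tag for child in wrapped]

print(parsed_directly, child_tags)
```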

Solution 1
2 accepted 2015-01-18 00:03:31
