python xml.etree.ElementTree: removing an empty tag in the middle of text

Question

I have an XML document and I want to extract text from it based on tags.
The part I want to extract text from looks like this:

<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>

When I do:

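import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
# Collect the distinct tag names, then pick the one containing "BlockText"
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    # .text only holds the text up to the first child element
    texte = text.text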

I only get the part before the empty tag <TIP CONTENT=""/>.
I tried to remove this tag before fetching the rest of the text.
I did:

emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag) :
    root.remove(e)

But this does not work.
Neither <BlockText> nor <TIP> is a direct child of root.

Thank you.

Answer 1

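The text after <TIP CONTENT=""/> belongs to that element's own tail, not to the text of the BlockText tag.

elem.text is the text right after the element's opening tag. elem.tail is the text right after its closing tag. The tail is usually just whitespace, but in this case it holds actual text.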
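
To make the text/tail split concrete, here is a minimal sketch. It uses a shortened, well-formed stand-in for the snippet in the question (attribute values quoted, the literal \n dropped); the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed stand-in for the question's snippet (an assumption,
# not the asker's real file).
xml_snippet = (
    '<BlockText lang="en">It has survived not only'
    '<TIP CONTENT=""/> five centuries.</BlockText>'
)

block = ET.fromstring(xml_snippet)
tip = block.find("TIP")

print(repr(block.text))  # text between <BlockText ...> and <TIP .../>
print(repr(tip.text))    # an empty element has no text of its own
print(repr(tip.tail))    # text between <TIP .../> and </BlockText>
```

So the second half of the sentence is not lost; it is simply stored on the TIP element rather than on BlockText.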

Answer 2

OK, here is what finally worked for me:

emptyTags = list(filter(lambda i: "TIP" in i, tags))
if emptyTags:
    emptyTag = emptyTags[0]
    # The text that follows each <TIP/> is stored in its tail
    for element in root.iter(emptyTag):
        print(element.tail)

However, I still could not get the text as a whole (in the same order). I could get all the BlockText tags and all the TIP tags, but not both at the same time.

Update:
I used:

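import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    # itertext() yields the element's text plus each descendant's text and tail, in document order
    texte = ''.join(text.itertext())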
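
If the empty tags really do have to be removed from the tree (as the question first attempted), one possible approach is sketched below. ElementTree elements have no parent pointer, and remove() only works on a direct child, so the sketch visits every element as a potential parent and merges each removed element's tail into the preceding text. The helper name remove_keeping_tail and the Doc/Section wrapper elements are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def remove_keeping_tail(root, tag):
    """Remove every `tag` element below root, keeping the text in its tail.

    Hypothetical helper, not part of ElementTree.
    """
    # Snapshot the traversal so the tree can be mutated safely while looping.
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag != tag:
                previous = child
                continue
            tail = child.tail or ""
            if previous is None:
                # No earlier sibling: the tail continues the parent's text.
                parent.text = (parent.text or "") + tail
            else:
                # Otherwise it continues the previous sibling's tail.
                previous.tail = (previous.tail or "") + tail
            parent.remove(child)


# Stand-in document (an assumption): TIP is nested, not a direct child of the root.
doc = ET.fromstring(
    '<Doc><Section><BlockText ID="Bhf76">not only'
    '<TIP CONTENT=""/> five centuries</BlockText></Section></Doc>'
)
remove_keeping_tail(doc, "TIP")
block = doc.find(".//BlockText")
print(block.text)
```

For plain text extraction, ''.join(elem.itertext()) is simpler; removing the elements is only worth it when the modified tree itself is needed afterwards.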

Answer 3

Another solution, for reference only:

from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print (doc.select('BlockText'))
print (doc.select('BlockText>text()'))
print (doc.selects('BlockText>text()'))

Result:

{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']
