
python xml.etree.ElementTree remove empty tag in the middle of text

I have an XML document from which I want to extract text based on the tag.
The part I want to extract text from looks like this:

<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>

When I do

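import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
# Collect the distinct tag names, then pick the one containing "BlockText"
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = text.text  # only the text before the first child element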

I only get the part before the empty tag <TIP CONTENT=""/>.
I tried to remove this tag before fetching the rest of the text.
I did:

emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag) :
    root.remove(e)

But this does not work.
Neither <BlockText> nor <TIP> is a direct child of root.


Thank you.

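The text after <TIP CONTENT=""/> belongs to that element's own tail, not to the text of the BlockText tag.

elem.text is the text right after the element's opening tag. elem.tail is the text right after its closing tag. The tail is usually just whitespace, but in this case it holds actual text.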
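
To make the text/tail split concrete, here is a minimal sketch on a shortened copy of the sample. The shortened string and the variable names are only for illustration, and attr2 is quoted because ElementTree rejects unquoted attribute values.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed copy of the sample (attr2 quoted on purpose).
xml = ('<BlockText attr2="657" lang="en">It has survived not only'
       '<TIP CONTENT=""/> five centuries.</BlockText>')

block = ET.fromstring(xml)
tip = block.find("TIP")

before = block.text  # text between <BlockText ...> and <TIP/>
after = tip.tail     # text between <TIP/> and </BlockText>

print(repr(before))
print(repr(after))
```

The empty TIP element has no text of its own; everything that follows it up to the next tag is stored in its tail.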

OK, this is what finally worked for me:

emptyTags = list(filter(lambda i: "TIP" in i, tags))
if emptyTags:
    emptyTag = emptyTags[0]
    # The text following each <TIP/> is stored in that element's tail
    for element in root.iter(emptyTag):
        print(element.tail)

But I still cannot get the text as a whole, in the original order. I can get all the BlockText tags and all the TIP tags, but not both together.
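
If the goal really is to delete the empty tags first, one possible approach is to remove each one from its actual parent and move its tail onto whatever precedes it, since Element.remove() only accepts direct children. The helper name strip_elements and the wrapping doc element are hypothetical, and the sketch assumes the tag name matches exactly (no namespace prefix).

```python
import xml.etree.ElementTree as ET

def strip_elements(root, tag):
    """Remove every `tag` element below root while keeping its tail text."""
    # Visit every element and treat it as a potential parent, because
    # remove() must be called on the direct parent of the child.
    for parent in list(root.iter()):
        previous = None
        for child in list(parent):
            if child.tag == tag:
                tail = child.tail or ""
                if previous is None:
                    # No earlier sibling: the tail continues the parent's text.
                    parent.text = (parent.text or "") + tail
                else:
                    # Otherwise it continues the previous sibling's tail.
                    previous.tail = (previous.tail or "") + tail
                parent.remove(child)
            else:
                previous = child

doc = ET.fromstring(
    '<doc><BlockText>not only<TIP CONTENT=""/> five centuries.</BlockText></doc>'
)
strip_elements(doc, "TIP")
print(doc.find("BlockText").text)
```

After the call, the .text of BlockText holds the whole sentence, so the original loop over text.text would see it in order.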

Update:
I used:

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = ''.join(text.itertext())
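
As a self-contained check of the itertext() approach, the sketch below runs it on a shortened inline copy of the sample instead of myfile.xml. The wrapping doc element and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

xml = ('<doc><BlockText lang="en">It has survived not only'
       '<TIP CONTENT=""/> five centuries.</BlockText></doc>')
root = ET.fromstring(xml)

# itertext() yields the element's text, then each descendant's text and
# tail in document order, so joining the pieces keeps the original order.
texts = ["".join(block.itertext()) for block in root.iter("BlockText")]
print(texts)
```

Nothing has to be removed from the tree this way: the empty TIP element contributes no text, and its tail is picked up automatically.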

Another solution, for reference only:

from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="­"/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print (doc.select('BlockText'))
print (doc.select('BlockText>text()'))
print (doc.selects('BlockText>text()'))

Result:

{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']

