python xml.etree.ElementTree remove empty tag in the middle of text
I have an XML document and I want to extract text from it based on tags.
The part I want to extract text from looks like this:
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
When I do:
import xml.etree.ElementTree as ET

tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = text.text
I only get the part before the empty tag <TIP CONTENT=""/>.
I tried to remove this tag before getting the rest of the text.
I did:
emptyTag = list(filter(lambda i: "TIP" in i, tags))
for e in root.iter(emptyTag):
    root.remove(e)
But this doesn't work.
Neither <BlockText> nor <TIP> is a direct child of root.
Thank you.
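The text after <TIP CONTENT=""/> belongs to that element's own tail, not to the text of the BlockText tag.
elem.text is the text right after the element's opening tag. elem.tail is the text right after its closing tag. The tail is usually just whitespace, but in this case it holds actual text.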
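To make the text/tail split concrete, here is a minimal sketch using a shortened version of the sample from the question. It assumes attr2 is quoted (ElementTree rejects unquoted attribute values) and parses from a string instead of myfile.xml; the names xml_source, block and tip are only for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample; attr2 is quoted here, which is an adjustment to the
# data in the question so that the parser accepts it.
xml_source = (
    '<BlockText attr1="blah" attr2="657" ID="Bhf76" lang="en">'
    'It has survived not only<TIP CONTENT=""/> five centuries.'
    '</BlockText>'
)

block = ET.fromstring(xml_source)
tip = block.find("TIP")

print(repr(block.text))  # text between <BlockText ...> and <TIP .../>
print(repr(tip.text))    # None, the empty element has no text of its own
print(repr(tip.tail))    # text between <TIP .../> and </BlockText>
```

So the second half of the sentence is stored on the TIP element, which is why reading only block.text stops at the empty tag.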
OK, this is what finally worked for me:
emptyTags = list(filter(lambda i: "TIP" in i, tags))
if emptyTags:
    emptyTag = emptyTags[0]
    for element in root.iter(emptyTag):
        print(element.tail)
But I still can't get the text as a whole (in the same order). I can get all the BlockText tags and all the TIP tags, but not both together.
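One way to get the pieces in document order is to walk each BlockText element and join its text with the text and tail of every child. The following is only a sketch: the helper name text_in_order and the wrapping <doc> element are made up for the example, and the sample is shortened.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper <doc> so that BlockText is not the root element.
xml_source = (
    '<doc><BlockText ID="Bhf76">'
    'It has survived not only<TIP CONTENT=""/> five centuries.'
    '</BlockText></doc>'
)
root = ET.fromstring(xml_source)


def text_in_order(elem):
    """Join elem.text with each child's own text and tail, in document order."""
    parts = [elem.text or ""]
    for child in elem:
        parts.append(text_in_order(child))  # text inside the child, if any
        parts.append(child.tail or "")      # text that follows the child
    return "".join(parts)


for block in root.iter("BlockText"):
    print(text_in_order(block))
```

This should produce the same string as ''.join(block.itertext()), so the helper is mainly useful for seeing where each fragment comes from.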
Update:
I used:
tree = ET.parse("myfile.xml")
root = tree.getroot()
tags = list(set([elem.tag for elem in root.iter()]))
tag = list(filter(lambda i: "BlockText" in i, tags))[0]
for text in root.iter(tag):
    texte = ''.join(text.itertext())
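If the empty tag really has to be removed from the tree (as in the original attempt), note that Element.remove only works on an element's direct children, and that removing an element also discards its tail. A possible sketch, assuming a shortened sample wrapped in a made-up <doc> element and a hypothetical helper named remove_keeping_tail, moves the tail onto the preceding text before removing the element:

```python
import xml.etree.ElementTree as ET

xml_source = (
    '<doc><BlockText ID="Bhf76">'
    'It has survived not only<TIP CONTENT=""/> five centuries.'
    '</BlockText></doc>'
)
root = ET.fromstring(xml_source)


def remove_keeping_tail(root, tag):
    """Remove every `tag` element below root while keeping its tail text."""
    # Elements have no parent pointer, so build a child -> parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    for elem in list(root.iter(tag)):
        parent = parent_of.get(elem)
        if parent is None:  # the root itself cannot be removed this way
            continue
        siblings = list(parent)
        index = siblings.index(elem)
        tail = elem.tail or ""
        if index == 0:
            # No previous sibling: the tail continues the parent's text.
            parent.text = (parent.text or "") + tail
        else:
            # Otherwise it continues the previous sibling's tail.
            previous = siblings[index - 1]
            previous.tail = (previous.tail or "") + tail
        parent.remove(elem)


remove_keeping_tail(root, "TIP")
for block in root.iter("BlockText"):
    print(block.text)
```

For plain text extraction, itertext() as used above is simpler; this variant is only worth it when the cleaned tree itself is needed afterwards.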
Another solution, just for reference:
from simplified_scrapy import SimplifiedDoc
html = '''
<BlockText attr1="blah" attr2=657 ID="Bhf76" lang="en">
Simply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT=""/>\n five centuries, electronic typesetting, remaining essentially release.
</BlockText>
'''
doc = SimplifiedDoc(html)
print(doc.select('BlockText'))
print(doc.select('BlockText>text()'))
print(doc.selects('BlockText>text()'))
Result:
{'tag': 'BlockText', 'attr1': 'blah', 'attr2': '657', 'ID': 'Bhf76', 'lang': 'en', 'html': '\nSimply dummy text of the printing and typesetting industry. It has survived not only<TIP CONTENT="\xad" />\n five centuries, electronic typesetting, remaining essentially release.\n'}
Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.
['Simply dummy text of the printing and typesetting industry. It has survived not only five centuries, electronic typesetting, remaining essentially release.']