Why am I not getting the text in the XML tag? - python elementtree
How do I read all the text inside the <context>...</context> tag? And what about the <head>...</head> tag nested inside the <context> tag?
I have an XML file that looks like this:
<corpus lang="english">
<lexelt item="coach.n">
<instance id="1">
<context>I'll buy a train or <head>coach</head> ticket.</context>
</instance>
<instance id="2">
<context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
</instance>
</lexelt>
</corpus>
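However, when I run my code to read the XML text inside <context>...</context>, I only get the text up to the point where the <head> tag starts.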
import xml.etree.ElementTree as et

inputfile = "./coach.data"
root = et.parse(open(inputfile)).getroot()
instances = []
for corpus in root:
    for lexelt in corpus:
        for instance in lexelt:
            # .text only holds the text before the first child element
            instances.append(instance.text)

j = 1
for i in instances:
    print("instance " + str(j))
    print("left: " + i)
    print("\n")
    j += 1
Right now I only get the left side:
instance 1
left: I'll buy a train or
instance 2
left: A branch line train took us to Aubagne where a
The output also needs the head and the right side of the context. It should be:
instance 1
left: I'll buy a train or
head: coach
right: ticket.
instance 2
left: A branch line train took us to Aubagne where a
head: coach
right: picked us up for the journey up to the camp.
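First of all, your code has a mistake: the for corpus in root loop is unnecessary, because your root already is the corpus element.
What you probably meant to do is: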
contexts = []
for lexelt in root:
    for instance in lexelt:
        for context in instance:
            contexts.append(context.text)
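To check that the root really is the corpus element, you can inspect its tag. The following is only a minimal sketch in Python 3 syntax; it parses an inline copy of the first sample instance (the SAMPLE string is an assumption, used so the snippet runs without ./coach.data).

```python
import xml.etree.ElementTree as ET

# Inline copy of part of the sample data, used instead of ./coach.data (assumption).
SAMPLE = """<corpus lang="english">
<lexelt item="coach.n">
<instance id="1">
<context>I'll buy a train or <head>coach</head> ticket.</context>
</instance>
</lexelt>
</corpus>"""

root = ET.fromstring(SAMPLE)
print(root.tag)                        # tag of the root element
print([child.tag for child in root])   # tags of its direct children

# Collect the leading text of every <context> element.
contexts = [context.text
            for lexelt in root
            for instance in lexelt
            for context in instance]
print(contexts)
```

Since the root is <corpus>, iterating over it yields the <lexelt> elements directly, so three nested loops are enough to reach each <context>.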
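Now, regarding your question: inside the for context in instance block, you can access the other two strings you need:
The head text is available as context.find('head').text
The text to the right of the head element is available as context.find('head').tail
The documentation describes the tail attribute as follows:
tail: This attribute can be used to hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file, the attribute will contain any text found after the element's end tag and before the next tag.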
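Putting text, find('head').text and find('head').tail together gives the left/head/right split the question asks for. This is a sketch under a few assumptions: Python 3 syntax, an inline SAMPLE string in place of ./coach.data, exactly one <head> per <context>, and a hypothetical helper named split_context that is not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample data, used instead of ./coach.data (assumption).
SAMPLE = """<corpus lang="english">
<lexelt item="coach.n">
<instance id="1">
<context>I'll buy a train or <head>coach</head> ticket.</context>
</instance>
<instance id="2">
<context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
</instance>
</lexelt>
</corpus>"""


def split_context(context):
    """Hypothetical helper: return (left, head, right) for one <context>.

    Assumes the <context> element contains exactly one <head> child.
    """
    head = context.find('head')
    left = (context.text or '').strip()   # text before <head>
    word = (head.text or '').strip()      # text inside <head>
    right = (head.tail or '').strip()     # text after </head>
    return left, word, right


root = ET.fromstring(SAMPLE)
results = []
for instance in root.iter('instance'):
    context = instance.find('context')
    results.append((instance.get('id'),) + split_context(context))

for inst_id, left, head, right in results:
    print("instance " + inst_id)
    print("left: " + left)
    print("head: " + head)
    print("right: " + right)
```

The `or ''` guards are there because text and tail are None when an element has no text in that position.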
In ElementTree you have to take the tail attribute of the child nodes into account. Also, corpus is your root.
import xml.etree.ElementTree as et

inputfile = "./coach.data"
corpus = et.parse(open(inputfile)).getroot()

def getalltext(elem):
    # text and tail can be None, so fall back to an empty string
    return (elem.text or '') + ''.join(
        [getalltext(child) + (child.tail or '') for child in elem])

instances = []
for lexelt in corpus:
    for instance in lexelt:
        instances.append(getalltext(instance))

j = 1
for i in instances:
    print("instance " + str(j))
    # getalltext returns the whole text of the instance, not just the left part
    print("text: " + i)
    print("\n")
    j += 1
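If you only need the full text of each context, the standard library's Element.itertext() method walks an element's text and all nested text and tails in document order, so a recursive helper may not be necessary. A minimal sketch, assuming Python 3 and an inline SAMPLE string in place of ./coach.data:

```python
import xml.etree.ElementTree as ET

# Inline copy of the sample data, used instead of ./coach.data (assumption).
SAMPLE = """<corpus lang="english">
<lexelt item="coach.n">
<instance id="1">
<context>I'll buy a train or <head>coach</head> ticket.</context>
</instance>
<instance id="2">
<context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
</instance>
</lexelt>
</corpus>"""

root = ET.fromstring(SAMPLE)

# itertext() yields the element's text, each descendant's text, and each tail.
texts = [''.join(context.itertext()) for context in root.iter('context')]

for number, text in enumerate(texts, start=1):
    print("instance %d" % number)
    print("text: " + text)
```

This joins the left part, the head word, and the right part into one string; use text and tail separately when the three pieces must stay apart.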