Why am I not getting the text in the XML tag? - python elementtree

How do I read all of the text inside the <context>...</context> tag? And what about the <head>...</head> tag nested inside the <context> tag?

I have an XML file that looks like this:

<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>

But when I run my code to read the XML text in <context>...</context>, I only get the text up to the point where the <head> tag starts.

import xml.etree.ElementTree as et
inputfile = "./coach.data"
root = et.parse(open(inputfile)).getroot()
instances = []

for corpus in root:
    for lexelt in corpus:
      for instance in lexelt:
        instances.append(instance.text)

j=1
for i in instances:
    print("instance " + str(j))
    print("left: " + i)
    print("\n")
    j+=1

Right now I only get the left side:

instance 1
left: I'll buy a train or 

instance 2
left: A branch line train took us to Aubagne where a 

The output also needs the head and the right side of the context. It should be:

instance 1
left: I'll buy a train or 
head: coach
right:   ticket.

instance 2
left: A branch line train took us to Aubagne where a 
head: coach
right:  picked us up for the journey up to the camp.

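First, your code has a mistake. The for corpus in root loop is not needed, because your root already is the corpus element. Your loop variables are therefore each one level off: the one named instance actually holds a <context> element.

What you probably meant to do is: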

contexts = []
for lexelt in root:          # root is the <corpus> element
    for instance in lexelt:
        for context in instance:
            contexts.append(context.text)
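
A quick way to check this is to look at the tag of the root element. The sketch below is only an illustration: it parses the sample from the question out of an inline string (an assumption, so that the snippet runs without ./coach.data), and it also uses Element.iter, which reaches every <context> element without nested loops.

```python
import xml.etree.ElementTree as et

# Sample data from the question, inlined so the snippet runs on its own.
SAMPLE = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>"""

root = et.fromstring(SAMPLE)
root_tag = root.tag  # the root already is the <corpus> element

# iter() walks the whole subtree, so no nested loops are needed.
# context.text still only holds the text before the first child tag.
contexts = [context.text for context in root.iter("context")]

print(root_tag)
print(contexts)
```

Note that context.text still stops at the <head> tag, which is the behaviour described in the question.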

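Now, regarding your question: inside the for context in instance block, you can access the other two strings you need:

  1. The head text can be accessed with context.find('head').text
  2. The text to the right of the head element can be read with context.find('head').tail, according to the Python etree docs: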

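The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but can be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element's end tag and before the next tag.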
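
Putting text, find and tail together, the left/head/right split asked for in the question could look roughly like this. It is a minimal sketch, not a definitive implementation: split_context is a hypothetical helper name, the XML is inlined instead of read from ./coach.data, and it assumes every <context> has exactly one <head> child.

```python
import xml.etree.ElementTree as et

# One instance from the question, inlined so the snippet runs on its own.
SAMPLE = """<instance id="1">
    <context>I'll buy a train or <head>coach</head> ticket.</context>
</instance>"""


def split_context(context):
    """Return (left, head, right) for a <context> with one <head> child.

    Hypothetical helper for illustration; it assumes <head> is present.
    """
    head = context.find("head")
    left = context.text or ""   # text before the first child tag
    word = head.text or ""      # text inside <head>...</head>
    right = head.tail or ""     # text after </head>, up to the next tag
    return left, word, right


instance = et.fromstring(SAMPLE)
left, word, right = split_context(instance.find("context"))

print("left: " + left)
print("head: " + word)
print("right: " + right)
```

The `or ""` fallbacks are there because text and tail are None when there is no text at that position.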

In ElementTree you have to take the tail attribute of the child nodes into account. Also, corpus is your root.

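import xml.etree.ElementTree as et

inputfile = "./coach.data"
corpus = et.parse(inputfile).getroot()  # the root element is <corpus>

def getalltext(elem):
    # text and tail are None when empty, so fall back to ''
    return (elem.text or '') + ''.join(
        getalltext(child) + (child.tail or '') for child in elem)

instances = []
for lexelt in corpus:
    for instance in lexelt:
        # strip the indentation whitespace around <context>
        instances.append(getalltext(instance).strip())

for j, text in enumerate(instances, 1):
    print("instance " + str(j))
    print("text: " + text)  # the full sentence, not only the left part
    print("\n")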
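
As an alternative to a hand-written recursive helper, the standard library's Element.itertext yields the same pieces of text in document order. The sketch below is illustrative only; it inlines one <context> element from the question rather than reading ./coach.data.

```python
import xml.etree.ElementTree as et

# One <context> element from the question, inlined for a self-contained run.
SAMPLE = "<context>I'll buy a train or <head>coach</head> ticket.</context>"

context = et.fromstring(SAMPLE)

# itertext() yields the element's text, then each descendant's text and
# tail, in document order; the element's own tail is not included.
pieces = list(context.itertext())
full_text = "".join(pieces)

print(pieces)
print(full_text)
```

This gives the whole sentence in one string; when the left, head and right parts are needed separately, the text and tail attributes are still the way to go.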
