Why am I not getting the text in an XML tag? - python elementtree

Question

How do I read all the text inside the <context>...</context> tag? And what about the <head>...</head> tag inside the <context> tag?

I have an XML file that looks like this:

<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>

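However, when I run my code to read the XML text inside <context>...</context>, I only get the text up to the point where the <head> tag begins.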

import xml.etree.ElementTree as et    
inputfile = "./coach.data"    
root = et.parse(open(inputfile)).getroot()
instances = []

for corpus in root:
    for lexelt in corpus:
      for instance in lexelt:
        instances.append(instance.text)

j = 1
for i in instances:
    print("instance " + str(j))  # j is an int, so convert it before concatenating
    print("left: " + i)
    print("\n")
    j += 1

Right now I only get the left side:

instance 1
left: I'll buy a train or 

instance 2
left: A branch line train took us to Aubagne where a

The output also needs the head and the right side of the context. It should be:

instance 1
left: I'll buy a train or 
head: coach
right:   ticket.

instance 2
left: A branch line train took us to Aubagne where a 
head: coach
right:  picked us up for the journey up to the camp.

Answer 1

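First of all, there is a mistake in your code. The for corpus in root loop is not necessary, because your root already is the corpus element.

What you probably meant to do is: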

contexts = []
for lexelt in root:
  for instance in lexelt:
    for context in instance:
      contexts.append(context.text)
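
To check which element each loop level actually visits, it can help to print the tags. The following is only a sketch in Python 3 syntax: it parses a shortened copy of the sample from the question as an inline string, and the names xml_data and tags are made up for illustration.

```python
import xml.etree.ElementTree as et

# Shortened sample from the question, inlined so the snippet runs on its own.
xml_data = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
    </lexelt>
</corpus>"""

root = et.fromstring(xml_data)

# Record the tag seen at each nesting level, starting from the root itself.
tags = [root.tag]
for lexelt in root:
    tags.append(lexelt.tag)
    for instance in lexelt:
        tags.append(instance.tag)
        for context in instance:
            tags.append(context.tag)

print(tags)  # ['corpus', 'lexelt', 'instance', 'context']
```

The root is already corpus, so three nested loops starting from root end at the context elements, not at the instance elements.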

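Now, as for your question: inside the for context in instance block, you have access to the other two strings you need:

The head text can be accessed via context.find('head').text
The text to the right of the head element can be read via context.find('head').tail. According to the Python etree docs:

The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element's end tag and before the next tag.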
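
Putting text, find('head').text and find('head').tail together, a minimal sketch of the left/head/right split could look like this. It uses Python 3 syntax and assumes every context contains exactly one head element. The helper split_context and the variables xml_data and results are hypothetical names, and the XML is inlined instead of being read from ./coach.data.

```python
import xml.etree.ElementTree as et

# Sample data from the question, inlined so the snippet runs on its own.
xml_data = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>"""


def split_context(context):
    """Return (left, head, right) for a <context> holding one <head> child."""
    head = context.find('head')
    # text and tail are None when there is no text, so fall back to ''.
    left = context.text or ''
    word = head.text or ''
    right = head.tail or ''
    return left, word, right


root = et.fromstring(xml_data)
results = []
for instance in root.iter('instance'):
    context = instance.find('context')
    results.append((instance.get('id'),) + split_context(context))

for inst_id, left, word, right in results:
    print("instance " + inst_id)
    print("left: " + left)
    print("head: " + word)
    print("right: " + right)
    print()
```

The instance number is taken from the id attribute here, which avoids keeping a separate counter.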

Answer 2

In ElementTree you have to take the tail attribute of the child nodes into account. Also, corpus is your root.

import xml.etree.ElementTree as et

inputfile = "./coach.data"
corpus = et.parse(inputfile).getroot()

def getalltext(elem):
    # text and tail are None when empty, so fall back to '' before concatenating
    return (elem.text or '') + ''.join(
        getalltext(child) + (child.tail or '') for child in elem)

instances = []
for lexelt in corpus:
    for instance in lexelt:
        instances.append(getalltext(instance))

j = 1
for i in instances:
    print("instance " + str(j))
    print("left: " + i)  # i holds the instance's full text, head included
    print("\n")
    j += 1
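
As a possible alternative to a hand-written recursive helper, ElementTree elements also provide itertext(), which walks the element and its descendants in document order and yields their text and tail strings. A small sketch in Python 3 syntax, using a single inlined context element (xml_data and full_text are illustrative names):

```python
import xml.etree.ElementTree as et

# One <context> element from the question, inlined for a self-contained run.
xml_data = "<context>I'll buy a train or <head>coach</head> ticket.</context>"

context = et.fromstring(xml_data)

# itertext() yields the element's own text, then each descendant's text
# followed by that descendant's tail, in document order.
full_text = ''.join(context.itertext())

print(full_text)  # I'll buy a train or coach ticket.
```

This gives the joined text only. If the left, head and right parts are needed separately, text and tail still have to be read individually.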

Answer 1
2 Accepted 2012-09-22 07:34:00

Answer 2
1 2012-09-22 08:04:47
