Why am I not getting the text in the XML tag? - python elementtree

How do I read all the text within the <context>...</context> tag? And what about the <head>...</head> tag inside the <context> tag?

I have an XML file that looks like this:

<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>

But when I run my code to read the text within the <context>...</context> tag, I only get the text up to the <head> tag.

import xml.etree.ElementTree as et    
inputfile = "./coach.data"    
root = et.parse(open(inputfile)).getroot()
instances = []

for corpus in root:
    for lexelt in corpus:
      for instance in lexelt:
        instances.append(instance.text)

j = 1
for i in instances:
    # j is an int, so convert it before concatenating
    print("instance " + str(j))
    print("left: " + i)
    print("\n")
    j += 1

Now I'm just getting the left side:

instance 1
left: I'll buy a train or 

instance 2
left: A branch line train took us to Aubagne where a 

The output also needs the head and the right side of the context. It should be:

instance 1
left: I'll buy a train or 
head: coach
right:   ticket.

instance 2
left: A branch line train took us to Aubagne where a 
head: coach
right:  picked us up for the journey up to the camp.

First of all, you have a mistake in your code. The for corpus in root loop is not necessary, because your root is already the corpus element.

What you probably meant to do was:

contexts = []
for lexelt in root:
  for instance in lexelt:
    for context in instance:
      contexts.append(context.text)
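
A quick way to confirm that the root is already the corpus element is to check its tag. The sketch below is only an illustration: it parses an inline, shortened copy of the sample XML (standing in for ./coach.data, which is an assumption) and uses Python 3 print() calls. It also uses Element.iter('context'), a standard ElementTree method that walks all descendants, so the nested loops are optional.

```python
import xml.etree.ElementTree as ET

# Inline, shortened copy of the question's XML (assumption: stands in for ./coach.data)
SAMPLE = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
    </lexelt>
</corpus>"""

root = ET.fromstring(SAMPLE)

# The root element is <corpus> itself; its direct children are <lexelt> elements
root_tag = root.tag
child_tags = [child.tag for child in root]

# iter() visits every matching descendant, so no nested loops are needed
contexts = [context.text for context in root.iter("context")]

print(root_tag)
print(child_tags)
print(contexts)
```

Note that context.text still holds only the text before the first child element, which is why the right-hand side is missing at this point.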

Now, regarding your question: inside the for context in instance block, you can access the other two strings you need:

  1. The head text can be read from context.find('head').text
  2. The text to the right of your head element can be read from context.find('head').tail

According to the Python etree docs:

The tail attribute can be used to hold additional data associated with the element. This attribute is usually a string but may be any application-specific object. If the element is created from an XML file the attribute will contain any text found after the element's end tag and before the next tag.
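
Putting text, find('head').text and find('head').tail together, the three pieces could be collected roughly as follows. This is a minimal sketch in Python 3, assuming the sample XML from the question is available as an inline string; split_context is a hypothetical helper name, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

# Inline copy of the question's XML (assumption: stands in for ./coach.data)
SAMPLE = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
        <instance id="2">
            <context>A branch line train took us to Aubagne where a <head>coach</head> picked us up for the journey up to the camp.</context>
        </instance>
    </lexelt>
</corpus>"""


def split_context(context):
    """Hypothetical helper: return (left, head, right) for one <context> element."""
    left = context.text or ""          # text before the first child element
    head = context.find("head")
    if head is None:                   # no <head> child: everything is "left"
        return left, "", ""
    # head.text is inside <head>; head.tail is the text after </head>
    return left, head.text or "", head.tail or ""


root = ET.fromstring(SAMPLE)
parts = [split_context(context) for context in root.iter("context")]

for number, (left, head, right) in enumerate(parts, 1):
    print("instance " + str(number))
    print("left: " + left)
    print("head: " + head)
    print("right: " + right)
    print()
```

The `or ""` fallbacks are there because text and tail are None when an element has no text in that position.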

Within ElementTree you will have to consider the tail property of child nodes. Also, corpus is the root in your case.

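import xml.etree.ElementTree as et

inputfile = "./coach.data"
corpus = et.parse(inputfile).getroot()

def getalltext(elem):
    # text and tail are None when empty, so fall back to ''
    text = elem.text or ''
    for child in elem:
        # a child's tail is the text that follows its end tag
        text += getalltext(child) + (child.tail or '')
    return text

instances = []
for lexelt in corpus:
    for instance in lexelt:
        instances.append(getalltext(instance))

for j, text in enumerate(instances, 1):
    print("instance " + str(j))
    # strip the indentation whitespace that surrounds <context>
    print("text: " + text.strip())
    print("\n")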
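
ElementTree also provides Element.itertext(), which yields an element's own text together with the text and tail of every descendant in document order, so it can stand in for a hand-written recursive helper. A minimal Python 3 sketch, assuming a shortened inline copy of the question's XML in place of ./coach.data:

```python
import xml.etree.ElementTree as ET

# Inline, shortened copy of the question's XML (assumption: stands in for ./coach.data)
SAMPLE = """<corpus lang="english">
    <lexelt item="coach.n">
        <instance id="1">
            <context>I'll buy a train or <head>coach</head> ticket.</context>
        </instance>
    </lexelt>
</corpus>"""

root = ET.fromstring(SAMPLE)

# itertext() walks the text and tail strings inside each <context> in order
full_texts = ["".join(context.itertext()) for context in root.iter("context")]

for number, text in enumerate(full_texts, 1):
    print("instance " + str(number))
    print("text: " + text)
```

This gives the whole sentence as one string. When the left, head and right parts are needed separately, text and tail still have to be read individually.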
