XML parsing for this specific XML

Question

    <instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is ,  and where I can get one ?  We suspect you had seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly ,  you should n't have to bend your back during general digging ,  although it wo n't lift out the soil and put in a barrow if you need to move it !  If gardening tends to give you backache ,  remember to take plenty of rest periods during the day ,  and never try to lift more than you can easily cope with .  
</context>
</instance>

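I want to extract all of the text inside <context>. This is what I have so far. stuff.text only prints the text before <head></head> (i.e. "Do you know ... step on to"), but I don't know how to extract the second half that comes after </head> (i.e. "it . Used ... easily cope with .").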

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        # only prints the text that precedes the first child element (<head>)
        print(stuff.text)

Answer 1

If using BeautifulSoup is an option, this is trivial:

import bs4
xtxt = '''        <instance id="activate.v.bnc.00024693" docsrc="BNC">
    <answer instance="activate.v.bnc.00024693" senseid="38201"/>
    <context>
    Do you know what it is ,  and where I can get one ?  We suspect you had seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly ,  you should n't have to bend your back during general digging ,  although it wo n't lift out the soil and put in a barrow if you need to move it !  If gardening tends to give you backache ,  remember to take plenty of rest periods during the day ,  and never try to lift more than you can easily cope with .  
    </context>
    </instance>'''
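# naming the parser explicitly avoids bs4's "no parser was explicitly specified" warning
soup = bs4.BeautifulSoup(xtxt, 'html.parser')
print(soup.find('context').text)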

This gives:

Do you know what it is ,  and where I can get one ?  We suspect you had
seen the Terrex Autospade ,  which is made by Wolf Tools .  It is quite 
a hefty spade , with bicycle - type handlebars and a sprung lever at the 
rear , which you step on to activate it . Used correctly ,  you should n't 
have to bend your back during general digging ,  although it wo n't lift 
out the soil and put in a barrow if you need to move it !  If gardening 
tends to give you backache ,  remember to take plenty of rest periods 
during the day ,  and never try to lift more than you can easily cope 
with .

If you prefer to stay with ElementTree, use itertext() to walk through all of the text:

import os
import xml.etree.ElementTree as et

tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()

for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        # itertext() yields the element's text, its children's text and their tails, in order
        print(''.join(stuff.itertext()))

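If you are sure the XML file is well-formed, ElementTree is enough, and because it is part of the Python standard library you will have no external dependencies. If the XML is not well-formed, however, BeautifulSoup is quite good at coping with minor errors.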
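To see why stuff.text stops at <head>, it may help to look at how ElementTree stores mixed content: an element's text attribute holds only the text before its first child, and the text that follows a child's closing tag is stored on that child as its tail. The sketch below uses a shortened, made-up sample string (not the real train.xml) with the same shape as the <context> element in the question.

```python
import xml.etree.ElementTree as et

# Shortened, made-up sample with the same shape as <context> in the question.
xml_text = "<context>step on to <head>activate</head> it . Used correctly</context>"
context = et.fromstring(xml_text)
head = context.find("head")

before = context.text   # text before the first child element
inside = head.text      # text inside <head>
after = head.tail       # text after </head>, stored on the child as its tail

# itertext() yields text, child text and tails in document order.
full = "".join(context.itertext())

print(repr(before))
print(repr(inside))
print(repr(after))
print(full)
```

So the "missing" second half was never on the <context> element itself; it is the tail of <head>, which is why itertext() recovers it.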

Answer 2

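You can also serialize the element. There are two options:

Keep the inner <head></head> tags.
Return only the text, without any tags.

If you serialize with the tags, you can remove the outer <context></context> tags manually: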

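# convert the element to a string (Python 3: encoding='unicode' returns str, not bytes)
xml = et.tostring(stuff, encoding='unicode').strip()
# drop the outer <context></context> tags by slicing; lstrip/rstrip would strip
# a set of characters rather than a prefix/suffix and could eat into the text
print(xml[len('<context>'):-len('</context>')])
# read only the text, without any tags
print(et.tostring(stuff, encoding='unicode', method='text'))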
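A self-contained sketch of both options, assuming Python 3 and a shortened, made-up sample string rather than the real file. It removes the outer tags by slicing, because str.lstrip and str.rstrip remove any of the given characters rather than an exact prefix or suffix; the slice assumes <context> carries no attributes.

```python
import xml.etree.ElementTree as et

# Shortened, made-up sample with the same shape as <context> in the question.
context = et.fromstring("<context>step on to <head>activate</head> it .</context>")

# Option 1: keep the inner <head></head> tags.
serialized = et.tostring(context, encoding="unicode")
open_tag, close_tag = "<context>", "</context>"
# Slice the outer tags off; assumes <context> has no attributes.
inner_xml = serialized[len(open_tag):-len(close_tag)]

# Option 2: text only, with every tag dropped.
text_only = et.tostring(context, encoding="unicode", method="text")

# Why not lstrip: it strips a *set of characters*, so it can eat real text too.
over_stripped = "<context>text".lstrip("<context>")

print(inner_xml)
print(text_only)
print(repr(over_stripped))
```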
