XML parsing for this specific XML
<instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with .
</context>
</instance>
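I want to extract all of the text inside it. This is what I have so far. stuff.text only prints the text before <head></head> (i.e. "Do you know ... step on to"), but I don't know how to extract the second half after </head> (i.e. "it . Used correctly ... easily cope with .").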
import os
import xml.etree.ElementTree as et
tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()
for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        # .text holds only the text before the first child element (<head>)
        print(stuff.text)
If using BeautifulSoup is an option, this is trivial:
import bs4
xtxt = ''' <instance id="activate.v.bnc.00024693" docsrc="BNC">
<answer instance="activate.v.bnc.00024693" senseid="38201"/>
<context>
Do you know what it is , and where I can get one ? We suspect you had seen the Terrex Autospade , which is made by Wolf Tools . It is quite a hefty spade , with bicycle - type handlebars and a sprung lever at the rear , which you step on to <head>activate</head> it . Used correctly , you should n't have to bend your back during general digging , although it wo n't lift out the soil and put in a barrow if you need to move it ! If gardening tends to give you backache , remember to take plenty of rest periods during the day , and never try to lift more than you can easily cope with .
</context>
</instance>'''
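# name the parser explicitly so the result does not depend on which parsers are installed
soup = bs4.BeautifulSoup(xtxt, 'html.parser')
print(soup.find('context').text)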
This gives:
Do you know what it is , and where I can get one ? We suspect you had
seen the Terrex Autospade , which is made by Wolf Tools . It is quite
a hefty spade , with bicycle - type handlebars and a sprung lever at the
rear , which you step on to activate it . Used correctly , you should n't
have to bend your back during general digging , although it wo n't lift
out the soil and put in a barrow if you need to move it ! If gardening
tends to give you backache , remember to take plenty of rest periods
during the day , and never try to lift more than you can easily cope
with .
If you prefer to use ElementTree, you should use itertext to process all of the text:
import os
import xml.etree.ElementTree as et
tree = et.parse(os.getcwd() + "/../data/train.xml")
instance = tree.getroot()
for stuff in instance:
    if stuff.tag == "answer":
        print("the correct answer is %s" % stuff.get('senseid'))
    if stuff.tag == "context":
        print(dir(stuff))
        # itertext() yields the element's text, its descendants' text and their tails
        print(''.join(stuff.itertext()))
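If you are sure the XML file is well-formed, ElementTree is sufficient, and because it is part of the Python standard library you will have no external dependencies. If the XML is not well-formed, however, BeautifulSoup copes well with minor errors.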
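The reason stuff.text stops at <head> is that ElementTree stores the text following a child element on that child's tail attribute, not on the parent. Below is a minimal sketch of this, assuming Python 3 and a shortened inline copy of the sample (the SAMPLE string and the variable names are illustrative, not from the original post):

```python
import xml.etree.ElementTree as et

# Shortened copy of the sample instance (assumption: same structure as train.xml).
SAMPLE = (
    '<instance id="activate.v.bnc.00024693" docsrc="BNC">'
    '<answer instance="activate.v.bnc.00024693" senseid="38201"/>'
    '<context>which you step on to <head>activate</head> it . Used correctly</context>'
    '</instance>'
)

instance = et.fromstring(SAMPLE)
context = instance.find('context')
head = context.find('head')

before = context.text   # text before the first child element
inside = head.text      # text inside <head>
after = head.tail       # text following </head>, stored on the child
full_text = ''.join(context.itertext())  # all three pieces, in document order

print(repr(before))
print(repr(inside))
print(repr(after))
print(full_text)
```

Joining context.text, head.text and head.tail by hand works for this one file, but itertext() handles any number of nested children.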
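You can use element serialization. There are two options:
- keep the inner <head></head> tags,
- return only the text, without any tags.
If you serialize with tags, you can remove the outer <context></context> tags manually: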
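# convert element to string and remove the outer <context></context> tags
# (slice by length: lstrip/rstrip strip character sets, not prefixes or suffixes)
xml_str = et.tostring(stuff).decode('ascii').strip()
print(xml_str[len('<context>'):-len('</context>')])
# read only text without any tags
print(et.tostring(stuff, method='text').decode('ascii'))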
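A variant of the first option that avoids string slicing is to rebuild the inner markup from the element's own text and its serialized children. This is only a sketch, assuming Python 3 (where encoding='unicode' makes tostring return str); inner_xml and plain_text are hypothetical helper names, not part of ElementTree:

```python
import xml.etree.ElementTree as et
from xml.sax.saxutils import escape

def inner_xml(element):
    """Return the markup inside `element`, without its own start and end tags."""
    # Escape the leading text so that characters such as & stay valid markup.
    parts = [escape(element.text or '')]
    for child in element:
        # tostring() serializes the child and, by default, its tail text as well.
        parts.append(et.tostring(child, encoding='unicode'))
    return ''.join(parts)

def plain_text(element):
    """Return all text inside `element` with every tag removed."""
    return ''.join(element.itertext())

context = et.fromstring('<context>step on to <head>activate</head> it .</context>')
with_tags = inner_xml(context)
text_only = plain_text(context)
print(with_tags)
print(text_only)
```

Because the outer tag is never serialized, this does not depend on the <context> start tag having no attributes.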