Editing attributes of multiple XML docs
Here's one of the XML docs I'm working with:
<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
<extent>
<charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
</extent>
</span><span type="sentence">
<extent>
<charseq START="205" END="310">" The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
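I'm iterating through a set of XML docs to retrieve all the sentences that start with a space. I can catch all of the errors (leading spaces) without any trouble: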
>>> import re, os, sys
>>> import xml.etree.ElementTree as etree
>>> sentences = {}
>>> xmlAddresses = getListOfFilesInFolders(['XMLFiles'], ending=u'.xml')  # my function to grab all XML files
>>> for docAddr in xmlAddresses:
...     parser = etree.XMLParser(encoding=u'utf-8')
...     tree = etree.parse(docAddr, parser=parser)
...     sentences = getTokenTextFeature(docAddr, tree, sentences)
...
>>> rgxLeadingSpace = re.compile(r'^"? .')
>>> for sent in sentences.keys():
...     text = sentences[sent]['sentence']
...     if rgxLeadingSpace.findall(text):
...         print(text)  # the second sentence is from the above XML doc
...
" It rallied on ideas the market was oversold , " a trader said .
" The result of the second year-half is expected to improve on the early part of the year , " Atria said .
" The head of state 's holiday has only just begun , " the agency quoted Sergei Yastrzhembsky as saying , adding that the president was currently in a Kremlin residence near Moscow .
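What I need to do, after finding the errors, is go through all the XML files that contain them and adjust their START attributes. For example, here's a sentence from the XML doc above that contains a leading space: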
<charseq START="205" END="310">" The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
It should look like this:
<charseq START="207" END="310">The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
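The expected values can be checked by hand. In both sample elements the offsets appear to be inclusive, so the text length equals END - START + 1 (31 characters for the first sentence, 106 for the second). That is inferred from the two samples only, not from any documentation of the format. Dropping the two leading characters (the quote and the space) therefore moves START from 205 to 207 and leaves END alone. A minimal sketch of that check, where `span_is_consistent` is a hypothetical helper name:

```python
def span_is_consistent(start, end, text):
    """Return True if text exactly fills the inclusive range [start, end].

    Assumption: END is inclusive, as the two sample elements suggest.
    """
    return len(text) == end - start + 1


original = ('" The result of the second year-half is expected to improve '
            'on the early part of the year , " Atria said .')
prefix = '" '                    # the leading quote and space to drop
fixed = original[len(prefix):]
new_start = 205 + len(prefix)    # START moves forward by the dropped characters

print(span_is_consistent(205, 310, original))
print(new_start)
print(span_is_consistent(new_start, 310, fixed))
```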
I think I've provided all the necessary code. If anyone can help me out, I'll create a million StackOverflow accounts and upvote you a million times! :) Thanks!
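The approach I'd take is this: rather than extracting the sentences and searching for matches in a separate array, check each sentence element against your pattern as you traverse the nodes of the DOM. That way, when you find a match, you already hold the element object and can modify its START attribute directly. Then simply write the modified DOM out to a new (or replacement) XML file.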
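A minimal sketch of that approach, assuming every file has the `charseq` layout shown in the question and that overwriting files in place is acceptable. The names `fix_file` and `fix_folder` are hypothetical, and `os.walk` stands in for the asker's own `getListOfFilesInFolders`:

```python
import os
import re
import xml.etree.ElementTree as etree

# An optional double quote followed by one or more spaces at the start of
# the text. DOTALL keeps text that contains newlines in one piece.
LEADING = re.compile(r'^("? +)(.*)', re.DOTALL)


def fix_file(path, out_path=None):
    """Trim leading quote/space prefixes in one file; return how many were fixed."""
    tree = etree.parse(path)
    fixed = 0
    for charseq in tree.getroot().iter('charseq'):
        start = charseq.get('START')
        match = LEADING.match(charseq.text or '')
        if start is None or not match:
            continue
        prefix, text = match.groups()
        # Shift START forward by the number of characters removed.
        charseq.set('START', str(int(start) + len(prefix)))
        charseq.text = text
        fixed += 1
    if fixed:
        # Only rewrite files that actually changed.
        tree.write(out_path or path, encoding='utf-8', xml_declaration=True)
    return fixed


def fix_folder(folder):
    """Apply fix_file to every .xml file under folder; return {path: count} for changed files."""
    changed = {}
    for dirpath, _dirs, filenames in os.walk(folder):
        for name in filenames:
            if name.endswith('.xml'):
                path = os.path.join(dirpath, name)
                count = fix_file(path)
                if count:
                    changed[path] = count
    return changed
```

Passing `out_path` to `fix_file` writes the result to a new file and leaves the original untouched. Because START only advances when a prefix is actually removed, running the function a second time on the same file changes nothing.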
I don't know what getTokenTextFeature does, but here's a program that modifies the XML the way you asked.
xml='''<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
<extent>
<charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
</extent>
</span><span type="sentence">
<extent>
<charseq START="205" END="310">" The result of the second year-half is expected to improve on the early part of the year , " Atria said .</charseq>
</extent></span></document>
'''
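import re
import xml.etree.ElementTree as etree

root = etree.XML(xml)
# Visit every sentence-level charseq element that carries a START attribute.
for charseq in root.findall(".//span[@type='sentence']/extent/charseq[@START]"):
    # Group 1: optional double quote plus leading spaces; group 2: the rest.
    match = re.match(r'^("? +)(.*)', charseq.text or '')
    if match:
        space, text = match.groups()
        # Shift START forward by the number of characters removed.
        charseq.set('START', str(int(charseq.get('START')) + len(space)))
        charseq.text = text
print(etree.tostring(root).decode())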
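To see why the length of the first group is the right amount to add to START, it may help to look at what the two groups of that pattern capture for a few inputs. This is only an illustration; the sample strings are made up or shortened from the sentences in the question:

```python
import re

pattern = re.compile(r'^("? +)(.*)')

samples = [
    '" The result of the second year-half , " Atria said .',  # quote + space
    ' leading space only',                                     # space only
    'ATRIA SEES H2 RESULT UP ON H1 .',                         # nothing to trim
    '"No space after the quote"',                              # nothing to trim
]

for sample in samples:
    match = pattern.match(sample)
    if match:
        prefix, rest = match.groups()
        # len(prefix) is how far START has to move forward.
        print(repr(prefix), len(prefix), repr(rest))
    else:
        print('no match:', repr(sample))
```

Note that the pattern removes the opening quotation mark along with the space. That agrees with the expected output in the question, but it leaves the closing quote in place.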