Editing attributes across multiple XML documents

Question

This is one of the XML documents I am working with:

<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
  <extent>
    <charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
  </extent>
</span><span type="sentence">
  <extent>
    <charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

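I am iterating over a set of XML documents to retrieve every sentence that begins with a space. I can capture all of the errors (leading spaces) without any trouble: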

>>> import re, os, sys
>>> import xml.etree.ElementTree as etree
>>> sentences = {}

>>> xmlAddresses = getListOfFilesInFolders(['XMLFiles'],ending=u'.xml') # my function to grab all XML files

>>> for docAddr in xmlAddresses:
...    parser = etree.XMLParser(encoding=u'utf-8')
...    tree = etree.parse(docAddr, parser=parser)
...    sentences = getTokenTextFeature(docAddr,tree,sentences)

>>> rgxLeadingSpace = re.compile(r'^"? .')
>>> for sent in sentences.keys():
...    text = sentences[sent]['sentence']
...    if rgxLeadingSpace.findall(text):
...        print(text)                       # the second sentence is from the above XML doc

" It rallied on ideas the market was oversold , " a trader said . 

" The result of the second year-half is expected to improve on the early part of the year , " Atria said .

" The head of state 's holiday has only just begun , " the agency quoted Sergei Yastrzhembsky as saying , adding that the president was currently in a Kremlin residence near Moscow .

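What I need to do, once the errors are found, is go through every XML file that contains them and adjust the START attribute. For example, here is the sentence from the XML document above that has a leading space: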

<charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

It should look like this:

<charseq START="207" END="310">The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

I think I have provided all the necessary code. If anyone can help me, I will create a million StackOverflow accounts and upvote you a million times! :) Thanks!

Answer 1

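The approach I would take is this: instead of extracting the sentences into a separate array and searching that for matches, check each sentence element against your pattern as you walk the nodes of the DOM. That way, when you find a match, you have direct access to the element object and can modify its START attribute, and then simply write the modified DOM out to a new (or replacement) XML file.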
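A minimal sketch of that approach, assuming Python 3 and the charseq layout shown in the question. The names fix_leading_space and fix_folder, the flat folder layout, and the choice to overwrite each file in place are illustrative assumptions, not something from the original post. Like the asker's expected output, it drops an optional opening quote together with the leading spaces and leaves END untouched.

```python
import glob
import os
import re
import xml.etree.ElementTree as etree

# Optional opening quote followed by at least one space, at the start of the text.
LEADING = re.compile(r'^"? +')

def fix_leading_space(path):
    """Fix every charseq element in one file; return how many were changed."""
    tree = etree.parse(path)
    changed = 0
    for charseq in tree.getroot().iter('charseq'):
        match = LEADING.match(charseq.text or '')
        if match and charseq.get('START') is not None:
            skipped = len(match.group(0))
            # Shift START by the number of characters removed from the front.
            charseq.set('START', str(int(charseq.get('START')) + skipped))
            charseq.text = charseq.text[skipped:]
            changed += 1
    if changed:
        # Overwrites the input file (an assumption); pass another path to keep the original.
        tree.write(path, encoding='utf-8', xml_declaration=True)
    return changed

def fix_folder(folder):
    """Apply the fix to every .xml file directly inside folder (hypothetical layout)."""
    paths = sorted(glob.glob(os.path.join(folder, '*.xml')))
    return {path: fix_leading_space(path) for path in paths}
```

Calling fix_folder('XMLFiles') would return a mapping from each file path to the number of sentences adjusted in it; files with no matches are not rewritten.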

Answer 2

I don't know what getTokenTextFeature does, but here is a program that modifies the XML in the way you asked.

xml='''<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
  <extent>
    <charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
  </extent>
</span><span type="sentence">
  <extent>
    <charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>
</extent></span></document>
'''

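import re
import xml.etree.ElementTree as etree

root = etree.XML(xml)
for charseq in root.findall(".//span[@type='sentence']/extent/charseq[@START]"):
  # Group 1: optional opening quote plus leading spaces; group 2: the rest of the sentence.
  match = re.match(r'^("? +)(.*)', charseq.text or '')
  if match:
    space, text = match.groups()
    # Shift START by the number of characters dropped from the front.
    charseq.set('START', str(int(charseq.get('START')) + len(space)))
    charseq.text = text
# encoding='unicode' makes tostring return a str on Python 3.
print(etree.tostring(root, encoding='unicode'))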
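
To see why START moves from 205 to 207 for the sentence in the question, here is a small check of what the pattern captures. The helper name split_prefix and the sample strings are made up for illustration; the pattern is the one used in the program above.

```python
import re

pattern = re.compile(r'^("? +)(.*)')

def split_prefix(text):
    """Return (prefix_length, remaining_text); prefix_length is 0 when nothing matches."""
    match = pattern.match(text)
    if not match:
        return 0, text
    prefix, rest = match.groups()
    return len(prefix), rest

print(split_prefix('" The result , " Atria said .'))    # (2, 'The result , " Atria said .')
print(split_prefix(' leading space only'))              # (1, 'leading space only')
print(split_prefix('ATRIA SEES H2 RESULT UP ON H1 .'))  # (0, 'ATRIA SEES H2 RESULT UP ON H1 .')
```

The prefix length is the amount added to START, so a quote plus one space gives an offset of 2.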

Answer 1
1 2014-09-18 22:21:22

Answer 2
1 Accepted 2014-09-18 22:21:31