Editing attributes of multiple XML docs

Here's one XML doc I'm working with:

<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
  <extent>
    <charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
  </extent>
</span><span type="sentence">
  <extent>
    <charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

I'm looping through a set of XML docs to retrieve all sentences that begin with a space, optionally preceded by an opening quotation mark. I have no trouble capturing all the errors (leading spaces) with this:

import re
import xml.etree.ElementTree as etree

sentences = {}

# my function to grab all XML files
xmlAddresses = getListOfFilesInFolders(['XMLFiles'], ending=u'.xml')

for docAddr in xmlAddresses:
    parser = etree.XMLParser(encoding=u'utf-8')
    tree = etree.parse(docAddr, parser=parser)
    sentences = getTokenTextFeature(docAddr, tree, sentences)  # my function as well

# an optional opening quote, then a space, at the start of the sentence
rgxLeadingSpace = re.compile(r'^"? .')
for sent in sentences.keys():
    text = sentences[sent]['sentence']
    if rgxLeadingSpace.findall(text):
        print(text)  # the second sentence is from the above XML doc

" It rallied on ideas the market was oversold , " a trader said . 

" The result of the second year-half is expected to improve on the early part of the year , " Atria said .

" The head of state 's holiday has only just begun , " the agency quoted Sergei Yastrzhembsky as saying , adding that the president was currently in a Kremlin residence near Moscow . 

What I need to do is, after finding the errors, loop through all the XML files which contain those errors and adjust their START attributes. For example, this is a sentence from the above XML doc that contained a leading space:

<charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

It should look like this:

<charseq START="207" END="310">The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>

I think I provided all the necessary code. If someone can help me I will create a million StackOverflow accounts and upvote you a million times! :) Thanks!

Rather than extracting the sentences into a separate dictionary and searching that, I would check each sentence element against your pattern while traversing the nodes of the DOM. When an element matches, you already hold the element object, so you can modify its START attribute directly and then write the modified DOM out to a new (or replacement) XML file.

I don't know what getTokenTextFeature does, but here is a program that modifies the XML in the manner you asked for.

xml='''<?xml version="1.0"?>
<document DOCID="501.conll.txt">
<span type="sentence">
  <extent>
    <charseq START="0" END="30">ATRIA SEES H2 RESULT UP ON H1 .</charseq>
  </extent>
</span><span type="sentence">
  <extent>
    <charseq START="205" END="310">" The result of the second year-half is expected to improve on     the early part of the year , " Atria said .</charseq>
</extent></span></document>
'''

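import re
import xml.etree.ElementTree as etree

root = etree.XML(xml)
# Visit every sentence-level charseq that carries a START offset.
for charseq in root.findall(".//span[@type='sentence']/extent/charseq[@START]"):
  # Match an optional opening quote followed by one or more spaces.
  # Only the prefix is matched, so text containing newlines is not truncated.
  match = re.match(r'"? +', charseq.text or '')
  if match:
    prefix = match.group()
    # Shift START past the stripped prefix, then drop the prefix from the text.
    charseq.set('START', str(int(charseq.get('START')) + len(prefix)))
    charseq.text = charseq.text[len(prefix):]
# encoding='unicode' makes tostring return str rather than bytes on Python 3.
print(etree.tostring(root, encoding='unicode'))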
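
Since the question is about multiple files, here is a minimal sketch of how the same fix might be applied across a folder and written back to disk. The function names `fix_leading_space` and `fix_folder` are made up for illustration. Using `glob` in place of your `getListOfFilesInFolders` is an assumption on my part, since I can't see that function.

```python
import glob
import os
import re
import xml.etree.ElementTree as etree

# An optional opening quote followed by one or more spaces.
LEADING = re.compile(r'"? +')

def fix_leading_space(root):
    """Strip the leading prefix from each charseq and shift START by its length.

    Returns the number of charseq elements that were changed.
    """
    changed = 0
    for charseq in root.findall(".//span[@type='sentence']/extent/charseq[@START]"):
        match = LEADING.match(charseq.text or '')
        if match:
            prefix = match.group()
            charseq.set('START', str(int(charseq.get('START')) + len(prefix)))
            charseq.text = charseq.text[len(prefix):]
            changed += 1
    return changed

def fix_folder(folder):
    """Rewrite in place every .xml file in folder that needed a fix.

    Returns the list of paths that were rewritten.
    """
    fixed = []
    for path in sorted(glob.glob(os.path.join(folder, '*.xml'))):
        tree = etree.parse(path)
        if fix_leading_space(tree.getroot()):
            tree.write(path, encoding='utf-8', xml_declaration=True)
            fixed.append(path)
    return fixed
```

Calling `fix_folder('XMLFiles')` would overwrite the affected files, so you may want to run it on a copy of the folder first. Files that need no change are left untouched, and running it a second time should be a no-op because the stripped sentences no longer match the pattern.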
