
Python XML parser renames namespace variables

I have been using xml.etree.ElementTree to parse a Word XML document. After making my changes, I use tree.write('test.xml') to write the tree to a file. Once the XML is saved, Word is unable to read the file. Looking at the XML, it appears that all of the namespace prefixes have been renamed.

For example, w:t became ns2:t
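
The renaming can be reproduced without Word at all. The snippet below is a minimal sketch, assuming a tiny hand-written WordprocessingML fragment (the SAMPLE string is made up for illustration; only the namespace URI comes from the code in this question). It round-trips the fragment through xml.etree.ElementTree without modifying anything:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a Word document; only the namespace URI is real.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>'
    '</w:document>'
)

root = ET.fromstring(SAMPLE)
# Serialize straight back without touching the tree.
round_tripped = ET.tostring(root, encoding='unicode')
print(round_tripped)
```

With a single namespace the generated prefix is ns0; a real Word file declares several namespaces, which is presumably why w ends up as ns2 there.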

import xml.etree.ElementTree as ET
import re

tree = ET.parse('FL0809spec2.xml')
root = tree.getroot()

l = [' ',' ']
prev = None
count = 0

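# Walk every w:t (text run) element and rejoin words split across two runs.
for t in root.iter('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}t'):
    l[0] = l[1]
    l[1] = t.text or ''  # t.text is None for an empty element
    # The previous run ends in a letter and this run starts with a lowercase letter.
    if l[0] != '' and l[1] != '' and re.search(r'[a-zA-Z]', l[0][-1]) and re.search(r'[a-z]', l[1][0]):
        words = re.findall(r'(\b\w+\b)(\W+)', l[1])
        if len(words) > 0:
            # Move the leading word fragment back onto the previous run.
            prev.text = prev.text + words[0][0]
            t.text = t.text[len(words[0][0]):]
            count += 1
    prev = t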

tree.write('FL0809spec2Improved.xml')

It appears that:

a) Python's built-in xml.etree.ElementTree does not round-trip XML transparently: if you read an XML file and immediately write it back out, the output differs from the input. The namespace prefixes are changed, for example, and the initial ?xml and ?mso lines (the XML declaration and a processing instruction) are dropped. There may be other differences. Dropping those two initial lines doesn't seem to matter, so it is something about the rest of the XML that Word doesn't like.

and b) MS Word expects the namespaces to be written with exactly the same prefixes as in the XML files it generates. IMO this is very poor (if not appalling) style, because in pure XML terms it is the namespace URI that defines the namespace, not the prefix used to reference it, but hey ho, that's the way it seems to work.
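
If you would rather stay with the standard library, the prefixes can be pinned by registering them before writing. The following is only a sketch under assumptions: SAMPLE is a made-up fragment, register_declared_namespaces is a hypothetical helper name, and I have not checked whether Word accepts the result for a real document. ET.register_namespace and the 'start-ns' event of ET.iterparse are standard-library features.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a Word document; only the namespace URI is real.
SAMPLE = b'''<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>
</w:document>'''


def register_declared_namespaces(source):
    """Register every prefix declared in the XML so it is reused on output."""
    for _event, (prefix, uri) in ET.iterparse(source, events=['start-ns']):
        ET.register_namespace(prefix, uri)


register_declared_namespaces(io.BytesIO(SAMPLE))
tree = ET.parse(io.BytesIO(SAMPLE))

out = io.BytesIO()
# xml_declaration=True restores the leading <?xml ...?> line.
tree.write(out, encoding='UTF-8', xml_declaration=True)
result = out.getvalue().decode('utf-8')
print(result)
```

Note that register_namespace changes a global registry and raises ValueError for reserved prefixes of the form ns0, ns1, and so on. ElementTree also only writes declarations for namespaces that are actually used in element or attribute names, so the output can still differ from what Word produced.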

As long as you don't mind installing lxml, solving your problem is very easy. Happily, lxml.etree.ElementTree appears to be a lot more determined than xml.etree.ElementTree about not changing anything when it writes out what it has read: at the very least it keeps the prefixes that were read in, and those first two lines are written too.

So to use lxml:

Install lxml with pip:

pip install lxml

Change the first line of your code from:

import xml.etree.ElementTree as ET

to:

from lxml import etree as ET

Then (in my testing of your code, with the parts that modify the tree between reading and writing removed) the output document can be opened without error in MS Word :-)
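
For a quick check that does not involve Word, here is a small sketch of the same round trip with lxml. It assumes lxml is installed and uses a made-up SAMPLE fragment; the ?mso line in it is my assumption of what such a processing instruction looks like, not copied from your file. The declaration is requested explicitly here rather than relying on defaults.

```python
import io
from lxml import etree as ET  # third-party: pip install lxml

# Hypothetical stand-in for a Word document, including the two leading lines.
SAMPLE = b'''<?xml version="1.0" encoding="UTF-8"?>
<?mso-application progid="Word.Document"?>
<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
<w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body>
</w:document>'''

tree = ET.parse(io.BytesIO(SAMPLE))

out = io.BytesIO()
# Ask for the XML declaration explicitly instead of relying on the default.
tree.write(out, encoding='UTF-8', xml_declaration=True)
result = out.getvalue()
print(result.decode('utf-8'))
```

The w prefix and the processing instruction before the root element should both survive, which matches what I saw with your file.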
