
Python xml.dom.minidom Unicode

I'm trying to create an XML document in Python, but some of the strings I'm working with are encoded in unicode. Is there a way to create a text node with xml.dom.minidom from unicode strings? Is there another module I can use?

Thanks.

In theory, per the docs:

the DOMString defined in the recommendation is mapped to a Python string or Unicode string. Applications should be able to handle Unicode whenever a string is returned from the DOM.

so you should be fine with either a Unicode string, or a Python string (utf-8 is the default encoding in XML).

In practice, in Python 2, I've sometimes had problems with Unicode strings in xml.dom (I switched almost entirely away from it to ElementTree a while ago, so I'm not positive that the problems are still there in recent Python 2 releases).
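
For comparison, here is a minimal sketch of the ElementTree route mentioned above. It assumes Python 3 and the standard-library xml.etree.ElementTree module; the element name and text are only illustrative.

```python
import xml.etree.ElementTree as ET

# Build <a>pié</a> with a Unicode string as the element text.
root = ET.Element("a")
root.text = "pi\u00e9"

# encoding="unicode" asks for a str result; a real encoding name
# such as "utf-8" asks for bytes in that encoding.
as_text = ET.tostring(root, encoding="unicode")
as_bytes = ET.tostring(root, encoding="utf-8")

print(repr(as_text))
print(repr(as_bytes))
```

The point is only that the text goes in as a Unicode string and the choice of byte encoding is deferred to serialization time.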

If you do meet problems using Unicode strings directly, I think you'll want to try encoded strings instead, e.g. thedoc.createTextNode(u'pié'.encode('utf-8')).

In Python 3, of course, str objects are Unicode, so everything's rather different in this regard ;-).

The dom objects seem to have an encoding argument, see 20.7.1 of the Python docs. Read the footnote as well; take care to use the proper encoding string.
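
As a rough illustration of that encoding argument, the sketch below assumes Python 3 (where every str is Unicode) and uses the encoding parameter of toxml(). The document content is made up for the example.

```python
from xml.dom import minidom

doc = minidom.parseString("<a>b</a>")
doc.documentElement.appendChild(doc.createTextNode("Hell\u00f6"))

# Without an encoding, toxml() returns text (str).
as_text = doc.toxml()

# With an encoding, toxml() returns bytes in that encoding, and the
# XML declaration names the encoding.
as_bytes = doc.toxml(encoding="utf-8")

print(repr(as_text))
print(repr(as_bytes))
```

The encoding name ends up in the XML declaration, which is why it is worth passing a proper name such as "utf-8" there.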

Is there a way to create a text node using xml.dom.minidom using unicode strings?

Yes, createTextNode always takes Unicode strings. The text model of the XML information set is Unicode, as you can see:

>>> from xml.dom import minidom
>>> doc = minidom.parseString('<a>b</a>')
>>> doc.documentElement.firstChild.data
u'b'

So:

>>> doc.createTextNode(u'Hell\xF6') # OK
<DOM Text node "u'Hell\xf6'">

Minidom does allow you to put non-Unicode strings in the DOM, but if you do and they contain non-ASCII characters you'll come a cropper later on:

>>> doc.documentElement.appendChild(doc.createTextNode('Hell\xF6')) # Wrong, not Unicode string
<DOM Text node "'Hell\xf6'">

>>> doc.toxml()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 45, in toxml
    return self.toprettyxml("", "", encoding)
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 60, in toprettyxml
    return writer.getvalue()
  File "/usr/lib/python2.6/StringIO.py", line 270, in getvalue
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 4: ordinal not in range(128)

This is assuming that by “encoded in unicode” you mean you are using Unicode strings. If you mean something else, like you've got byte strings in a UTF-8 encoding, you need to convert those byte strings to Unicode strings before you put them in the DOM:

>>> b= 'Hell\xc3\xb6'    # Hellö encoded in UTF-8 bytes
>>> u= b.decode('utf-8') # Proper Unicode string Hellö
>>> doc.documentElement.appendChild(doc.createTextNode(u))
>>> doc.toxml()
u'<?xml version="1.0" ?><a>bHell\xf6</a>' # correct!
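
The same decode-first rule can be sketched for Python 3, where the two kinds of string are bytes and str. The to_text helper below is hypothetical (it is not part of minidom), and it assumes the incoming bytes are UTF-8 unless told otherwise.

```python
from xml.dom import minidom


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as str, decoding bytes first."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


doc = minidom.parseString("<a>b</a>")

# One piece arrives as UTF-8 bytes, the other is already a Unicode str.
for piece in (b"Hell\xc3\xb6", " w\u00f6rld"):
    doc.documentElement.appendChild(doc.createTextNode(to_text(piece)))

xml_text = doc.toxml()
print(repr(xml_text))
```

Normalizing at the boundary like this keeps only Unicode text inside the DOM, so serialization never has to guess at an encoding.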
