
Python xml.dom.minidom Unicode

I'm trying to create an XML document in Python, but some of the strings I'm working with are encoded in Unicode. Is there a way to create a text node with xml.dom.minidom using Unicode strings? Is there another module I can use?

Thanks.

In theory, per the docs:

the DOMString defined in the recommendation is mapped to a Python string or Unicode string. Applications should be able to handle Unicode whenever a string is returned from the DOM.

so you should be fine with either a Unicode string or a Python string (UTF-8 is the default encoding in XML).

In practice, in Python 2, I've sometimes had problems with Unicode strings in xml.dom (I switched almost entirely away from it to ElementTree a while ago, so I'm not positive that the problems are still there in recent Python 2 releases).
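For reference, here is a minimal sketch of the ElementTree route mentioned above. It assumes Python 3, where ordinary str values are Unicode; the element name and text are only illustrative.

```python
import xml.etree.ElementTree as ET

# Build a one-element document whose text contains a non-ASCII character.
root = ET.Element("a")
root.text = "pi\u00e9"  # "pié" as an ordinary Python 3 str

# encoding="unicode" asks ElementTree for a str instead of encoded bytes.
as_text = ET.tostring(root, encoding="unicode")

# Naming a concrete encoding returns bytes in that encoding.
as_bytes = ET.tostring(root, encoding="utf-8")

print(as_text)
print(as_bytes)
```

The only choice to make is whether you want text back (encoding="unicode") or bytes ready to write to a file.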

If you do run into problems using Unicode strings directly, I think you'll want to try encoded strings instead, e.g., thedoc.createTextNode(u'pié'.encode('utf-8')).

In Python 3, of course, strs are Unicode, so everything's rather different in this regard ;-).
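As a rough illustration of the Python 3 case (a sketch, not taken from the original answer), a plain str with a non-ASCII character can be passed straight to createTextNode, and toxml() returns a str:

```python
from xml.dom import minidom

# Parse a tiny document and append a text node containing "Hellö".
doc = minidom.parseString("<a>b</a>")
doc.documentElement.appendChild(doc.createTextNode("Hell\u00f6"))

# With no encoding argument, toxml() returns a Unicode str.
xml_text = doc.toxml()
print(xml_text)
```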

The DOM objects seem to have an encoding argument; see section 20.7.1 of the Python docs. Read the footnote as well, and take care to use the proper encoding string.
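A small sketch of what that encoding argument does, assuming Python 3 and the toxml() method of a minidom document (the sample document is made up for illustration):

```python
from xml.dom import minidom

doc = minidom.parseString("<a>b</a>")
doc.documentElement.appendChild(doc.createTextNode("Hell\u00f6"))

# Passing encoding makes toxml() return bytes in that encoding and
# record the encoding name in the XML declaration.
as_utf8 = doc.toxml(encoding="utf-8")
print(as_utf8)
```

The encoding name is copied into the declaration as given, so it should be one that XML parsers recognise, such as "utf-8".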

Is there a way to create a text node using xml.dom.minidom using unicode strings?

Yes, createTextNode always takes Unicode strings. The text model of the XML information set is Unicode, as you can see:

>>> from xml.dom import minidom
>>> doc = minidom.parseString('<a>b</a>')
>>> doc.documentElement.firstChild.data
u'b'

So:

>>> doc.createTextNode(u'Hell\xF6') # OK
<DOM Text node "u'Hell\xf6'">

Minidom does allow you to put non-Unicode strings in the DOM, but if you do and they contain non-ASCII characters, you'll come a cropper later on:

>>> doc.documentElement.appendChild(doc.createTextNode('Hell\xF6')) # Wrong, not Unicode string
<DOM Text node "'Hell\xf6'">

>>> doc.toxml()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 45, in toxml
    return self.toprettyxml("", "", encoding)
  File "/usr/lib/python2.6/xml/dom/minidom.py", line 60, in toprettyxml
    return writer.getvalue()
  File "/usr/lib/python2.6/StringIO.py", line 270, in getvalue
    self.buf += ''.join(self.buflist)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 4: ordinal not in range(128)

This is assuming that by “encoded in unicode” you mean you are using Unicode strings. If you mean something else, such as having byte strings in a UTF-8 encoding, you need to convert those byte strings to Unicode strings before you put them in the DOM:

>>> b= 'Hell\xc3\xb6'    # Hellö encoded in UTF-8 bytes
>>> u= b.decode('utf-8') # Proper Unicode string Hellö
>>> doc.documentElement.appendChild(doc.createTextNode(u))
>>> doc.toxml()
u'<?xml version="1.0" ?><a>bHell\xf6</a>' # correct!
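The same decode-first idea can be sketched for Python 3, where byte strings are the bytes type. The helper name to_text and its default encoding are assumptions made for this illustration; they are not part of minidom.

```python
from xml.dom import minidom


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return value as str, decoding bytes if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


doc = minidom.parseString("<a>b</a>")
raw = b"Hell\xc3\xb6"  # "Hellö" encoded as UTF-8 bytes

# Decode before the value reaches the DOM, so the DOM only ever holds str.
doc.documentElement.appendChild(doc.createTextNode(to_text(raw)))
result = doc.toxml()
print(result)
```

Decoding at the boundary keeps the choice of input encoding in one place, rather than leaving it to surface as an error at serialisation time.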
