ElementTree and unicode

Question

I have this character in an XML file:

<data>
  <products>
      <color>fumè</color>
  </products>
</data>

I try to create an ElementTree instance with the following code:

string_data = open('file.xml')
x = ElementTree.fromstring(unicode(string_data.encode('utf-8')))

I get the following error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xe8' in position 185: ordinal not in range(128)

(Note: the position is not accurate; I took this sample from a larger XML file.)

How can I solve this? Thanks

Answer 1

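You may have stumbled upon this problem while using Requests (HTTP for Humans). By default, response.text decodes the response body; use response.content instead to get the undecoded bytes, so that ElementTree can do the decoding itself. Remember to use the correct encoding.

More info: http://docs.python-requests.org/en/latest/user/quickstart/#response-content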
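To illustrate the point about undecoded data, here is a minimal sketch that makes no HTTP request. The name raw_body is a hypothetical stand-in for what response.content would hold, and the ISO-8859-1 declaration is only an assumption chosen for the example. The parser reads the declared encoding from the bytes and returns unicode text.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for response.content: raw, undecoded bytes.
# The payload declares its own encoding in the XML declaration.
raw_body = (
    u'<?xml version="1.0" encoding="ISO-8859-1"?>'
    u'<data><products><color>fum\xe8</color></products></data>'
).encode('latin-1')

# Passing bytes lets the parser honor the declared encoding.
root = ET.fromstring(raw_body)
color_text = root[0][0].text
```

With Requests, the equivalent call would presumably be ET.fromstring(response.content).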

Answer 2

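You need to decode a utf-8 string into a unicode object. So

string_data.encode('utf-8')

should be

string_data.decode('utf-8')

assuming string_data is actually a utf-8 string.

To sum up: you encode a unicode object to get a utf-8 string (using the utf-8 encoding), and you decode a string with the matching encoding to get a unicode object.

For more details on the concepts, I recommend reading The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (not Python-specific).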
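The encode/decode direction described above can be sketched as a round trip. The variable names are illustrative only; the u'' prefix keeps the snippet valid on both Python 2 and Python 3.

```python
# A unicode (text) object containing the character from the question.
text = u'fum\xe8'

# encode: unicode -> byte string in a given encoding
encoded = text.encode('utf-8')

# decode: byte string -> unicode, using the same encoding
decoded = encoded.decode('utf-8')

print(repr(encoded))
print(decoded == text)
```

The error in the question arises on Python 2 because calling .encode() on a byte string first performs an implicit decode with the ascii codec.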

Answer 3

You do not need to decode the XML for ElementTree to work. XML carries its own encoding information (defaulting to UTF-8), and ElementTree does the work for you, outputting unicode:

>>> from xml.etree import ElementTree
>>> data = '''\
... <data>
...   <products>
...       <color>fumè</color>
...   </products>
... </data>
... '''
>>> x = ElementTree.fromstring(data)
>>> x[0][0].text
u'fum\xe8'

If your data is in a file or file-like object, just pass the filename or the file object directly to the ElementTree.parse() function:

x = ElementTree.parse('file.xml')
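As a sketch of the file-like case, the snippet below uses io.BytesIO in place of a real file, which is an assumption made so the example is self-contained. The document has no XML declaration, so the parser falls back to UTF-8.

```python
import io
import xml.etree.ElementTree as ET

# In-memory bytes standing in for the contents of file.xml (UTF-8, no declaration).
xml_bytes = (
    u'<data>\n'
    u'  <products>\n'
    u'      <color>fum\xe8</color>\n'
    u'  </products>\n'
    u'</data>\n'
).encode('utf-8')

# parse() accepts a filename or any object with a read() method.
tree = ET.parse(io.BytesIO(xml_bytes))
root = tree.getroot()
color = root.find('products/color').text
```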

Answer 4

Have you tried using the parse function instead of opening the file yourself? (By the way, open() would need a .read() after it for .fromstring() to work.)

import xml.etree.ElementTree as ET

tree = ET.parse('file.xml')
root = tree.getroot()
# etc...

Answer 5

Most likely your file is not UTF-8. The è character could come from some other encoding, latin-1 for example.
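One rough way to check this suspicion is to try decoding the raw bytes as UTF-8 and see whether that fails. This is only a sketch: the sample bytes are made up, and falling back to latin-1 is an assumption, since latin-1 accepts any byte sequence and therefore cannot confirm the real encoding.

```python
# Made-up sample: the fragment from the question, encoded as latin-1.
raw = u'<color>fum\xe8</color>'.encode('latin-1')

try:
    raw.decode('utf-8')
    is_utf8 = True
except UnicodeDecodeError:
    # A lone 0xE8 byte is not a valid UTF-8 sequence.
    is_utf8 = False

# Assumed fallback; latin-1 never fails, so this does not prove the encoding.
text = raw.decode('latin-1')
```

If the file really is latin-1, declaring that encoding in the XML declaration lets ElementTree decode it without any manual step.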

Answer 6

The open() function does not return a string; it returns a file object. Use open('file.xml').read() instead.
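A minimal sketch of that fix follows. It writes a temporary file first so the example is self-contained; the temporary path and the binary mode are assumptions, not part of the original question.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

xml_bytes = u'<data><products><color>fum\xe8</color></products></data>'.encode('utf-8')

# Create a throwaway file standing in for file.xml.
fd, path = tempfile.mkstemp(suffix='.xml')
try:
    with os.fdopen(fd, 'wb') as handle:
        handle.write(xml_bytes)

    # open() gives a file object; read() gives the actual contents.
    with open(path, 'rb') as handle:
        string_data = handle.read()

    root = ET.fromstring(string_data)
    color = root[0][0].text
finally:
    os.remove(path)
```

Reading in binary mode leaves the decoding to the XML parser, in line with the other answers.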

6 answers

Answer 1: 34, 2013-12-29 13:00:20
Answer 2: 15, 2012-09-10 10:30:52
Answer 3: 11, accepted, 2012-09-10 10:35:18
Answer 4: 2, 2012-09-10 10:34:52
Answer 5: 1, 2012-09-10 10:28:56
Answer 6: 1, 2014-03-10 10:07:35
