Why does printing to a UTF-8 file fail?

Question

I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix worked.

This is related to a problem I had the other week: python check if utf-8 string is uppercase

Basically, the following does not work:

#!/usr/bin/python

import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root,'sect')


words = (   u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
            u'R\xc9SUM\xc9',    # RESUME with accents
            u'R\xe9sum\xe9',    # Resume with accents
            u'R\xe9SUM\xe9', )  # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8 
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word

print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')

It fails with the following:

Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)

But if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use outFile = open('test.xml', 'w'), it works perfectly.

So what's happening?

- Since encoding='utf-8' is specified in etree.tostring(), is it encoding the file again?
- If I keep codecs.open() and remove encoding='utf-8', the file becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
- But isn't the output of etree.tostring() simply being written to stdout and then redirected to a file that was created as a UTF-8 file?
- Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.

Basically, why doesn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and want to figure out why my solution works.

Answer 1

You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.

tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're then writing to the file.

Because the file is expecting Unicode, it decodes the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.

Unfortunately, the bytestrings aren't ASCII.
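
The implicit step described above can be spelled out by hand. The following is only an illustrative sketch, not code from the question: it builds a UTF-8 bytestring like the one tostring returns and decodes it as ASCII explicitly, which is roughly what Python 2 does behind the scenes when a bytestring is written to a codecs.open() file. The helper name try_ascii_decode is made up for this example.

```python
# -*- coding: utf-8 -*-
# Runs on Python 2.7 and Python 3. It performs explicitly the ASCII decode
# that Python 2 attempts implicitly in the failing script.

def try_ascii_decode(data):
    """Return the decoded text, or the error message if data is not pure ASCII."""
    try:
        return data.decode('ascii')
    except UnicodeDecodeError as exc:
        return str(exc)

# UTF-8 bytestrings, similar to what tostring(..., encoding='utf-8') returns.
utf8_bytes = u'\u041c\u041e\u0421\u041a\u0412\u0410'.encode('utf-8')
ascii_bytes = u'RESUME'.encode('utf-8')

print(try_ascii_decode(ascii_bytes))   # pure ASCII decodes without trouble
print(try_ascii_decode(utf8_bytes))    # fails on the first byte, 0xd0
```

The second call reports the same kind of error as the traceback in the question: 0xd0 is the first byte of the UTF-8 encoding of a Cyrillic letter, and it lies outside the ASCII range.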

EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
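
A minimal sketch of that advice, under a few assumptions: the input bytes, the temporary file name out.txt, and the variable names are invented for illustration, and io.open is used because it behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Input boundary: bytes arrive from outside and are decoded exactly once.
raw = b'R\xc3\x89SUM\xc3\x89'      # hypothetical UTF-8 input
text = raw.decode('utf-8')

# Inside the program, everything operates on Unicode text.
is_title = text.isupper()

# Output boundary: the file object encodes the text on the way out.
path = os.path.join(tempfile.mkdtemp(), 'out.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(text)

# Reading the file back as bytes shows the original UTF-8 sequence.
with io.open(path, 'rb') as f:
    round_trip = f.read()

print(is_title)
print(round_trip == raw)
```

Because there is exactly one decode and one encode, no step ever needs to guess an encoding, which is where the implicit ASCII decode in the question came from.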

Answer 2

Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (although it doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.

Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:

#!/usr/bin/python
# coding: utf8

import codecs
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')


words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']

for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect,u'title')
        title.text = word
    else:
        item = etree.SubElement(sect,u'item')
        item.text = word

tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')
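
To see that the write method does the encoding itself, so that no codecs wrapper is needed, the tree can be written to an in-memory byte buffer instead of a file. This is a small sketch using the standard-library xml.etree; the buffer and variable names are chosen only for illustration.

```python
# -*- coding: utf-8 -*-
import io
from xml.etree import ElementTree as etree

root = etree.Element('root')
title = etree.SubElement(root, 'title')
title.text = u'R\xc9SUM\xc9'        # Unicode text goes into the tree

# write() encodes on output, so the target only ever receives bytes.
buf = io.BytesIO()
etree.ElementTree(root).write(buf, xml_declaration=True, encoding='utf-8')
data = buf.getvalue()

print(repr(data))
```

The accented letters appear in data as the two-byte UTF-8 sequence \xc3\x89, produced by a single encoding step.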

Answer 3

In addition to MRAB's answer, here are some lines of code:

import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root,'sect')

# do some other xml building here

with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))
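
The snippet above is Python 2, where unicode is a built-in type. As a hedged sketch of the same idea on Python 3, the standard-library xml.etree accepts the string 'unicode' as the encoding and returns text rather than bytes; lxml documents a similar option, but this example sticks to the standard library. The temporary path and the element content are assumptions made for illustration.

```python
# Python 3 only: 'unicode' asks tostring for text (str) instead of bytes.
import io
import os
import tempfile
from xml.etree import ElementTree as etree

root = etree.Element('root')
sect = etree.SubElement(root, 'sect')
title = etree.SubElement(sect, 'title')
title.text = '\u041c\u041e\u0421\u041a\u0412\u0410'

xml_text = etree.tostring(root, encoding='unicode')

# The file object is the only place where text becomes UTF-8 bytes.
path = os.path.join(tempfile.mkdtemp(), 'test.xml')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(xml_text)

with io.open(path, 'rb') as f:
    written = f.read()

print(xml_text)
```

Serialising this way produces no XML declaration, so add one separately if the consumer of the file needs it.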

3 answers

Answer 1: score 3, accepted, 2011-06-29 22:21:08
Answer 2: score 1, 2011-06-30 00:28:01
Answer 3: score 0, 2012-05-04 18:32:34
