Why does printing to a utf-8 file fail?
I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix worked.
This is related to a problem I had the other week: python check if utf-8 string is uppercase
Basically, the following will not work:
#!/usr/bin/python
import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8')  # cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root, 'sect')

words = (u'\u041c\u041e\u0421\u041a\u0412\u0410',  # capital of Russia, all uppercase
         u'R\xc9SUM\xc9',  # RESUME with accents
         u'R\xe9sum\xe9',  # Resume with accents
         u'R\xe9SUM\xe9',)  # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper():  # .isupper won't function on utf8
        title = etree.SubElement(sect, 'title')
        title.text = word
    else:
        item = etree.SubElement(sect, 'item')
        item.text = word

print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
It fails with the following:
Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)
But if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use outFile = open('test.xml', 'w'), it works perfectly.
So what's happening? Since encoding='utf-8' is specified in etree.tostring(), is it encoding the file again?
If I leave codecs.open() and remove encoding='utf-8', the file then becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?
But isn't the output of etree.tostring() simply being written to stdout and then redirected to a file that was created as a utf-8 file? Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.
Basically, why wouldn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and would like to figure out why my solution works.
You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.
tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're writing to the file.
Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.
Unfortunately, the bytestrings aren't ASCII.
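The implicit step described here can be reproduced explicitly. This is a minimal sketch, not the code from the question: it performs by hand the ASCII decode that Python 2 attempts when a UTF-8 bytestring is handed to a file opened with codecs.open. The variable names are illustrative.

```python
# UTF-8 bytes for the Cyrillic word from the question (what tostring emits).
utf8_bytes = u'\u041c\u041e\u0421\u041a\u0412\u0410'.encode('utf-8')

# Decoding those bytes as ASCII fails, because every byte is above 0x7f.
try:
    utf8_bytes.decode('ascii')
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

# Decoding with the codec that actually produced the bytes succeeds.
round_tripped = utf8_bytes.decode('utf-8')
```

The byte 0xd0 named in the resulting message is the first byte of the UTF-8 encoding of the first Cyrillic letter, the same byte the traceback complains about.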
EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
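As a rough illustration of that advice, the sketch below decodes bytes once at the input boundary, works on text in the middle, and lets the file object encode at the output boundary. It assumes io.open for the text-mode file and a temporary directory for the output; both, along with the variable names, are choices made for this example rather than anything from the question.

```python
import io
import os
import tempfile

# Input boundary: bytes arrive (here, UTF-8 for an accented uppercase word).
raw_input_bytes = b'R\xc3\x89SUM\xc3\x89'
text = raw_input_bytes.decode('utf-8')  # decode on input

# Middle: work only with text, so isupper() sees real characters.
is_title = text.isupper()

# Output boundary: the file object encodes the text for us.
path = os.path.join(tempfile.mkdtemp(), 'sandwich.txt')
with io.open(path, 'w', encoding='utf-8') as handle:
    handle.write(text)  # encode on output

# Read the raw bytes back to confirm what landed on disk.
with io.open(path, 'rb') as handle:
    written_bytes = handle.read()
```

Nothing in the middle section needs to know which encoding the input or output uses, which is the point of keeping the conversions at the edges.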
Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.
Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:
#!/usr/bin/python
# coding: utf8
import codecs
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root, u'sect')

words = [u'МОСКВА', u'RÉSUMÉ', u'Résumé', u'RéSUMé']

for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect, u'title')
        title.text = word
    else:
        item = etree.SubElement(sect, u'item')
        item.text = word

tree = etree.ElementTree(root)
tree.write('text.xml', xml_declaration=True, encoding='utf-8')
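To see what the write method produces without touching the filesystem, the tree can be written to an in-memory byte buffer. This is a sketch assuming the standard library xml.etree.ElementTree; io.BytesIO stands in for the output file, and the small tree is a cut-down, illustrative version of the one in the answer.

```python
import io
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
title = etree.SubElement(root, u'title')
title.text = u'\u041c\u041e\u0421\u041a\u0412\u0410'  # "Moscow" in Cyrillic capitals

# write() does the encoding itself, so the target only ever receives bytes.
buffer = io.BytesIO()
etree.ElementTree(root).write(buffer, xml_declaration=True, encoding='utf-8')
serialized = buffer.getvalue()

# The Cyrillic text appears in the output as raw UTF-8 bytes.
moscow_bytes = title.text.encode('utf-8')
contains_raw_utf8 = moscow_bytes in serialized
```

Because the serializer already emits encoded bytes, the destination should be a plain byte sink, which is consistent with the question's observation that a file from the built-in open() works.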
In addition to MRAB's answer, some lines of code:
import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root, 'sect')
# do some other xml building here

with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))
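The encoding=unicode argument relies on the Python 2 unicode built-in. For readers without lxml, a rough counterpart with the standard library on Python 3 is encoding='unicode', which makes tostring return text instead of bytes. The sketch below assumes Python 3 and xml.etree.ElementTree; the temporary path and element contents are illustrative.

```python
import io
import os
import tempfile
from xml.etree import ElementTree as etree

root = etree.Element('root')
sect = etree.SubElement(root, 'sect')
title = etree.SubElement(sect, 'title')
title.text = u'R\xc9SUM\xc9'

# Ask for text, not bytes, so the file object can do the only encoding step.
xml_text = etree.tostring(root, encoding='unicode')

path = os.path.join(tempfile.mkdtemp(), 'test.xml')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(xml_text)

# Read the file back as UTF-8 text to confirm the round trip.
with io.open(path, encoding='utf-8') as f:
    read_back = f.read()
```

Either arrangement works as long as exactly one layer encodes: text into an encoding file object, or bytes into a plain binary file.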