
Why does printing to a utf-8 file fail?

I ran into a problem this afternoon. I was able to solve it, but I don't quite understand why the fix worked.

This is related to a problem I had the other week: python check if utf-8 string is uppercase

Basically, the following will not work:

#!/usr/bin/python

import codecs
from lxml import etree

outFile = codecs.open('test.xml', 'w', 'utf-8') #cannot use codecs.open()

root = etree.Element('root')
sect = etree.SubElement(root,'sect')


words = (   u'\u041c\u041e\u0421\u041a\u0412\u0410', # capital of Russia, all uppercase
            u'R\xc9SUM\xc9',    # RESUME with accents
            u'R\xe9sum\xe9',    # Resume with accents
            u'R\xe9SUM\xe9', )  # ReSUMe with accents

for word in words:
    print word
    if word.encode('utf8').decode('utf8').isupper(): #.isupper won't function on utf8 
        title = etree.SubElement(sect,'title')
        title.text = word
    else:
        item = etree.SubElement(sect,'item')
        item.text = word

print>>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')

It fails with the following:

Traceback (most recent call last):
  File "./temp.py", line 25, in <module>
    print >>outFile,etree.tostring(root,pretty_print=True,xml_declaration=True,encoding='utf-8')
  File "/usr/lib/python2.7/codecs.py", line 691, in write
    return self.writer.write(data)
  File "/usr/lib/python2.7/codecs.py", line 351, in write
    data, consumed = self.encode(object, self.errors)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 66: ordinal not in range(128)

But if I open the new file without codecs.open('test.xml', 'w', 'utf-8') and instead use outFile = open('test.xml', 'w'), it works perfectly.

So what's happening?

  • Since encoding='utf-8' is specified in etree.tostring(), is it encoding the file again?

  • If I leave codecs.open() and remove encoding='utf-8', the file then becomes an ASCII file. Why? Because etree.tostring() has a default encoding of ASCII, I presume?

  • But isn't the output of etree.tostring() simply written to stdout and then redirected to a file that was created as a UTF-8 file?

    • Is print>> not working as I expect? outFile.write(etree.tostring()) behaves the same way.

Basically, why wouldn't this work? What is going on here? It might be trivial, but I am obviously a bit confused and would like to figure out why my solution works.

You've opened the file with UTF-8 encoding, which means that it expects Unicode strings.

tostring is encoding to UTF-8 (in the form of bytestrings, str), which you're then writing to the file.
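To see the return types involved, here is a small sketch using the standard library's xml.etree.ElementTree under Python 3. That is an assumption made for illustration: the question uses lxml on Python 2, where bytes are called str and text is called unicode. The element and variable names are illustrative only.

```python
from xml.etree import ElementTree as ET

root = ET.Element('root')
root.text = u'R\xe9sum\xe9'

# With an explicit byte encoding, tostring() returns already-encoded bytes.
as_utf8 = ET.tostring(root, encoding='utf-8')

# With no encoding argument the result is still bytes, but ASCII-only:
# non-ASCII characters are written as numeric character references.
as_default = ET.tostring(root)

# Asking for 'unicode' returns decoded text instead of bytes.
as_text = ET.tostring(root, encoding='unicode')
```

The middle case is also why dropping encoding='utf-8' in the question produced a file containing only ASCII characters.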

Because the file is expecting Unicode, it's decoding the bytestrings to Unicode using the default ASCII encoding so that it can then encode the Unicode to UTF-8.

Unfortunately, the bytestrings aren't ASCII.
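The implicit step described above can be written out explicitly. This is a minimal sketch for Python 3, where the ASCII decode has to be requested by hand; in Python 2 the codecs writer attempts it on its own when given a bytestring. The variable names are illustrative.

```python
text = u'\u041c\u041e\u0421\u041a\u0412\u0410'  # the first word from the question
data = text.encode('utf-8')                      # bytes, like tostring(..., encoding='utf-8') returns

# Decoding UTF-8 bytes as ASCII fails on the first non-ASCII byte.
try:
    data.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc

# Decoding with the encoding that actually produced the bytes succeeds.
roundtrip = data.decode('utf-8')
```

The first byte of the encoded word is 0xd0, the same byte value reported in the traceback.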

EDIT: The best advice to avoid this kind of problem is to use Unicode internally, decoding on input and encoding on output.
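A minimal sketch of that pattern, assuming Python 3 (io.open is also available in Python 2.6 and later). The file names and the temporary directory are hypothetical and exist only to make the example self-contained.

```python
import io
import os
import tempfile

# Hypothetical file names, used only for this illustration.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'words_in.txt')
dst = os.path.join(workdir, 'words_out.txt')

# Put some UTF-8 encoded input bytes on disk.
with open(src, 'wb') as f:
    f.write(u'R\xe9sum\xe9\n\u041c\u041e\u0421\u041a\u0412\u0410\n'.encode('utf-8'))

# Decode on input: io.open() with an encoding yields Unicode text.
with io.open(src, 'r', encoding='utf-8') as f:
    words = [line.strip() for line in f]

# Work on Unicode internally.
upper_words = [w.upper() for w in words]

# Encode on output: the file object encodes the text to UTF-8.
with io.open(dst, 'w', encoding='utf-8') as f:
    f.write(u'\n'.join(upper_words))

# Read the raw bytes back to inspect what was written.
with open(dst, 'rb') as f:
    raw = f.read()
```

Bytes appear only at the two boundaries; everything in between handles text.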

Using print>>outFile is a little strange. I don't have lxml installed, but the built-in xml.etree library is similar (but doesn't support pretty_print). Wrap the root Element in an ElementTree and use the write method.

Also, if you use a # coding line to declare the encoding of the source file, you can use readable Unicode strings instead of escape codes:

#!/usr/bin/python
# coding: utf8

import codecs
from xml.etree import ElementTree as etree

root = etree.Element(u'root')
sect = etree.SubElement(root,u'sect')


words = [u'МОСКВА',u'RÉSUMÉ',u'Résumé',u'RéSUMé']

for word in words:
    print word
    if word.isupper():
        title = etree.SubElement(sect,u'title')
        title.text = word
    else:
        item = etree.SubElement(sect,u'item')
        item.text = word

tree = etree.ElementTree(root)
tree.write('text.xml',xml_declaration=True,encoding='utf-8')
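Calling word.isupper() directly works in the code above because the words are Unicode strings. A minimal sketch of the difference, written for Python 3, where bytes plays the role of Python 2's str (an assumption about how the two map; in Python 2 the result for bytestrings can also depend on the locale):

```python
words = [
    u'\u041c\u041e\u0421\u041a\u0412\u0410',  # all uppercase Cyrillic
    u'R\xc9SUM\xc9',                            # all uppercase, with accents
    u'R\xe9sum\xe9',                            # mixed case
    u'R\xe9SUM\xe9',                            # uppercase ASCII, lowercase accents
]

# On Unicode text, isupper() takes cased non-ASCII characters into account.
text_results = [w.isupper() for w in words]

# On UTF-8 bytes, only ASCII letters are considered cased.
byte_results = [w.encode('utf-8').isupper() for w in words]
```

Here text_results is [True, True, False, False], while byte_results is [False, True, False, True]: the byte version misses the Cyrillic word entirely and overlooks the lowercase accented letters.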

In addition to MRAB's answer, here are some lines of code:

import codecs
from lxml import etree

root = etree.Element('root')
sect = etree.SubElement(root,'sect')

# do some other xml building here

with codecs.open('test.xml', 'w', encoding='utf-8') as f:
    f.write(etree.tostring(root, encoding=unicode))
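For readers on Python 3, a roughly equivalent sketch with the standard library's xml.etree.ElementTree is shown here. This is an assumption-laden adaptation, not the answerer's code: encoding='unicode' (a string) takes the place of encoding=unicode, io.open stands in for codecs.open, and the temporary output path is hypothetical.

```python
import io
import os
import tempfile
from xml.etree import ElementTree as ET

root = ET.Element('root')
sect = ET.SubElement(root, 'sect')
title = ET.SubElement(sect, 'title')
title.text = u'\u041c\u041e\u0421\u041a\u0412\u0410'

# Hypothetical output location for this illustration.
path = os.path.join(tempfile.mkdtemp(), 'test.xml')

# tostring(..., encoding='unicode') returns text, which is what a file
# opened with an encoding expects; the file object does the UTF-8 encoding.
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(ET.tostring(root, encoding='unicode'))

# Parsing the file back confirms the text survived the round trip.
parsed = ET.parse(path).getroot()
```

The text is encoded exactly once, by the file object, so no implicit ASCII decode is ever attempted.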
