Reading a UTF8-encoded CSV and converting it to UTF-16

Question

I'm reading in a CSV file that has UTF8 encoding:

import csv

ifile = open(fname, "r")  # fname: path to the UTF8-encoded CSV file
for row in csv.reader(ifile):
    name = row[0]
    print repr(row[0])

This works fine and prints what I expect: a UTF8-encoded str:

> '\xc3\x81lvaro Salazar'
> '\xc3\x89lodie Yung'
...

Furthermore, when I simply print the str (as opposed to its repr()), the output displays fine (which I don't understand either; shouldn't this cause an error?):

> Álvaro Salazar
> Élodie Yung

But when I try to convert my UTF8-encoded strs to unicode:

ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print unicode(name, 'utf-8')  # or name.decode('utf-8')

I get the infamous:

Traceback (most recent call last):                                       
File "scripts/script.py", line 33, in <module>
    print unicode(fullname, 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)

So I looked at the unicode strings that are created:

ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    unicode_name = unicode(name, 'utf-8')
    print repr(unicode_name)

and the output is:

 > u'\xc1lvaro Salazar'
 > u'\xc9lodie Yung'

So now I'm totally confused, as these seem to be mangled hex values. I've read this question:

Reading a UTF8 CSV file with Python

and it appears I am doing everything correctly, which leads me to believe that my file is not actually UTF8. But when I initially print out the repr values of the cells, they appear to be correct UTF8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down? (I'm starting to get lost in the jungle of encodings.)

As an aside, I believe I could use codecs to open the file and read it directly into unicode objects, but the csv module doesn't support unicode natively, so I can't use this approach.

Answer 1

Your default encoding is ASCII. When you try to print a unicode object, the interpreter therefore tries to encode it using the ASCII codec, which fails because your text includes characters that don't exist in ASCII.
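A minimal sketch of that failure, reproduced by calling the ASCII codec explicitly rather than through print. The string literal is taken from the question; the variable names are only illustrative, and the snippet is written so that it should run on both Python 2 and Python 3.

```python
name = u'\xc1lvaro Salazar'  # the unicode object built in the question

try:
    # Roughly the step print performs implicitly when the output encoding is ASCII
    name.encode('ascii')
    error = None
except UnicodeEncodeError as exc:
    error = exc

# The exception records which codec failed and at which position
print(repr(error))
```

The error is raised at position 0 because Á is the first character and has no ASCII equivalent.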

The reason that printing the UTF-8 encoded bytestring doesn't produce an error (which seems to confuse you, although it shouldn't) is that this simply sends the bytes to your terminal. It will never produce a Python error, although it may produce ugly output if your terminal doesn't know what to do with the bytes.

To print a unicode, use print some_unicode.encode('utf-8'). (Or whatever encoding your terminal is actually using.)
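One way to sketch "whatever encoding your terminal is actually using" is to ask sys.stdout for it and fall back to UTF-8. The helper name encode_for_terminal and the 'replace' error policy are assumptions made for illustration, not part of the original answer.

```python
import sys


def encode_for_terminal(text, encoding=None):
    """Hypothetical helper: encode a unicode string into bytes for output."""
    if encoding is None:
        # sys.stdout.encoding can be None, for example when output is piped
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # 'replace' substitutes '?' for characters the target encoding lacks
    return text.encode(encoding, 'replace')


utf8_name = encode_for_terminal(u'\xc1lvaro Salazar', 'utf-8')
ascii_name = encode_for_terminal(u'\xc1lvaro Salazar', 'ascii')
```

On Python 2, print utf8_name then sends those bytes to the terminal unchanged. On Python 3, print accepts text directly, so the explicit encode step is usually unnecessary there.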

As for the u'\xc1lvaro Salazar', nothing here is mangled. The character Á is at the Unicode code point C1 (which has nothing to do with its UTF-8 representation, but happens to be the same value as in Latin-1), and Python uses \x hex escapes instead of \u Unicode code point notation for code points that would have 00 as the most significant byte, to save space (it could also have displayed this as \u00c1).
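The difference between the code point and its encoded bytes can be checked directly. This is a small sketch using the name from the question; the variable names are illustrative, and it should run on both Python 2 and Python 3.

```python
name = u'\xc1lvaro Salazar'

# The code point of the first character: 0xC1, regardless of any encoding
codepoint = ord(name[0])

# UTF-8 needs two bytes for this character (C3 81) ...
utf8_bytes = name.encode('utf-8')

# ... while Latin-1 stores it as the single byte C1
latin1_bytes = name.encode('latin-1')

# \xc1 and \u00c1 are two spellings of the same character in a unicode literal
same_string = (u'\u00c1lvaro Salazar' == name)

print(hex(codepoint))
```

So the '\xc3\x81' in the first repr is the UTF-8 byte sequence, and the u'\xc1' in the second repr is the decoded character itself.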

To get a good overview of how Unicode works in Python, I suggest http://nedbatchelder.com/text/unipain.html

Answer 1: 5, accepted, 2013-08-28 11:18:50
