Reading UTF8 encoded CSV and converting to UTF-16
I'm reading in a CSV file that has UTF8 encoding:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print repr(row[0])
This works fine, and prints out what I expect it to print out, a UTF8 encoded str:
> '\xc3\x81lvaro Salazar'
> '\xc3\x89lodie Yung'
...
Furthermore, when I simply print the str (as opposed to repr()) the output displays OK (which I don't understand either; shouldn't this cause an error?):
> Álvaro Salazar
> Élodie Yung
But when I try to convert my UTF8 encoded strs to unicode:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print unicode(name, 'utf-8')  # or name.decode('utf-8')
I get the infamous:
Traceback (most recent call last):
  File "scripts/script.py", line 33, in <module>
    print unicode(fullname, 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)
So I looked at the unicode strings that are created:
ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    unicode_name = unicode(name, 'utf-8')
    print repr(unicode_name)
and the output is
> u'\xc1lvaro Salazar'
> u'\xc9lodie Yung'
So now I'm totally confused, as these seem to be mangled hex values. I've read a related question and it appears I am doing everything correctly, leading me to believe that my file is not actually UTF8, but when I initially print out the repr values of the cells, they appear to be correct UTF8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down (as I'm starting to get lost in the jungle of encodings)?
As an aside, I believe I could use codecs to open the file and read it directly into unicode objects, but the csv module doesn't support unicode natively so I can't use this approach.
Your default encoding is ASCII. When you try to print a unicode object, the interpreter therefore tries to encode it using the ASCII codec, which fails because your text includes characters that don't exist in ASCII.
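A minimal sketch of that failure, written for Python 3 as an assumption (the question uses Python 2, where print performs this encoding step implicitly). In Python 3 the text type is called str rather than unicode, so the ASCII encoding is spelled out by hand here:

```python
# Python 3 sketch: reproduce by hand the ASCII encoding step that
# Python 2's print applies to a unicode object.
name = "\u00c1lvaro Salazar"  # decoded text; first character is U+00C1

try:
    name.encode("ascii")      # ASCII only covers code points 0..127
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)

print(error_message)
```

The message names the same codec, character, and position as the traceback in the question.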
The reason that printing the UTF-8 encoded bytestring doesn't produce an error (which seems to confuse you, although it shouldn't) is that this simply sends the bytes to your terminal. It will never produce a Python error, although it may produce ugly output if your terminal doesn't know what to do with the bytes.
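To illustrate why the bytes pass through untouched, here is a rough Python 3 sketch. The io.BytesIO object is only a stand-in for a terminal (an assumption for the sake of the example), and the two decode calls imitate terminals configured for different encodings:

```python
import io

# The UTF-8 bytes that csv.reader returned in the question.
utf8_bytes = b"\xc3\x81lvaro Salazar"

# Stand-in for a terminal: a byte stream that accepts whatever it is given.
fake_terminal = io.BytesIO()
fake_terminal.write(utf8_bytes)   # no encoding step happens, so nothing can fail
received = fake_terminal.getvalue()

# What gets displayed depends only on how the receiver interprets the bytes.
as_utf8 = received.decode("utf-8")      # a UTF-8 terminal shows the intended text
as_latin1 = received.decode("latin-1")  # a Latin-1 terminal shows mojibake
print(repr(as_utf8), repr(as_latin1))
```

The write succeeds either way; only the second interpretation gives the "ugly output" mentioned above.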
To print a unicode, use print some_unicode.encode('utf-8'). (Or whatever encoding your terminal is actually using.)
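A hedged Python 3 sketch of the same idea: choose the encoding explicitly instead of relying on a default. The fallback to UTF-8 and the use of errors="replace" are illustrative choices, not something the answer prescribes:

```python
import sys

name = "\u00c1lvaro Salazar"

# Use the encoding the output stream reports, falling back to UTF-8
# when none is reported (the fallback is an assumption of this sketch).
target_encoding = getattr(sys.stdout, "encoding", None) or "utf-8"

# errors="replace" substitutes "?" for characters the target encoding
# cannot represent, so this call does not raise UnicodeEncodeError.
encoded = name.encode(target_encoding, errors="replace")

# Encoding to UTF-8 specifically, as suggested in the answer.
utf8_encoded = name.encode("utf-8")
print(repr(utf8_encoded))
```

In Python 2 the equivalent of the last step is print some_unicode.encode('utf-8'), which hands the terminal bytes it can display if it is configured for UTF-8.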
As for the u'\xc1lvaro Salazar', nothing here is mangled. The character Á is at the Unicode code point C1 (which has nothing to do with its UTF-8 representation, but happens to be the same value as in Latin-1), and, to save space, Python uses \x hex escapes instead of \u Unicode code point notation for code points that would have 00 as the most significant byte (it could also have displayed this as \u00c1).
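The distinction between the code point and its encoded bytes can be checked directly. This is a small sketch in Python 3 (an assumption, since the question uses Python 2); the built-in ascii() is used here because it escapes non-ASCII characters much like Python 2's repr does for unicode objects:

```python
char = "\u00c1"  # the character Á

code_point = ord(char)                 # 0xC1, i.e. 193
utf8_bytes = char.encode("utf-8")      # two bytes: 0xC3 0x81
latin1_bytes = char.encode("latin-1")  # one byte, equal to the code point

# The two escape notations denote the same one-character string.
same_string = "\xc1" == "\u00c1"

# ascii() uses the short \x form for code points below 0x100.
escaped = ascii(char + "lvaro Salazar")
print(hex(code_point), utf8_bytes, latin1_bytes, same_string, escaped)
```

So '\xc3\x81' in the first repr is the UTF-8 byte sequence, while u'\xc1' in the second is the single decoded character.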
To get a good overview of how Unicode works in Python, I suggest http://nedbatchelder.com/text/unipain.html