Reading UTF8 encoded CSV and converting to UTF-16

I'm reading in a CSV file that has UTF8 encoding:

import csv

ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]  # first cell of the row, a byte str in Python 2
    print repr(row[0])

This works fine and prints what I expect, a UTF-8 encoded str:

> '\xc3\x81lvaro Salazar'
> '\xc3\x89lodie Yung'
...

Furthermore, when I simply print the str (as opposed to its repr()), the output displays fine (which I don't understand either; shouldn't this cause an error?):

> Álvaro Salazar
> Élodie Yung

But when I try to convert my UTF-8 encoded strs to unicode:

ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    print unicode(name, 'utf-8')  # or name.decode('utf-8')

I get the infamous:

Traceback (most recent call last):                                       
File "scripts/script.py", line 33, in <module>
    print unicode(fullname, 'utf-8')
UnicodeEncodeError: 'ascii' codec can't encode character u'\xc1' in position 0: ordinal not in range(128)

So I looked at the unicode strings that are created:

ifile = open(fname, "r")
for row in csv.reader(ifile):
    name = row[0]
    unicode_name = unicode(name, 'utf-8')
    print repr(unicode_name)

and the output is

> u'\xc1lvaro Salazar'
> u'\xc9lodie Yung'

So now I'm totally confused as these seem to be mangled hex values. I've read this question:

and it appears I am doing everything correctly, leading me to believe that my file is not actually UTF-8. But when I initially print out the repr values of the cells, they appear to be correct UTF-8 hex values. Can anyone either point out my problem or indicate where my understanding is breaking down (as I'm starting to get lost in the jungle of encodings)?


As an aside, I believe I could use codecs to open the file and read it directly into unicode objects, but the csv module doesn't support unicode natively, so I can't use this approach.

Your default encoding is ASCII. When you try to print a unicode object, the interpreter therefore tries to encode it using the ASCII codec, which fails because your text includes characters that don't exist in ASCII.
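The failure can be reproduced without a terminal or a CSV file. The following is a minimal sketch, assuming only that the implicit codec is ASCII as in the traceback; the value of name is copied from the question's output, and try_encode is a hypothetical helper introduced for illustration. It is written to run on both Python 2.7 and Python 3.

```python
name = u'\xc1lvaro Salazar'  # the decoded unicode value from the question


def try_encode(text, encoding):
    """Return the encoded bytes, or the error message if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return str(exc)


# U+00C1 has no ASCII representation, so this yields the error message.
ascii_result = try_encode(name, 'ascii')

# UTF-8 can represent every code point, so this yields bytes.
utf8_result = try_encode(name, 'utf-8')
```

Here ascii_result holds the same "ordinal not in range(128)" message that print produced, while utf8_result holds the original bytes from the file.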

The reason that printing the UTF-8 encoded bytestring doesn't produce an error (which seems to confuse you, although it shouldn't) is that this simply sends the bytes to your terminal. It will never produce a Python error, although it may produce ugly output if your terminal doesn't know what to do with the bytes.

To print a unicode object, use print some_unicode.encode('utf-8') (or whatever encoding your terminal is actually using).
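One way to avoid hard-coding the encoding is to ask the output stream for it. This is only a sketch: encode_for_output is a hypothetical helper, the UTF-8 fallback is an assumption about what your terminal accepts, and the 'replace' error handler is a design choice that trades a traceback for a '?' in the output.

```python
import sys


def encode_for_output(text, encoding=None):
    """Encode a unicode string into bytes for a byte-oriented output stream.

    When no encoding is given, use the one reported by sys.stdout and fall
    back to UTF-8 if none is reported (for example when output is piped
    under Python 2).
    """
    if encoding is None:
        encoding = getattr(sys.stdout, 'encoding', None) or 'utf-8'
    # 'replace' substitutes '?' for characters the target encoding lacks
    # instead of raising UnicodeEncodeError.
    return text.encode(encoding, 'replace')


encoded = encode_for_output(u'\xc1lvaro Salazar', 'utf-8')
degraded = encode_for_output(u'\xc1lvaro Salazar', 'ascii')
```

In Python 2 the result can be passed straight to print. With an ASCII-only target, the accented letter degrades to '?' rather than stopping the script.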

As for the u'\xc1lvaro Salazar', nothing here is mangled. The character Á is at the Unicode code point C1 (which has nothing to do with its UTF-8 representation, but happens to be the same value as in Latin-1), and Python uses \x hex escapes instead of \u Unicode code point notation for code points that would have 00 as the most significant byte, to save space (it could also have displayed this as \u00c1).
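The relationship between the code point and its byte representations can be checked directly. The variable names below are illustrative only; the snippet runs on Python 2.7 and Python 3.

```python
import unicodedata

a_acute = u'\xc1'  # the first character of u'\xc1lvaro Salazar'

code_point = ord(a_acute)                  # the code point as an integer, 0xC1
char_name = unicodedata.name(a_acute)      # the official Unicode character name
utf8_bytes = a_acute.encode('utf-8')       # two bytes in UTF-8: 0xC3 0x81
latin1_bytes = a_acute.encode('latin-1')   # one byte in Latin-1: 0xC1

# \xc1 and \u00c1 are two spellings of the same one-character string.
same_string = (u'\xc1' == u'\u00c1')

# Decoding the UTF-8 bytes from the file gives back the same character.
round_trip = (utf8_bytes.decode('utf-8') == a_acute)
```

So '\xc3\x81' in the first repr and u'\xc1' in the second describe the same letter: the former is its UTF-8 byte sequence, the latter its code point.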

To get a good overview of how Unicode works in Python, I suggest http://nedbatchelder.com/text/unipain.html
