
Python: decode() on a unicode variable, with or without non-ASCII characters

A simple example:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import traceback

e_u = u'abc'
c_u = u'中国'

print sys.getdefaultencoding()
try:
    print e_u.decode('utf-8')
    print c_u.decode('utf-8')
except Exception as e:
    print traceback.format_exc()

reload(sys)
sys.setdefaultencoding('utf-8')
print sys.getdefaultencoding()
try:
    print e_u.decode('utf-8')
    print c_u.decode('utf-8')
except Exception as e:
    print traceback.format_exc()

output:

ascii
abc
Traceback (most recent call last):
  File "test_codec.py", line 15, in <module>
    print c_u.decode('utf-8')
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-1: ordinal not in range(128)

utf-8
abc
中国

While trying to understand Python's codecs thoroughly, a few questions have puzzled me for several days, and I want to check whether my understanding is right:

  1. Under the default ascii encoding, u'abc'.decode('utf-8') raises no error, but u'中国'.decode('utf-8') does.

    I think that when u'中国'.decode('utf-8') runs, Python sees that u'中国' is unicode and first tries u'中国'.encode(sys.getdefaultencoding()). That is the step that fails, which is why the exception is a UnicodeEncodeError rather than a decode error.

    u'abc', on the other hand, contains only code points below 128, the same as 'abc', so there is no error.

  2. In Python 2.x, how does Python store a string's value internally? If every character in a string is < 128, is it treated as ascii, and if any character is > 128, as utf-8?

     In [4]: chardet.detect('abc')
     Out[4]: {'confidence': 1.0, 'encoding': 'ascii'}

     In [5]: chardet.detect('abc中国')
     Out[5]: {'confidence': 0.7525, 'encoding': 'utf-8'}

     In [6]: chardet.detect('中国')
     Out[6]: {'confidence': 0.7525, 'encoding': 'utf-8'}

Short answer

Use encode(), or leave the call out entirely. Don't use decode() on unicode strings; that makes no sense. Also, sys.getdefaultencoding() doesn't help here in any way.

Long answer, part 1: How to do it correctly?

If you define:

c_u = u'中国'

then c_u is already a unicode string. That is, the Python interpreter has already decoded it from the byte string in your source file to a unicode string, using your -*- coding: utf-8 -*- declaration.
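As a rough illustration of that parse-time decoding, the sketch below decodes the raw UTF-8 bytes for the two characters by hand and compares the result with the literal. The name source_bytes is only an illustrative stand-in for what the source file holds on disk, and the snippet is written so that it runs on both Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Assumption: these are the six bytes a UTF-8 encoded source file
# stores for the two characters of the literal below.
source_bytes = b'\xe4\xb8\xad\xe5\x9b\xbd'

c_u = u'中国'

# Decoding the bytes by hand gives the same two code points
# that the interpreter produced for the literal.
decoded = source_bytes.decode('utf-8')

print(decoded == c_u)
print(len(source_bytes))  # number of bytes
print(len(c_u))           # number of characters
```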

If you execute:

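print c_u.encode('utf-8')

your string will be encoded back to UTF-8 and that byte string is sent to the standard output. (Called without an argument, encode() would fall back to the default encoding, ASCII here, and fail for this string.) Note that print usually does the encoding for you, so you can simplify this to: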

print c_u
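A minimal sketch of the explicit encoding step, assuming UTF-8 is the byte encoding you want on output. It uses print() with parentheses so that it runs on Python 2 and Python 3 alike; the variable name utf8_bytes is only illustrative.

```python
# -*- coding: utf-8 -*-
c_u = u'中国'

# Name the codec explicitly instead of relying on the default encoding.
utf8_bytes = c_u.encode('utf-8')

# The result is a byte string: three bytes per character here.
print(len(utf8_bytes))
print(repr(utf8_bytes))
```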

Long answer, part 2: What's wrong with c_u.decode()?

If you execute c_u.decode(), Python will

  1. Try to convert your object (i.e. your unicode string) to a byte string
  2. Try to decode that byte string to a unicode string

Note that this makes no sense if your object is a unicode string in the first place: you just convert it back and forth. But why does it fail? It is a quirk of Python 2 that the first step, i.e. any implicit conversion from a unicode string to a byte string, usually uses sys.getdefaultencoding(), which in turn defaults to ASCII. In other words,

c_u.decode()

translates roughly to:

c_u.encode(sys.getdefaultencoding()).decode()

which is why it fails.
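The two steps can be imitated with a small helper. This is only a sketch of the behaviour described above, not Python's actual implementation; emulate_unicode_decode and its default_encoding parameter are hypothetical names, and the code is written to run on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
def emulate_unicode_decode(text, encoding, default_encoding='ascii'):
    """Hypothetical helper imitating Python 2's unicode.decode()."""
    # Step 1: implicit conversion to a byte string (default encoding).
    as_bytes = text.encode(default_encoding)
    # Step 2: the decode that was actually requested.
    return as_bytes.decode(encoding)

# Pure-ASCII text survives step 1, so the round trip succeeds.
print(emulate_unicode_decode(u'abc', 'utf-8') == u'abc')

# Non-ASCII text already fails in step 1, hence an *encode* error.
failure = None
try:
    emulate_unicode_decode(u'中国', 'utf-8')
except UnicodeEncodeError as exc:
    failure = exc
print(type(failure).__name__)

# With a UTF-8 default, as after sys.setdefaultencoding('utf-8'),
# step 1 succeeds and the text comes back unchanged.
print(emulate_unicode_decode(u'中国', 'utf-8', 'utf-8') == u'中国')
```

This matches the output in the question: 'abc' works under both defaults, while '中国' raises UnicodeEncodeError only while the default encoding is ASCII.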

Note that while you may be tempted to change that default encoding, don't forget that other third-party libraries may contain similar issues, and might break if the default encoding is different from ASCII.

Having said that, I strongly believe that Python would be better off if unicode.decode() had never been defined in the first place. Unicode strings are already decoded; there's no point in decoding them once more, especially in the way Python does.
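If some of your values are byte strings and others are already unicode, one possible way to avoid both the pointless decode() and a changed default encoding is to decode only what is actually bytes. The helper below is a sketch under that assumption; to_text is a hypothetical name, and UTF-8 is assumed as the encoding of incoming bytes.

```python
# -*- coding: utf-8 -*-
def to_text(value, encoding='utf-8'):
    """Hypothetical helper: decode byte strings, pass text through as-is."""
    # On Python 2, bytes is an alias for str, so this also catches
    # plain (non-unicode) string literals.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

# Bytes are decoded exactly once...
print(to_text(b'\xe4\xb8\xad\xe5\x9b\xbd') == u'中国')
# ...and text that is already decoded is left alone.
print(to_text(u'中国') == u'中国')
```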


 