Convert utf-8 string to cp950 encoding in python

I'm handling an encoding problem. My input is a unicode string, such as:

>>> s
u'\xa6\xe8\xac\xc9'

These are actually cp950-encoded bytes. I want to decode them like this (note that there is no "u" prefix):

>>> print unicode('\xa6\xe8\xac\xc9', 'cp950')
西界

However, I don't know how to get rid of that "u". Direct conversion does not work:

>>> str(s)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

The result of using encode() is not what I wanted:

>>> s.encode('utf8')
'\xc2\xa6\xc3\xa8\xc2\xac\xc3\x89'

What I want is '\xa6\xe8\xac\xc9'.

This is a bit of an abuse of the unicode type. Characters in a unicode string are expected to be Unicode codepoints (e.g. u'\u897f\u754c'), and thus are encoding-agnostic. They are not supposed to be bytes from a specific encoding (Python 3 makes this distinction very clear by separating Unicode strings, str, from byte strings, bytes).
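To make the codepoint/byte distinction concrete, here is a small Python 3 sketch (the question itself uses Python 2). The variable names `correct` and `mangled` are illustrative only and do not come from the original post.

```python
# Python 3 sketch: a proper Unicode string versus cp950 bytes
# that were mistakenly stored as code points.
correct = '\u897f\u754c'        # the two intended characters
mangled = '\xa6\xe8\xac\xc9'    # four code points that are really cp950 bytes

# A proper string has one code point per character.
print([hex(ord(c)) for c in correct])

# The mangled string has one code point per *byte* of the cp950 encoding.
print([hex(ord(c)) for c in mangled])

# Encoding the proper string to cp950 yields exactly those byte values.
print(correct.encode('cp950'))
```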

Since you want to just interpret each codepoint as bytes, you can do

u'\xa6\xe8\xac\xc9'.encode('iso-8859-1')

since the first 256 codepoints of Unicode are defined to be equal to the codepoints of ISO-8859-1. However, please try to fix the issue that gave you this incorrect Unicode string in the first place.
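As a rough illustration of the same idea in Python 3 (an assumption on my part, since the question uses Python 2), the Latin-1 round trip followed by a cp950 decode could look like this:

```python
# Python 3 sketch: a str whose code points are really cp950 bytes.
s = '\xa6\xe8\xac\xc9'

# ISO-8859-1 (Latin-1) maps code points 0-255 to identical byte values,
# so encoding with it recovers the original raw bytes.
raw = s.encode('iso-8859-1')

# Decode the recovered bytes with the encoding they were written in.
text = raw.decode('cp950')
print(text)
```

Note that the encode step raises UnicodeEncodeError if the string contains any code point above 255, which would indicate the input is not a pure byte-for-codepoint mix-up.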

So let's get this straight: you have a sequence of bytes that were read in as Unicode codepoints, and you need them to be interpreted as cp950 instead?

>>> ''.join(chr(ord(c)) for c in s)
'\xa6\xe8\xac\xc9'
>>> print ''.join(chr(ord(c)) for c in s).decode('cp950')
西界
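In Python 3, `chr` returns a one-character str rather than a byte, so the join above would not produce bytes there. A minimal sketch of an equivalent, using a hypothetical helper name `reinterpret` that is not part of the original answer:

```python
def reinterpret(text, encoding='cp950'):
    """Treat each code point of `text` as one byte, then decode as `encoding`."""
    # bytes() accepts an iterable of ints in range(256) and raises
    # ValueError if any code point falls outside that range.
    raw = bytes(ord(c) for c in text)
    return raw.decode(encoding)


print(reinterpret('\xa6\xe8\xac\xc9'))
```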


 