简体   繁体   中英

Python UTF-8 Latin-1 displays wrong character

I'm writing a very small script that can convert latin-1 characters into unicode (I'm a complete beginner in Python).

I tried a method like this:

def latin1_to_unicode(character):

    uni = character.decode('latin-1').encode("utf-8")
    retutn uni

It works fine for characters that are not specific to the latin-1 set, but if I try the following example:

print latin1_to_Unicode('å')

It returns Ã¥ instead of å . Same goes for other letters like æ and ø .

Can anyone please explain why this is happening? Thanks

I have the # -*- coding: utf8 -*- declaration in my script, if it matters any to the problem

Your source code is encoded to UTF-8, but you are decoding the data as Latin-1. Don't do that, you are creating a Mojibake .

Decode from UTF-8 instead, and don't encode again . print will write to sys.stdout which will have been configured with your terminal or console codec (detected when Python starts).

My terminal is configured for UTF-8, so when I enter the å character in my terminal, UTF-8 data is produced:

>>> 'å'
'\xc3\xa5'
>>> 'å'.decode('latin1')
u'\xc3\xa5'
>>> print 'å'.decode('latin1')
Ã¥

You can see that the character uses two bytes ; when saving your Python source with an editor configured to use UTF-8, Python reads the exact same bytes from disk to put into your bytestring.

Decoding those two bytes as Latin-1 produces two Unicode codepoints corresponding to the Latin-1 codec.

You probably want to do some studying on the difference between Unicode and encodings, and how that relates to Python:

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM