Python Unicode Bug

I'm making a virtual machine in RPython using PyPy. When I tried to add Unicode support, I ran into an unusual problem. I'll use the letter "á" in my examples.

# The char in the example is á
print len(char)

OUTPUT:
2

I understand that the letter "á" takes two bytes, hence the length of 2. The problem appears when I run the example below.

# In this example instr = "á" (including the quotes)
for char in instr:
    print hex(int(ord(char)))

OUTPUT:
0x22
0xc3
0xa1
0x22

As you can see, there are 4 numbers. The two 0x22 values are the quotes, but there is only 1 letter between the quotes and yet there are two numbers for it. On some machines I tested this script on, I got this output instead:

OUTPUT:
0x22
0xe1
0x22

Is there any way to make the output the same on both machines? The script is exactly the same on each.

The program is not being given the same input on the two machines:

In [154]: '\xe1'.decode('cp1252').encode('utf_8') == '\xc3\xa1'
Out[154]: True

When you type á in a console, you may see the glyph á, but the console is translating that into bytes. The particular bytes it produces depend on the encoding used by the console. On a Windows machine, that may be cp1252, while on a Unix machine it is likely to be utf-8.

So you may see the input as the same, but the console (and thus the program) receives different input.
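As a rough illustration of the point above, the sketch below encodes the same character the way two differently configured consoles might. It is written for Python 3 (the question itself uses Python 2 syntax), and the choice of cp1252 and utf-8 simply follows the examples named in this answer; a given machine may use something else.

```python
# The same character, as two differently configured consoles might send it.
char = "\u00e1"  # á

utf8_bytes = char.encode("utf-8")      # e.g. a typical Unix terminal
cp1252_bytes = char.encode("cp1252")   # e.g. a Western European Windows console

# bytearray yields integers, so hex() can be applied to each byte.
print([hex(b) for b in bytearray(utf8_bytes)])    # ['0xc3', '0xa1']
print([hex(b) for b in bytearray(cp1252_bytes)])  # ['0xe1']
```

The two byte sequences match the two outputs shown in the question (without the surrounding 0x22 quotes).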

If your program decodes the bytes with the appropriate encoding and then works with unicode, both programs will behave the same from that point on. If you are receiving the bytes from sys.stdin, then sys.stdin.encoding will be the encoding Python detects the console is using.
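A minimal sketch of that decode-first approach, assuming Python 3. The helper name code_points and the two sample inputs are made up for illustration; in a real program the encoding argument might come from sys.stdin.encoding rather than being hard-coded.

```python
def code_points(raw, encoding):
    """Decode raw bytes with the given encoding and return hex code points."""
    text = raw.decode(encoding)
    return [hex(ord(ch)) for ch in text]

# Hypothetical inputs: the quoted letter as each console would deliver it.
unix_input = b'"\xc3\xa1"'   # utf-8 bytes
windows_input = b'"\xe1"'    # cp1252 bytes

print(code_points(unix_input, "utf-8"))      # ['0x22', '0xe1', '0x22']
print(code_points(windows_input, "cp1252"))  # ['0x22', '0xe1', '0x22']
```

Once each input is decoded with its own encoding, both machines see the same three code points.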

You have this question tagged "Python-3.x" -- is it possible that some machines are running Python 2.x, and others are running Python 3.x?

The character á is in fact U+00E1, so on a Python 3.x system, I would expect to see your second output. Since strings are Unicode in Python 3 by default, len(char) will be 3 (including the quotes).

In Python 2.x, that same character in a string will be two bytes long, and (depending on your input method) will be represented in UTF-8 as \xc3\xa1. On that system, len(char) will be 4, and you would see your first output.
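The two lengths can be reproduced side by side in Python 3 by comparing a text string with its encoded bytes. This is only a sketch: the variable name instr is borrowed from the question, and encoding to UTF-8 stands in for what a Python 2 bytestring would hold on a UTF-8 terminal.

```python
instr = '"\u00e1"'  # the quoted letter "á" as text

as_text = instr                    # Unicode string: one item per code point
as_utf8 = instr.encode("utf-8")    # bytes: á occupies two bytes

print(len(as_text))  # 3
print(len(as_utf8))  # 4
```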

The issue is that you use bytestrings to work with text data. You should use Unicode instead.

This implies that you need to know the character encoding of your input data: There Ain't No Such Thing As Plain Text.

If you know the character encoding, then it is easy to convert a bytestring to Unicode, e.g.:

unicode_text = bytestring.decode(encoding)

It should resolve your initial issue.

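There are also Unicode normalization forms, which matter because the same visible character can be stored as different code point sequences, e.g.: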

import unicodedata

norm_text = unicodedata.normalize('NFC', unicode_text)
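To make the normalization point concrete, here is a short Python 3 sketch. It assumes the input might contain á in decomposed form (the letter a followed by a combining acute accent), which looks identical on screen but compares unequal until it is normalized.

```python
import unicodedata

composed = "\u00e1"     # á as a single code point
decomposed = "a\u0301"  # a + COMBINING ACUTE ACCENT

print(composed == decomposed)      # False
print(len(composed), len(decomposed))  # 1 2

# NFC folds the pair into the single precomposed code point.
nfc = unicodedata.normalize("NFC", decomposed)
print(nfc == composed)             # True
```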

If I don't change the encoding in the program how can I output unicode characters for example?

You might mean that you have a sequence of bytes, e.g. '\xc3\xa1' (two bytes), that can be interpreted as text using some character encoding; e.g. in utf-8 it is the Unicode code point U+00E1. It may be something different in a different character encoding. Please read the link I've provided above: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Unless your terminal happens to use the same character encoding as the data in your input file, you need to be able to convert from one character encoding to another. Otherwise the output will be corrupted, e.g. instead of á you might get ├б on the screen.
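The corruption described above can be reproduced in a few lines of Python 3. In this sketch, cp866 is assumed as the mismatched terminal encoding because it yields exactly the ├б shown; cp1252 is used as an example target for a correct conversion.

```python
data = "\u00e1".encode("utf-8")  # á stored as UTF-8: b'\xc3\xa1'

# Wrong: interpreting UTF-8 bytes as cp866 gives two unrelated characters.
garbled = data.decode("cp866")
print(garbled == "\u251c\u0431")  # True, i.e. the text is ├б

# Right: decode with the real source encoding, then encode for the target.
converted = data.decode("utf-8").encode("cp1252")
print(converted == b"\xe1")       # True
```

The conversion always goes through Unicode text in the middle: bytes in one encoding are decoded, and the resulting text is encoded again for the destination.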

In ordinary Python, you could use the bytes.decode and unicode.encode methods (or the codecs module directly). I don't know whether that is possible in RPython.

