
Escaping Unicode strings in Python

In Python 2 these three commands print the same emoji:

print "\xF0\x9F\x8C\x80"
🌀
print u"\U0001F300"
🌀
print u"\ud83c\udf00"
🌀

How can I translate between \x, \u and \U escaping? I can't figure out how these hex numbers are equivalent.

The first one is a byte string:

>>> "\xF0\x9F\x8C\x80".decode('utf8')
u'\U0001f300'

The u"\?\?" one is the UTF16 version (four digit unicode escape)

The u"\\U0001F300" one is actual index of the codepoint.


But how do the numbers relate? This is the difficult question. It's defined by the encoding and there is no obvious relationship. To give you an idea, here is an example of "manually" encoding the codepoint at index 0x1F300 into UTF-8:

The cyclone character 🌀 has index 0x1f300 which falls into the range 0x00010000 - 0x001FFFFF. The template for this range is:

11110... 10...... 10...... 10......

Where you fill in the dots with the binary representation of the codepoint. I can't tell you why the template looks like that; it's just how UTF-8 is defined.

Here's the binary representation of our codepoint:

>>> u'🌀'
u'\U0001f300'
>>> unichr(0x1f300)
u'\U0001f300'
>>> bin(0x1f300)
'0b11111001100000000'

So if we take the string template and fill it in like this (with some leading zeros, because the template has more slots than our number has significant bits) we get this:

11110... 10...... 10...... 10......
11110000 10011111 10001100 10000000

Now let's convert that back to hex:

>>> 0b11110000100111111000110010000000
4036988032
>>> hex(4036988032)
'0xf09f8c80'

And there you have the UTF-8 representation of the codepoint.
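If you want to do that bit-twiddling programmatically, here is a minimal Python 2 sketch (the helper name utf8_4byte is made up for illustration):

def utf8_4byte(cp):
    # Manual UTF-8 encoding for codepoints in 0x10000 - 0x1FFFFF,
    # filling the 11110... 10...... 10...... 10...... template above.
    b1 = 0b11110000 | (cp >> 18)               # top 3 bits
    b2 = 0b10000000 | ((cp >> 12) & 0b111111)  # next 6 bits
    b3 = 0b10000000 | ((cp >> 6) & 0b111111)   # next 6 bits
    b4 = 0b10000000 | (cp & 0b111111)          # low 6 bits
    return ''.join(chr(b) for b in (b1, b2, b3, b4))

>>> utf8_4byte(0x1f300)
'\xf0\x9f\x8c\x80'
>>> utf8_4byte(0x1f300) == u'\U0001F300'.encode('utf-8')
True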

For UTF-16 there is a similar magic recipe for your codepoint: subtract 0x10000 from the index, then pad with zeros to get a 20-bit binary representation. The first ten bits are added to 0xD800 to give the first 16-bit code unit, and the last ten bits are added to 0xDC00 to give the second 16-bit code unit.

>>> bin(0x1f300 - 0x10000)[2:].rjust(20, '0')
'00001111001100000000'
>>> _[:10], _[10:]
('0000111100', '1100000000')
>>> hex(0b0000111100 + 0xd800)
'0xd83c'
>>> hex(0b1100000000 + 0xdc00)
'0xdf00'

And there's your UTF-16 version, i.e. the one with the lowercase \u escape.
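The same recipe wrapped in a function; again a Python 2 sketch with a made-up name (utf16_surrogates):

def utf16_surrogates(cp):
    # Split a supplementary codepoint (>= 0x10000) into a
    # UTF-16 surrogate pair, following the recipe above.
    v = cp - 0x10000               # 20-bit value
    high = 0xD800 + (v >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (v & 0x3FF)     # low 10 bits -> low surrogate
    return high, low

>>> [hex(n) for n in utf16_surrogates(0x1f300)]
['0xd83c', '0xdf00']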

As you can see, there is no obvious numerical relationship between the hex digits in these representations; they are just different encodings of the same code point.

Your first string is a byte string. The fact that it prints a single emoji character means that your console is configured to print UTF-8 encoded characters.

Your second string is a Unicode string with a single codepoint, U+1F300. The \U prefix specifies that the next 8 hex digits should be interpreted as a codepoint.

The third string takes advantage of a quirk in the way Unicode strings are stored in Python 2. You've given two UTF-16 code units (a surrogate pair), which together form the single codepoint U+1F300, the same as the previous string. Each \u escape takes the following 4 hex digits. Individually these values wouldn't be valid Unicode characters, but because Python 2 stores its Unicode internally as UTF-16 (on narrow builds) it works out. In Python 3 this wouldn't be valid.

When you print out a Unicode string and your console encoding is known to be UTF-8, the Unicode string is encoded to UTF-8 bytes. Thus all 3 strings end up producing the same byte sequence on the output, generating the same character.
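You can verify this yourself. A minimal Python 2 sketch, assuming your terminal reports UTF-8 (sys.stdout.encoding depends on your environment):

>>> import sys
>>> sys.stdout.encoding    # assumed here; yours may differ
'UTF-8'
>>> u'\U0001F300'.encode(sys.stdout.encoding) == "\xF0\x9F\x8C\x80"
True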

See Unicode Literals in Python Source Code

In Python source code, Unicode literals are written as strings prefixed with the 'u' or 'U' character: u'abcdefghijk'. Specific code points can be written using the \u escape sequence, which is followed by four hex digits giving the code point. The \U escape sequence is similar, but expects 8 hex digits, not 4.

In [1]: "\xF0\x9F\x8C\x80".decode('utf-8')
Out[1]: u'\U0001f300'

In [2]: u'\U0001F300'.encode('utf-8')
Out[2]: '\xf0\x9f\x8c\x80'

In [3]: u'\ud83c\udf00'.encode('utf-8')
Out[3]: '\xf0\x9f\x8c\x80'

\uhhhh     --> Unicode character with 16-bit hex value  
\Uhhhhhhhh --> Unicode character with 32-bit hex value

In Unicode escapes, the first form gives four hex digits to encode a 2-byte (16-bit) character code point, and the second gives eight hex digits for a 4-byte (32-bit) code point. Byte strings support only hex escapes for encoded text and other forms of byte-based data.
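A quick illustration of the difference: \u always consumes exactly four hex digits, so a five-digit code point needs the \U form:

>>> u'\u00e9' == u'\U000000e9'   # same code point, both escape widths
True
>>> len(u'\u1F300')              # \u grabs only 4 digits: U+1F30 plus '0'
2
>>> u'\U0001F300'                # the 8-digit form is needed here
u'\U0001f300'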

The other answers describe how Unicode characters can be encoded or embedded as literals in Python 2.x. Let me answer your more meta question: "it's not obvious to me how \xF0\x9F and 0001 and d83c are the same number?"

The number assigned to each Unicode "code point" (roughly speaking, to each "character") can be encoded in multiple ways. This is similar to how integers can be encoded in several ways:

  • 0b1100100 (binary, base 2)
  • 0144 (octal, base 8)
  • 100 (decimal, base 10)
  • 0x64 (hexadecimal, base 16)

Those are all the same value, decimal 100, with different encodings. The following is a true expression in Python 2 (Python 3 writes the octal literal as 0o144):

0b1100100 == 0144 == 100 == 0x64

Unicode's encodings are a bit more complex, but the principle is the same. Just because the values don't look the same doesn't mean they don't signify the same value. In Python 2:

u'\ud83c\udf00' == u'\U0001F300' == "\xF0\x9F\x8C\x80".decode("utf-8")

Python 3 changes the rules for string literals, but it's still true that:

u'\U0001F300' == b"\xF0\x9F\x8C\x80".decode("utf-8") 

Where the explicit b (bytes) prefix is required. The u (Unicode) prefix is optional, as all strings are Unicode in Python 3, and the u is only permitted again in 3.3 and later. The multi-byte combo characters... well, they weren't that pretty anyway, were they?
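For example, in Python 3 the surrogate-pair literal still parses, but the resulting string can't be encoded (a sketch; the exact error text may vary by version):

>>> '\ud83c\udf00'.encode('utf-8')
Traceback (most recent call last):
  ...
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed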

So you presented various encodings of the Unicode CYCLONE code point, and the other answers showed some ways to move between code points. See this for even more encodings of that one character.
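For instance, here is the same code point under three standard encodings (Python 2 shown; the codec names are standard):

>>> c = u'\U0001F300'
>>> c.encode('utf-8')       # four bytes, per the template above
'\xf0\x9f\x8c\x80'
>>> c.encode('utf-16-be')   # the surrogate pair d83c df00
'\xd8<\xdf\x00'
>>> c.encode('utf-32-be')   # the raw codepoint index, zero-padded
'\x00\x01\xf3\x00'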
