
Getting bytes from a Unicode string in Python

I have a 16-bit big-endian Unicode string represented as u'\u4132'.

How can I split it into the integers 41 and 32 in Python?

Here are a variety of different ways you may want it.

Python 2:

>>> chars = u'\u4132'.encode('utf-16be')
>>> chars
'A2'
>>> ord(chars[0])
65
>>> '%x' % ord(chars[0])
'41'
>>> hex(ord(chars[0]))
'0x41'
>>> ['%x' % ord(c) for c in chars]
['41', '32']
>>> [hex(ord(c)) for c in chars]
['0x41', '0x32']

Python 3:

>>> chars = '\u4132'.encode('utf-16be')
>>> chars
b'A2'
>>> chars = bytes('\u4132', 'utf-16be')
>>> chars  # Just the same.
b'A2'
>>> chars[0]
65
>>> '%x' % chars[0]
'41'
>>> hex(chars[0])
'0x41'
>>> ['%x' % c for c in chars]
['41', '32']
>>> [hex(c) for c in chars]
['0x41', '0x32']

  • Java: "\u4132".getBytes("UTF-16BE")
  • Python 2: u'\u4132'.encode('utf-16be')
  • Python 3: '\u4132'.encode('utf-16be')

These methods return a byte array, which you can easily convert to an int array. Note, however, that code points above U+FFFF are encoded using two code units (so with UTF-16BE this means 32 bits, or 4 bytes).
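As a rough illustration of both points, here is a minimal Python 3 sketch. The variable names are made up for this example, and U+1F4A9 is used only as a convenient code point above U+FFFF.

```python
# Python 3 sketch; all variable names here are illustrative only.

# A BMP character: one UTF-16 code unit, i.e. two bytes.
bmp_bytes = '\u4132'.encode('utf-16be')
bmp_ints = list(bmp_bytes)  # iterating over bytes yields ints

# A character above U+FFFF: two UTF-16 code units (a surrogate pair).
astral_bytes = '\U0001F4A9'.encode('utf-16be')
astral_hex = [format(b, '02x') for b in astral_bytes]
code_units = len(astral_bytes) // 2  # each UTF-16 code unit is 2 bytes

print(bmp_ints)    # [65, 50]
print(astral_hex)  # ['d8', '3d', 'dc', 'a9']
print(code_units)  # 2
```

In Python 2, iterating over the encoded str yields one-character strings rather than ints, so each element would still need ord().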

"Those" aren't integers, it's a hexadecimal number which represents the code point . “那些”不是整数,它是代表代码点的十六进制数。

If you want an integer representation of the code point, use ord(u'\u4132'). To convert that integer back to the Unicode character, use unichr() (chr() in Python 3), which returns a unicode string.

>>> c = u'\u4132'
>>> '%x' % ord(c)
'4132'
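For reference, a small Python 3 sketch of the round trip and of splitting the code point into its two bytes arithmetically. The names are illustrative, and the divmod split only gives two single-byte values for code points up to U+FFFF.

```python
# Python 3 sketch; in Python 2 use a u'' literal and unichr() instead of chr().
c = '\u4132'
code_point = ord(c)           # integer value of the code point (0x4132)

# Split into high and low byte; valid only for code points up to U+FFFF.
high, low = divmod(code_point, 0x100)

round_trip = chr(code_point)  # back to the one-character string

print(code_point)             # 16690
print(hex(high), hex(low))    # 0x41 0x32
print(round_trip == c)        # True
```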

Dirty hack: repr(u'\u4132') will return "u'\\u4132'"

Pass the Unicode character to ord() to get its code point, break that code point into individual bytes with int.to_bytes(), and then format the output however you want:

list(map(lambda b: hex(b)[2:], ord('\u4132').to_bytes(4, 'big')))

returns: ['0', '0', '41', '32']

list(map(lambda b: hex(b)[2:], ord('\N{PILE OF POO}').to_bytes(4, 'big')))

returns: ['0', '1', 'f4', 'a9']

As I mentioned in another comment, encoding the code point to UTF-16 will not work as expected for code points outside the BMP (Basic Multilingual Plane), since UTF-16 needs a surrogate pair to encode those code points.
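To make the difference concrete, here is a hedged Python 3 sketch with two hypothetical helpers (code_point_bytes and utf16be_bytes are names invented for this example, not library functions). The first trims the leading zero bytes that to_bytes(4, 'big') produces and zero-pads each hex value; the second shows what UTF-16BE encoding yields instead.

```python
def code_point_bytes(ch):
    """Hypothetical helper: big-endian bytes of the code point, as hex strings."""
    cp = ord(ch)
    # Smallest number of bytes that can hold the code point (at least one).
    length = max(1, (cp.bit_length() + 7) // 8)
    return [format(b, '02x') for b in cp.to_bytes(length, 'big')]


def utf16be_bytes(ch):
    """Hypothetical helper: bytes of the UTF-16BE encoding, as hex strings."""
    return [format(b, '02x') for b in ch.encode('utf-16be')]


print(code_point_bytes('\u4132'))           # ['41', '32']
print(utf16be_bytes('\u4132'))              # ['41', '32']
print(code_point_bytes('\N{PILE OF POO}'))  # ['01', 'f4', 'a9']
print(utf16be_bytes('\N{PILE OF POO}'))     # ['d8', '3d', 'dc', 'a9']
```

For BMP characters outside the surrogate range the two results match; outside the BMP they diverge, because the encoded bytes are the surrogate pair rather than the code point itself.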
