
Representing multiple values with one character in Python

I have two values, each in the range 0-31. I want to represent both of them in a single character (I mention base 64 only to illustrate what I mean by one character) while still being able to recover both values and know which came first.

Find a nice Unicode block that has at least 1024 contiguous code points, for example CJK Unified Ideographs, and map your 32 * 32 = 1024 possible pairs onto it. In Python 3:

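BASE = 0x4E00  # first code point of the CJK Unified Ideographs block

def char_encode(a, b):
    # Combine two values in range(32) into one offset in range(1024).
    return chr(BASE + a * 32 + b)

def char_decode(c):
    # divmod undoes a * 32 + b and returns the pair (a, b).
    return divmod(ord(c) - BASE, 32)

print(char_encode(17, 3))
# => 倣

print(char_decode('倣'))
# => (17, 3)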

Since you mention Base64: a single Base64 character cannot do this. Each Base64 character carries only 6 bits of data, and you need 10 bits to represent your two numbers.
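
To make the bit count concrete, here is a minimal sketch that spreads the 10 bits over two symbols of the standard Base64 alphabet (6 bits each, 12 bits in total). It only borrows the alphabet and is not a real Base64 encoding of bytes; the names pair_to_b64 and b64_to_pair are hypothetical, made up for this illustration.

```python
import string

# Standard Base64 alphabet: 64 symbols, so each symbol carries 6 bits.
B64_ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

def pair_to_b64(a, b):
    """Hypothetical helper: pack two values in range(32) into two symbols."""
    n = a * 32 + b           # 10-bit number in range(1024)
    hi, lo = divmod(n, 64)   # hi is in range(16), lo is in range(64)
    return B64_ALPHABET[hi] + B64_ALPHABET[lo]

def b64_to_pair(s):
    """Hypothetical helper: recover (a, b) from the two symbols."""
    n = B64_ALPHABET.index(s[0]) * 64 + B64_ALPHABET.index(s[1])
    return divmod(n, 32)

print(pair_to_b64(17, 3))   # Ij
print(b64_to_pair("Ij"))    # (17, 3)
```

Two symbols is the minimum here: one symbol distinguishes only 64 cases, while there are 1024 pairs.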

Also note that while this is only one character, it takes up more than one byte: two bytes in UTF-16 and three bytes in UTF-8 for characters in this block. As noted by others, there is no way to stuff 10 bits of data into an 8-bit byte.
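
The encoded size can be checked directly. This is a small sketch using the example character; the "-le" codec variants are chosen only so that no byte-order mark is counted.

```python
c = chr(0x4E00 + 17 * 32 + 3)  # the example character for the pair (17, 3)

# Number of bytes the single character occupies in each encoding.
sizes = {codec: len(c.encode(codec)) for codec in ("utf-8", "utf-16-le", "utf-32-le")}

print(sizes)  # {'utf-8': 3, 'utf-16-le': 2, 'utf-32-le': 4}
```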


Explanation: a * 32 + b simply maps two numbers in range [0, 32) into a single number in range [0, 1024). For example, 0 * 32 + 0 = 0; 31 * 32 + 31 = 1023. chr finds the Unicode character with that code point, but characters with low code points like 0 are not printable and would be a poor choice, so the result is shifted to the beginning of a nice big Unicode block: 0x4E00 is the hexadecimal representation of 19968, and is the code point of the first character in the CJK Unified Ideographs block. Using the example values, 17 * 32 + 3 = 547 and 19968 + 547 = 20515, or 0x5023 in hexadecimal, which is the code point of the character. Thus, chr(20515) = "倣".

The char_decode function just does all of these operations in reverse: if a * p + b = x and 0 <= b < p, then a, b = divmod(x, p) (see divmod). If c = chr(x), then x = ord(c) (see ord). And I am sure you know that if w + r = y, then r = y - w. So in the example, ord("倣") = 20515; 20515 - 0x4E00 = 547; and divmod(547, 32) is (17, 3).
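
As a sanity check, the round trip can be verified for all 1024 pairs. This is a sketch that repeats the two functions so it runs on its own.

```python
def char_encode(a, b):
    return chr(0x4E00 + a * 32 + b)

def char_decode(c):
    return divmod(ord(c) - 0x4E00, 32)

pairs = [(a, b) for a in range(32) for b in range(32)]
encoded = [char_encode(a, b) for a, b in pairs]

# Every pair survives encode followed by decode...
assert all(char_decode(c) == p for c, p in zip(encoded, pairs))
# ...and all 1024 pairs map to distinct characters.
assert len(set(encoded)) == 1024

# The last character used is still well inside the block (which ends at U+9FFF).
print(hex(ord(encoded[-1])))  # 0x51ff
```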

Values in the range [0, 31] can be stored in 5 bits, since 2**5 == 32. You can therefore unambiguously store two such values in 10 bits. Conversely, you will not be able to unambiguously retrieve two 5-bit values from fewer than 10 bits unless some other conditions hold true.
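
The 10-bit layout can be sketched with shifts and masks. The names pack and unpack are hypothetical; for values in range(32), (a << 5) | b gives the same number as a * 32 + b.

```python
def pack(a, b):
    """Hypothetical helper: a goes in the high 5 bits, b in the low 5 bits."""
    return (a << 5) | b

def unpack(n):
    """Hypothetical helper: split a 10-bit number back into (a, b)."""
    return n >> 5, n & 0b11111

n = pack(17, 3)
print(n, format(n, "010b"))  # 547 1000100011
print(unpack(n))             # (17, 3)
```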

If you are using an encoding that allows 1024 or more distinct characters, you can map your pairs to characters. Otherwise you simply can't. So ASCII (128 characters) is not going to work here, and neither is Latin-1 (256 characters). But any of the usual Unicode encodings, such as UTF-8, UTF-16 or UTF-32, is fine.
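
A quick way to see the difference is to try encoding one such character with each codec. This sketch assumes the character is taken from the CJK Unified Ideographs block, which is just one possible choice; can_encode is a hypothetical helper.

```python
def can_encode(ch, codec):
    """Hypothetical helper: True if the codec can represent ch."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

c = chr(0x4E00 + 17 * 32 + 3)  # assumed example character for the pair (17, 3)

for codec in ("ascii", "latin-1", "utf-8", "utf-16"):
    print(codec, can_encode(c, codec))
```

The ascii and latin-1 codecs report False, because this code point lies far above the 128 and 256 characters those encodings cover.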

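Keep in mind that in UTF-8 the actual character takes up more than 10 bits: 16 or 24 bits for code points above U+007F in the Basic Multilingual Plane, and 24 bits for anything at or above U+0800. If that's a concern, consider UTF-16, which uses 16 bits for any character in the Basic Multilingual Plane.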
