
Encode a string to fixed-width unicode UCS-2 in Python

I need a fixed-width string encoding. From what I understand, UCS-2 and UCS-4 (and also ASCII) are such fixed-width encodings.

From what I understand, Python only supports variable-width UTF-16 via s.encode('utf_16_le'). Is that true? Is there an easy way to encode into a fixed-width Unicode encoding?

Context: I'm storing a string array in raw bytes and need a way to index into it to recover original strings. Index calculation is easier when all characters are fixed-width.

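strings = ['asd', 'def']

# ASCII: one byte per character, so len(s) is also the byte length
bytelens = list(map(len, strings))
data = ''.join(strings).encode('ascii')

# UTF-8: byte length differs from len(s), so encode each string separately
bytelens = []
data = bytearray()  # named `data` to avoid shadowing the built-in `bytes`
for s in strings:
    e = s.encode('utf-8')
    bytelens.append(len(e))
    data.extend(e)

# I need bytelens to later recover the original strings from the byte array `data`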

As you can see, the ASCII variant is very simple, while the UTF-8 variant is more convoluted and 20% slower (probably because of the many allocations and function calls). A true fixed-width UCS-2 would be a solution!

A follow-up question: how can I tell whether my string contains only characters from UCS-1 / UCS-2 / UCS-4? For ASCII there is str.isascii. Any ideas about UCS-2?

You are mixing various concepts.

In Python, you can simply index a string (or a list). The encoded width of each character does not matter, because a str is indexed by code point. I should warn you, though, that even then one code point is not necessarily one character as a user perceives it: if you need such units, you have to keep several code points together (a base character plus its combining characters, e.g. accents).
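To illustrate the point about combining characters, here is a minimal sketch. The sample string is made up for this example, and NFC normalization is only one possible way to merge a base letter with its accent; it does not help for sequences that have no precomposed form.

```python
import unicodedata

# 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points, one visible character
s = "cafe\u0301"

# A str is indexed by code point, regardless of how it would be encoded
n_code_points = len(s)     # counts the combining accent separately
last = s[4]                # the combining accent on its own

# NFC normalization composes 'e' + U+0301 into the single code point U+00E9
composed = unicodedata.normalize("NFC", s)
n_composed = len(composed)

print(n_code_points, n_composed, repr(last))
```

Slicing the original string between index 3 and index 4 would separate the accent from its base letter, which is why splitting at arbitrary positions is risky.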

UTF-16 is variable-width, but for code points inside the UCS-2 range (the Basic Multilingual Plane) it is identical to UCS-2. Only code points outside that range are encoded differently, as a pair of high and low surrogates. So for most purposes the difference does not matter, and if you do have such characters, you sometimes just work with the low and high surrogates (as in many other programming languages that support only UCS-2). This is often not a problem, because you should not split a string at arbitrary positions anyway, only at the end of an entity.
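As a rough sketch of how this could answer the follow-up question: a string fits in UCS-2 when no code point exceeds U+FFFF, and in that case 'utf-16-le' produces exactly two bytes per code point. The helper names `fits_ucs2` and `encode_ucs2` are hypothetical, not part of Python.

```python
def fits_ucs2(s: str) -> bool:
    """Return True if every code point is in the Basic Multilingual Plane."""
    return all(ord(c) <= 0xFFFF for c in s)


def encode_ucs2(s: str) -> bytes:
    """Encode as UTF-16-LE, refusing strings that would need surrogate pairs.

    For BMP-only text the result is fixed-width: 2 bytes per code point.
    Note: a str containing a lone surrogate (U+D800..U+DFFF) passes the
    check above but still makes .encode() raise UnicodeEncodeError.
    """
    if not fits_ucs2(s):
        raise ValueError("string contains code points above U+FFFF")
    return s.encode("utf-16-le")  # the '-le' variant writes no BOM


encoded = encode_ucs2("héllo")
print(len(encoded))  # 2 bytes per code point

# Outside the BMP, UTF-16 needs a surrogate pair (4 bytes for one code point)
emoji_bytes = "\U0001F600".encode("utf-16-le")
print(len(emoji_bytes))
```

With this check in place, the byte offset of character i is simply 2 * i, which is the property the question asks for.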

UCS-4 and UTF-32 are practically the same encoding: each Unicode code point is stored as a 32-bit number. (The differences are purely formal and do not affect Unicode characters: UCS is based on an ISO standard that originally allowed more, higher code points, which were never allocated.)
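If a truly fixed-width encoding for arbitrary text is needed, UTF-32 is one option that Python supports directly. The following is a sketch of the storage scheme from the question under that assumption. The names `WIDTH`, `offsets` and `get` are made up for this example, and the cost is four bytes per code point.

```python
from itertools import accumulate

strings = ["asd", "déf", "\U0001F600!"]

WIDTH = 4  # bytes per code point in UTF-32

# One encode call for the whole array; 'utf-32-le' writes no BOM
data = "".join(strings).encode("utf-32-le")

# Offsets are counted in code points, so plain len(s) is enough
offsets = [0, *accumulate(len(s) for s in strings)]


def get(i: int) -> str:
    """Recover the i-th original string from the raw bytes."""
    start = offsets[i] * WIDTH
    end = offsets[i + 1] * WIDTH
    return data[start:end].decode("utf-32-le")


recovered = [get(i) for i in range(len(strings))]
print(recovered == strings)
```

This keeps the simplicity of the ASCII variant (a single join and a single encode), since the byte length of each string follows directly from its length in code points.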
