
Python3 counting UTF-16 code points in a string

I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow be able to count the number of UTF-16 code points in a string. (I think that in order to do the former, you have to do the latter anyway.)

Sanity check: am I correct that the len() function, when applied to a Python string, returns the number of code points in it, as in UTF-8?

I need to do this because the Language Server Protocol (LSP) requires the offsets to be in UTF-16, and I am trying to build something with LSP in mind.

I can't seem to find how to do this; the only Python LSP server I know of doesn't even handle this conversion itself.

Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.

In Python 3, strings are represented as str objects, which are conceptually vectors of Unicode code points. So the length of a str is the number of Unicode characters it contains, and len("😀") is 1, just as with any other single character. That's independent of the fact that "😀" requires two UTF-16 code units (or four UTF-8 code units).

Python 3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "😀" in UTF-16LE, you would invoke "😀".encode('utf-16-le').
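A quick check of the str/bytes distinction. The emoji "😀" (U+1F600) is just an illustrative astral character; any code point above U+FFFF behaves the same way:

```python
s = "😀"  # U+1F600, a non-BMP character, chosen purely as an example

print(len(s))                     # 1: one code point in the str
print(len(s.encode("utf-8")))     # 4: four bytes in UTF-8
print(len(s.encode("utf-16-le"))) # 4: two 16-bit code units = four bytes
```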

Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UTF-16 Byte Order Mark at the beginning of the encoded stream.

Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a Unicode string by dividing the length of the encoded bytes object by two: len(s.encode('utf-16-le'))//2.
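Wrapped as a helper (the name utf16_units is my own), the divide-by-two trick looks like this:

```python
def utf16_units(s: str) -> int:
    """UTF-16 length of s, in code units, via encoding and dividing by two."""
    return len(s.encode("utf-16-le")) // 2

print(utf16_units("abc"))   # 3: all BMP characters, one code unit each
print(utf16_units("a😀b"))  # 4: the emoji encodes as a surrogate pair
```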

But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with code points less than 65536 (2¹⁶):

def utf16len(c):
    """Returns the length of the single character 'c'
       in UTF-16 code units."""
    return 1 if ord(c) < 65536 else 2
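The original goal was converting LSP's UTF-16 offsets into Python string indexes. Building on the same per-character test as utf16len, one way to sketch that conversion (the function name and error handling are my own choices, not part of any library):

```python
def utf16_offset_to_index(s: str, utf16_offset: int) -> int:
    """Convert a UTF-16 code-unit offset into an index into the str 's'.

    Walks the string, accumulating UTF-16 code units per character
    (1 for BMP characters, 2 for astral ones). Raises ValueError if
    the offset lands inside a surrogate pair or past the end.
    """
    units = 0
    for i, c in enumerate(s):
        if units == utf16_offset:
            return i
        units += 1 if ord(c) < 65536 else 2
        if units > utf16_offset:
            raise ValueError("offset inside a surrogate pair")
    if units == utf16_offset:
        return len(s)  # offset points just past the last character
    raise ValueError("offset past end of string")

# In "a😀b", the emoji occupies UTF-16 offsets 1 and 2:
print(utf16_offset_to_index("a😀b", 0))  # 0 -> "a"
print(utf16_offset_to_index("a😀b", 1))  # 1 -> "😀"
print(utf16_offset_to_index("a😀b", 3))  # 2 -> "b"
```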

For counting the bytes, including the BOM, len(str.encode("utf-16")) would work. You can use utf-16-le to get the bytes without a BOM.

Example:

>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8

As for your question: No, len(str) in Python returns the number of decoded characters (code points). If a character takes 4 bytes in UTF-8, it still counts as 1.
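For instance, mixing characters of different UTF-8 widths (the string "a€😀" is an arbitrary example: a 1-byte, a 3-byte, and a 4-byte character):

```python
s = "a€😀"
print(len(s))                  # 3: three code points, regardless of encoding
print(len(s.encode("utf-8")))  # 8: 1 + 3 + 4 bytes
```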
