I am trying to figure out how to either convert UTF-16 offsets to UTF-8 offsets, or somehow count the number of UTF-16 code units in a string. (I think that to do the former, you have to do the latter anyway.)

Sanity check: am I correct that the len() function, when applied to a Python string, returns the number of code points in its UTF-8 encoding?

I need to do this because the LSP protocol requires offsets to be in UTF-16 code units, and I am trying to build something with LSP in mind. I can't seem to find how to do this; the only Python LSP server I know of doesn't even handle this conversion itself.
Python has two datatypes which can be used for characters, neither of which natively represents UTF-16 code units.
In Python 3, strings are represented as str objects, which are conceptually vectors of Unicode code points. So the length of a str is the number of Unicode characters it contains, and len("😀") is 1, just as with any other single character. That's independent of the fact that "😀" requires two UTF-16 code units (or four UTF-8 code units).
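A quick check makes the distinction concrete (U+1F600 here is an assumed stand-in for any astral-plane character; the lengths hold for all of them):

```python
# One astral-plane character (U+1F600, an example) is a single code point
# for Python's str, but needs two UTF-16 code units (= 4 bytes) and four
# UTF-8 code units (bytes).
s = "\U0001F600"
print(len(s))                      # 1 code point
print(len(s.encode("utf-16-le")))  # 4 bytes = 2 UTF-16 code units
print(len(s.encode("utf-8")))      # 4 UTF-8 code units
```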
Python 3 also has a bytes object, which is a vector of bytes (as its name suggests). You can encode a str into a sequence of bytes using the encode method, specifying some encoding. So if you want to produce the stream of bytes representing the character "😀" in UTF-16LE, you would invoke "😀".encode('utf-16-le').
Specifying le (for little-endian) is important because encode produces a stream of bytes, not UTF-16 code units, and each code unit requires two bytes since it's a 16-bit number. If you don't specify a byte order, as in encode('utf-16'), you'll find a two-byte UTF-16 Byte Order Mark at the beginning of the encoded stream.
Since the UTF-16 encoding requires exactly two bytes for each UTF-16 code unit, you can get the UTF-16 length of a Unicode string by dividing the length of the encoded bytes object by two: len(s.encode('utf-16-le'))//2.
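As a sketch, that encode-and-divide trick looks like this (the name utf16_units is mine, not from the standard library):

```python
def utf16_units(s: str) -> int:
    # Two bytes per UTF-16 code unit; 'utf-16-le' avoids the 2-byte BOM.
    return len(s.encode("utf-16-le")) // 2

print(utf16_units("abc"))           # 3
print(utf16_units("a\U0001F600b"))  # 4: the astral char is a surrogate pair
```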
But that's a pretty clunky way to convert between UTF-16 offsets and character indexes. Instead, you can just use the fact that characters representable with a single UTF-16 code unit are precisely the characters with code points less than 65536 (2^16):
def utf16len(c):
    """Returns the length of the single character 'c'
    in UTF-16 code units."""
    return 1 if ord(c) < 65536 else 2
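For the LSP use case, a helper like this can walk the string once and turn a UTF-16 code-unit offset into a Python str index (a hedged sketch; utf16_offset_to_index is an illustrative name, and it assumes the offset never lands in the middle of a surrogate pair):

```python
def utf16len(c):
    # 1 code unit for BMP characters, 2 (a surrogate pair) otherwise.
    return 1 if ord(c) < 65536 else 2

def utf16_offset_to_index(s: str, offset: int) -> int:
    """Convert a UTF-16 code-unit offset into an index into 's'."""
    units = 0
    for i, c in enumerate(s):
        if units >= offset:
            return i
        units += utf16len(c)
    return len(s)

s = "a\U0001F600b"
print(utf16_offset_to_index(s, 3))  # 2, i.e. it points at 'b'
```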
For counting the bytes, including the BOM, len(str.encode("utf-16")) would work. You can use utf-16-le for bytes without a BOM.
Example:
>>> len("abcd".encode("utf-16"))
10
>>> len("abcd".encode("utf-16-le"))
8
As for your question: no, len(str) in Python returns the number of decoded characters (code points). A character that takes four bytes in UTF-8 still counts as 1.
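Going the other way, i.e. producing the UTF-16 offset that LSP expects from a Python string index, is a one-liner over the same ord() test (index_to_utf16_offset is an illustrative name, not a library function):

```python
def index_to_utf16_offset(s: str, index: int) -> int:
    """UTF-16 code-unit offset of the character at position 'index' in 's'."""
    # Sum the UTF-16 width of every character before 'index'.
    return sum(1 if ord(c) < 65536 else 2 for c in s[:index])

s = "a\U0001F600b"
print(index_to_utf16_offset(s, 2))  # 3: 'a' is 1 unit, the emoji is 2
```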