The JVM specification's description of modified UTF-8 gives the following format for byte v in the "two-times-three-byte format":
This means supplementary characters are represented by six bytes, u, v, w, x, y, and z
Table 4.14. v: 1010 (bits 20-16)-1
Since v is 8 bits and its first four bits are the fixed prefix 1010, (bits 20-16)-1 has to fit in 4 bits. How can the -1 shrink bits 20-16 from 5 bits to 4?
(Supplementary question: is there any reason to say "two-times-three-byte" rather than "six-byte"?)
Unicode code points range from U+0000 to U+10FFFF.
Values greater than U+FFFF are called supplementary code points. Their binary representation is uuuuuxxxxxxxxxxxxxxxx (21 bits), where uuuuu is between 00001 and 10000.
In UTF-16, supplementary code points are encoded as surrogate pairs, as described in 3.9 Unicode Encoding Forms, D91. That is, uuuuuxxxxxxxxxxxxxxxx is represented by two 16-bit code units: 110110wwwwxxxxxx 110111xxxxxxxxxx, where wwww = uuuuu - 1.
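Since 00001 ≤ uuuuu ≤ 10000, it follows that 0000 ≤ wwww ≤ 1111. The five-bit value never exceeds 16 and is never 0, so after subtracting 1 it always fits in 4 bits.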
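The arithmetic above can be checked with a short Java sketch. The class name `SurrogateSplit` and the method `toSurrogates` are made up for illustration; only `Character.highSurrogate` and `Character.lowSurrogate` are real JDK methods, used here as a cross-check.

```java
public class SurrogateSplit {
    /** Splits a supplementary code point into {high, low} UTF-16 surrogates. */
    static int[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int uuuuu = codePoint >>> 16; // bits 20-16: always 1..16, needs 5 bits
        int wwww = uuuuu - 1;         // always 0..15, fits in 4 bits
        // 110110wwwwxxxxxx: prefix, wwww, then bits 15-10 of the code point
        int high = 0xD800 | (wwww << 6) | ((codePoint >>> 10) & 0x3F);
        // 110111xxxxxxxxxx: prefix, then bits 9-0 of the code point
        int low = 0xDC00 | (codePoint & 0x3FF);
        return new int[] { high, low };
    }

    public static void main(String[] args) {
        for (int cp : new int[] { 0x10000, 0x1F600, 0x10FFFF }) {
            int[] s = toSurrogates(cp);
            boolean matchesJdk = s[0] == Character.highSurrogate(cp)
                    && s[1] == Character.lowSurrogate(cp);
            System.out.printf("U+%X -> %04X %04X (matches JDK: %b)%n",
                    cp, s[0], s[1], matchesJdk);
        }
    }
}
```

For U+1F600, uuuuu is 00001, so wwww is 0000 and the pair comes out as D83D DE00.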
Now, modified UTF-8 encodes supplementary code points as if they were two characters: a high surrogate and a low surrogate. Each of these surrogates is encoded with the three-byte UTF-8 pattern. Hence the 'two-times-three' figure.
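As a rough illustration, the six bytes can be built directly from the bit layout quoted from Table 4.14 and compared with what `DataOutputStream.writeUTF` produces (it is documented to write modified UTF-8 after a two-byte length prefix). The class and method names below are hypothetical, and the layouts for w, y, and z follow the same table as v.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class ModifiedUtf8Sketch {
    /** Builds bytes u, v, w, x, y, z for one supplementary code point. */
    static byte[] encodeSupplementary(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int u = 0xED;                                // 11101101
        int v = 0xA0 | ((codePoint >>> 16) - 1);     // 1010 + ((bits 20-16) - 1)
        int w = 0x80 | ((codePoint >>> 10) & 0x3F);  // 10 + bits 15-10
        int x = 0xED;                                // 11101101
        int y = 0xB0 | ((codePoint >>> 6) & 0x0F);   // 1011 + bits 9-6
        int z = 0x80 | (codePoint & 0x3F);           // 10 + bits 5-0
        return new byte[] { (byte) u, (byte) v, (byte) w, (byte) x, (byte) y, (byte) z };
    }

    /** Encodes the same code point via the JDK, dropping the 2-byte length prefix. */
    static byte[] encodeWithJdk(int codePoint) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(new String(Character.toChars(codePoint)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        byte[] bytes = encodeSupplementary(cp);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
        System.out.println(Arrays.equals(bytes, encodeWithJdk(cp)));
    }
}
```

The first three bytes (u, v, w) are the three-byte encoding of the high surrogate and the last three (x, y, z) are that of the low surrogate, which is why the specification describes the format as two times three bytes.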