
Format of v in the JVM modified UTF-8

In the JVM specification, the description of modified UTF-8 gives the format of v for the "two-times-three-byte format":

This means supplementary characters are represented by six bytes, u, v, w, x, y, and z

Table 4.14. v: 1010 (bits 20-16)-1

Since v is 8 bits, (bits 20-16)-1 has to fit in 4 bits. How can the -1 shrink bits 20-16 from 5 bits to 4?

(Supplementary question: is there any reason to say "two-times-three-byte" rather than "six-byte"?)

Unicode code points range from U+0000 to U+10FFFF.

Values greater than U+FFFF are called supplementary code points. Their binary representation is uuuuuxxxxxxxxxxxxxxxx (21 bits), where uuuuu is between 00001 and 10000.

In UTF-16, supplementary code points are encoded as surrogate pairs, as described in the Unicode Standard, section 3.9 Unicode Encoding Forms, D91. That is, uuuuuxxxxxxxxxxxxxxxx is represented by two 16-bit code units:
110110wwwwxxxxxx 110111xxxxxxxxxx, where wwww = uuuuu - 1.

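Since 00001 ≤ uuuuu ≤ 10000, it follows that 0000 ≤ wwww ≤ 1111, so wwww always fits in 4 bits. The "(bits 20-16)-1" in the table entry for v is exactly this wwww.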
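To check the subtraction on concrete values, here is a minimal Java sketch. The class name SurrogateBits and the helper names uuuuu and wwww are made up for illustration; only Character.highSurrogate is a real JDK method. It extracts the 4-bit field from the high surrogate and compares it with bits 20-16 of the code point.

```java
public class SurrogateBits {
    // Bits 20-16 of a supplementary code point (the "uuuuu" field).
    static int uuuuu(int codePoint) {
        return codePoint >>> 16;
    }

    // The 4-bit field stored in the high surrogate 110110wwwwxxxxxx.
    // It sits in bits 9-6 of the 16-bit code unit.
    static int wwww(int codePoint) {
        char high = Character.highSurrogate(codePoint);
        return (high >>> 6) & 0xF;
    }

    public static void main(String[] args) {
        // Smallest, an arbitrary middle, and largest supplementary code points.
        int[] samples = {0x10000, 0x1F600, 0x10FFFF};
        for (int cp : samples) {
            System.out.printf("U+%X: uuuuu=%d wwww=%d%n", cp, uuuuu(cp), wwww(cp));
        }
    }
}
```

For every supplementary code point, wwww should come out as uuuuu minus one, ranging from 0 (for U+10000) to 15 (for U+10FFFF).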

Modified UTF-8 then encodes a supplementary code point as if it were two characters: a high surrogate and a low surrogate. Each of these surrogates is represented by 3 bytes, using the ordinary three-byte UTF-8 format. Hence the "two-times-three-byte" name.
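As a rough illustration, the six bytes can be built by hand from the u, v, w, x, y, z layout and compared with what the JDK produces. This sketch assumes DataOutputStream.writeUTF as the reference, since it is documented to write modified UTF-8 after a two-byte length prefix. The class and method names (ModifiedUtf8Demo, encodeByHand, encodeWithJdk) are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class ModifiedUtf8Demo {
    // Build the six bytes u, v, w, x, y, z for one supplementary code point.
    static byte[] encodeByHand(int cp) {
        return new byte[] {
            (byte) 0xED,                                 // u: 11101101
            (byte) (0xA0 | (((cp >>> 16) - 1) & 0x0F)),  // v: 1010 + (bits 20-16)-1
            (byte) (0x80 | ((cp >>> 10) & 0x3F)),        // w: 10 + bits 15-10
            (byte) 0xED,                                 // x: 11101101
            (byte) (0xB0 | ((cp >>> 6) & 0x0F)),         // y: 1011 + bits 9-6
            (byte) (0x80 | (cp & 0x3F))                  // z: 10 + bits 5-0
        };
    }

    // Reference encoding: writeUTF emits a 2-byte length, then modified UTF-8.
    static byte[] encodeWithJdk(int cp) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(new String(Character.toChars(cp)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buf.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length); // drop the length prefix
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        StringBuilder hex = new StringBuilder();
        for (byte b : encodeByHand(cp)) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
        System.out.println(Arrays.equals(encodeByHand(cp), encodeWithJdk(cp)));
    }
}
```

For U+1F600 both paths should give ED A0 BD ED B8 80: the first three bytes are the high surrogate U+D83D and the last three are the low surrogate U+DE00, each in the three-byte format.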
