
Format of v in the JVM modified UTF-8

Original post 2017-01-10 01:21:41, tags: java, utf-8, jvm

In the JVM specification, the description of modified UTF-8 gives the following format for byte v of the "two-times-three-byte format":

This means supplementary characters are represented by six bytes, u, v, w, x, y, and z

Table 4.14. v: 1010 (bits 20-16)-1

Since v is 8 bits and the fixed prefix 1010 takes 4 of them, (bits 20-16)-1 has to fit in 4 bits. How can the -1 shrink bits 20-16 from 5 bits to 4?

(Supplementary question: is there any reason to say "two-times-three-byte" rather than "six-byte"?)

1 answer

Unicode code points range from U+0000 to U+10FFFF.

Values greater than U+FFFF are called supplementary code points. Their binary representation is uuuuuxxxxxxxxxxxxxxxx (21 bits), where uuuuu is between 00001 and 10000.

In UTF-16, supplementary code points are encoded as surrogate pairs, as described in 3.9 Unicode Encoding Forms, D91. That is, uuuuuxxxxxxxxxxxxxxxx is represented by two 16-bit code units:
110110wwwwxxxxxx 110111xxxxxxxxxx, where wwww = uuuuu - 1.

Since 00001 ≤ uuuuu ≤ 10000, it follows that 0000 ≤ wwww ≤ 1111, so the decremented value always fits in 4 bits.
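The relation wwww = uuuuu - 1 can be checked against the JDK's own surrogate conversion. The following is a minimal sketch, not part of the original answer; the class name SurrogateBits and its helper methods are made up for illustration, and only Character.highSurrogate is a real JDK call.

```java
class SurrogateBits {
    /** Returns the 5-bit field uuuuu (bits 20-16) of a code point. */
    static int uuuuu(int codePoint) {
        return (codePoint >>> 16) & 0x1F;
    }

    /**
     * Returns the 4-bit field wwww taken from the high surrogate
     * 110110wwwwxxxxxx, i.e. bits 9-6 of that 16-bit code unit.
     */
    static int wwww(int codePoint) {
        char high = Character.highSurrogate(codePoint);
        return (high >>> 6) & 0xF;
    }

    public static void main(String[] args) {
        // Walk the first code point of each of the 16 supplementary planes.
        for (int plane = 1; plane <= 16; plane++) {
            int cp = plane << 16;
            System.out.printf("U+%06X  uuuuu=%2d  wwww=%2d%n", cp, uuuuu(cp), wwww(cp));
        }
    }
}
```

For every supplementary plane the printed wwww is one less than uuuuu, running from 0 (plane 1) to 15 (plane 16).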

Now, modified UTF-8 encodes a supplementary code point as if it were two characters: a high surrogate and a low surrogate. Each of these surrogates is represented by 3 bytes in UTF-8. Hence the "two-times-three" figure. The v byte carries the wwww field of the high surrogate, which is why the table writes it as (bits 20-16)-1.
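To see the six bytes in practice, one option is DataOutputStream.writeUTF, which the JDK documents as writing modified UTF-8 preceded by a 2-byte length. The sketch below is illustrative only: the class ModifiedUtf8Demo and its methods encode and expectedV are hypothetical names, and it assumes the input is a single supplementary code point.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ModifiedUtf8Demo {
    /** Encodes one supplementary code point and returns its modified UTF-8 bytes. */
    static byte[] encode(int codePoint) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(new String(Character.toChars(codePoint)));
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for simplicity.
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        // writeUTF prefixes the data with a 2-byte length; drop it.
        return Arrays.copyOfRange(all, 2, all.length);
    }

    /** Byte v as given in the table: 1010 followed by (bits 20-16)-1. */
    static int expectedV(int codePoint) {
        return 0xA0 | (((codePoint >>> 16) & 0x1F) - 1);
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        byte[] bytes = encode(cp);
        StringBuilder hex = new StringBuilder();
        for (byte b : bytes) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        System.out.println(hex.toString().trim());
        System.out.printf("v = %02X, table formula gives %02X%n", bytes[1] & 0xFF, expectedV(cp));
    }
}
```

For U+1F600 this prints ED A0 BD ED B8 80: the first three bytes are the high surrogate U+D83D and the last three the low surrogate U+DE00, each in the ordinary three-byte UTF-8 layout.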