Format of v in the JVM modified UTF-8
In the JVM specification, the description of modified UTF-8 gives the following format for v in the "two-times-three-byte format":
This means supplementary characters are represented by six bytes, u, v, w, x, y, and z
Table 4.14. v: 1010 (bits 20-16)-1
Since v is 8 bits and its first four bits are the fixed prefix 1010, (bits 20-16)-1 has to fit in 4 bits. How can the -1 shrink bits 20-16 from 5 bits to 4 bits?
(Supplementary question: is there any reason to say "two-times-three-byte" rather than "six-byte"?)
Unicode code points range from U+0000 to U+10FFFF.
Values greater than U+FFFF are called supplementary code points. Their binary representation is uuuuuxxxxxxxxxxxxxxxx (21 bits), where uuuuu is between 00001 and 10000.
In UTF-16, supplementary code points are encoded as surrogate pairs, as described in 3.9 Unicode Encoding Forms, D91. That is, uuuuuxxxxxxxxxxxxxxxx is represented by two 16-bit code units:
110110wwwwxxxxxx 110111xxxxxxxxxx, where wwww = uuuuu - 1.
Since 00001 ≤ uuuuu ≤ 10000, it follows that 0000 ≤ wwww ≤ 1111. The five-bit field never exceeds 10000 (16), so after subtracting 1 it always fits in four bits.
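The arithmetic above can be checked with a small sketch. The class name SurrogateSketch and its helper methods are hypothetical, introduced only for illustration. They build the surrogate pair directly from the bit layout quoted above and compare the result with the JDK's own Character.highSurrogate and Character.lowSurrogate.

```java
public class SurrogateSketch {
    // Bits 20-16 of a supplementary code point, minus 1 (the "wwww" field).
    static int wwww(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int uuuuu = codePoint >>> 16; // always in 1..16 for U+10000..U+10FFFF
        return uuuuu - 1;             // therefore always in 0..15, i.e. 4 bits
    }

    // 110110wwwwxxxxxx: prefix, wwww, then bits 15-10 of the code point.
    static char highSurrogate(int codePoint) {
        return (char) (0xD800 | (wwww(codePoint) << 6) | ((codePoint >>> 10) & 0x3F));
    }

    // 110111xxxxxxxxxx: prefix, then bits 9-0 of the code point.
    static char lowSurrogate(int codePoint) {
        return (char) (0xDC00 | (codePoint & 0x3FF));
    }

    public static void main(String[] args) {
        int[] samples = {0x10000, 0x1F600, 0x10FFFF};
        for (int cp : samples) {
            boolean agrees = highSurrogate(cp) == Character.highSurrogate(cp)
                    && lowSurrogate(cp) == Character.lowSurrogate(cp);
            System.out.printf("U+%X: wwww=%d, pair=%04X %04X, matches JDK: %b%n",
                    cp, wwww(cp), (int) highSurrogate(cp), (int) lowSurrogate(cp), agrees);
        }
    }
}
```

The smallest supplementary code point, U+10000, gives wwww = 0 and the largest, U+10FFFF, gives wwww = 15, which covers exactly the four-bit range.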
Now, modified UTF-8 encodes a supplementary code point as if it were two characters: a high surrogate and a low surrogate. Each of these surrogate characters is represented by 3 bytes in UTF-8. Hence the "two-times-three" figure.
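As a rough illustration of the six-byte layout, the sketch below assembles u, v, w, x, y, and z by hand and compares them with what DataOutputStream.writeUTF produces. writeUTF emits modified UTF-8 after a two-byte length prefix. The class ModifiedUtf8Sketch and its method names are hypothetical, and the bit positions for w, y, and z are taken from Table 4.14 of the JVM specification rather than from the excerpt quoted in the question.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class ModifiedUtf8Sketch {
    // Builds the six bytes u, v, w, x, y, z directly from the code point.
    static byte[] encodeSupplementary(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int u = 0xED;                               // 11101101
        int v = 0xA0 | ((codePoint >>> 16) - 1);    // 1010 + (bits 20-16)-1
        int w = 0x80 | ((codePoint >>> 10) & 0x3F); // 10 + bits 15-10
        int x = 0xED;                               // 11101101
        int y = 0xB0 | ((codePoint >>> 6) & 0x0F);  // 1011 + bits 9-6
        int z = 0x80 | (codePoint & 0x3F);          // 10 + bits 5-0
        return new byte[] {(byte) u, (byte) v, (byte) w, (byte) x, (byte) y, (byte) z};
    }

    // Lets the JDK do the encoding, then drops writeUTF's 2-byte length prefix.
    static byte[] viaWriteUTF(int codePoint) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(new String(Character.toChars(codePoint)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        int cp = 0x1F600;
        System.out.println("by hand:  " + hex(encodeSupplementary(cp)));
        System.out.println("writeUTF: " + hex(viaWriteUTF(cp)));
    }
}
```

For U+1F600 both routes should give ED A0 BD ED B8 80: the first three bytes are the ordinary three-byte encoding of the high surrogate D83D, and the last three are that of the low surrogate DE00.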