How can a 21 byte UTF-8 sequence come from just 5 characters?
After writing some basic code to count the number of characters in a String, I've found one example where the UTF-8 encoded output creates 21 bytes from a 5 "character" String.
Here's the output:
String ==¦ อภิชาติ ¦==
Code units 7
UTF8 Bytes 21
8859 Bytes 7
Characters 5
I understand that Java's internal representation of a char is 2 bytes, and that some characters may require two UTF-16 code units to represent them.
As UTF-8 doesn't use more than 4 bytes per character, how is a byte[] length of more than 20 possible for a 5 character String?
Here's the source:
import java.io.UnsupportedEncodingException;

public class StringTest {
    public static void main(String[] args) {
        displayStringInfo("อภิชาติ");
    }

    public static void displayStringInfo(String s) {
        System.out.println("Code units " + s.length());
        try {
            System.out.println("UTF8 Bytes " + s.getBytes("UTF-8").length);
        } catch (UnsupportedEncodingException e) {
            // not handled
        }
        System.out.println("Characters " + characterLength(s));
    }

    // Counts code units that are neither a high surrogate nor a combining mark.
    public static int characterLength(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (!isLeadingUnit(s.charAt(i)) && !isMark(s.charAt(i))) count++;
        }
        return count;
    }

    private static boolean isMark(char ch) {
        int type = Character.getType(ch);
        return (type == Character.NON_SPACING_MARK ||
                type == Character.ENCLOSING_MARK ||
                type == Character.COMBINING_SPACING_MARK);
    }

    private static boolean isLeadingUnit(char ch) {
        return Character.isHighSurrogate(ch);
    }
}
Your "5 character" string actually consists of 7 Unicode code points.
All of them are in the U+0800 to U+FFFF range, which requires 3 bytes per code point in UTF-8, hence a total length of 7×3 = 21 bytes.
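The arithmetic can be checked with a short sketch. The class and method names here (Utf8Breakdown, utf8Length, totalUtf8Length) are made up for illustration and are not part of the question's code. It assumes a well-formed string with no unpaired surrogates, applies the standard UTF-8 length ranges to each code point, and compares the sum with what getBytes reports.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Breakdown {
    // Number of UTF-8 bytes needed for one code point, using the standard ranges.
    public static int utf8Length(int codePoint) {
        if (codePoint < 0x80) return 1;      // U+0000..U+007F
        if (codePoint < 0x800) return 2;     // U+0080..U+07FF
        if (codePoint < 0x10000) return 3;   // U+0800..U+FFFF
        return 4;                            // U+10000..U+10FFFF
    }

    // Sum of the per-code-point lengths (assumes no unpaired surrogates).
    public static int totalUtf8Length(String s) {
        return s.codePoints().map(Utf8Breakdown::utf8Length).sum();
    }

    public static void main(String[] args) {
        // The string from the question, written with escapes to avoid source-encoding issues.
        String s = "\u0E2D\u0E20\u0E34\u0E0A\u0E32\u0E15\u0E34";
        s.codePoints().forEach(cp ->
                System.out.printf("U+%04X -> %d bytes%n", cp, utf8Length(cp)));
        System.out.println("Code points: " + s.codePointCount(0, s.length()));
        System.out.println("Computed:    " + totalUtf8Length(s));
        System.out.println("getBytes:    " + s.getBytes(StandardCharsets.UTF_8).length);
    }
}
```

For the string in the question this should report 7 code points and 21 bytes from both the computed sum and getBytes. Using StandardCharsets.UTF_8 also avoids the checked UnsupportedEncodingException.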
There are 7 characters in the string:
'อ' (0x0e2d) encoded as {0xe0, 0xb8, 0xad}
'ภ' (0x0e20) encoded as {0xe0, 0xb8, 0xa0}
' ิ' (0x0e34) encoded as {0xe0, 0xb8, 0xb4}
'ช' (0x0e0a) encoded as {0xe0, 0xb8, 0x8a}
'า' (0x0e32) encoded as {0xe0, 0xb8, 0xb2}
'ต' (0x0e15) encoded as {0xe0, 0xb8, 0x95}
' ิ' (0x0e34) encoded as {0xe0, 0xb8, 0xb4}
Each symbol is encoded as three bytes in UTF-8, so you have 7 * 3 == 21 bytes altogether.
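A listing like this one can be produced with a small sketch along the following lines. The names Utf8Dump, utf8Hex and isNonSpacingMark are hypothetical helpers, not part of the question's code. It prints the UTF-8 bytes of each code point and flags the ones Java classifies as non-spacing marks, which are the code units the question's characterLength method skips.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Dump {
    // Lowercase hex dump of the UTF-8 bytes of a single code point, space separated.
    public static String utf8Hex(int codePoint) {
        byte[] bytes = new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    // True if the code point has general category Mn (non-spacing mark).
    public static boolean isNonSpacingMark(int codePoint) {
        return Character.getType(codePoint) == Character.NON_SPACING_MARK;
    }

    public static void main(String[] args) {
        // The string from the question, written with escapes.
        String s = "\u0E2D\u0E20\u0E34\u0E0A\u0E32\u0E15\u0E34";
        s.codePoints().forEach(cp ->
                System.out.printf("U+%04X  %s  %s%n", cp, utf8Hex(cp),
                        isNonSpacingMark(cp) ? "non-spacing mark" : "base"));
    }
}
```

The two occurrences of U+0E34 are non-spacing marks, which is why the question's counting method arrives at 5 while the string still holds 7 code points of 3 bytes each.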