How can a 21 byte UTF-8 sequence come from just 5 characters?

After writing some basic code to count the number of characters in a String, I've found an example where a 5-"character" String produces 21 bytes when encoded as UTF-8.

Here's the output:

String ==¦ อภิชาติ ¦==
Code units 7
UTF8 Bytes 21
8859 Bytes 7
Characters 5

I understand that Java represents a char internally as 2 bytes (one UTF-16 code unit), and that some characters may need two code units to represent them.

Since UTF-8 never uses more than 4 bytes per character, how can a 5-character String produce a byte[] longer than 20?

Here's the source:

import java.io.UnsupportedEncodingException;

public class StringTest {

    public static void main(String[] args) {
        displayStringInfo("อภิชาติ");
    }

    public static void displayStringInfo(String s) {
        System.out.println("Code units " + s.length());     
        try {
            System.out.println("UTF8 Bytes " + s.getBytes("UTF-8").length);
        } catch (UnsupportedEncodingException e) {
            // not handled: every Java platform is required to support UTF-8
        }
        System.out.println("Characters " + characterLength(s));
    }

    public static int characterLength(String s) {
        int count = 0;
        for(int i=0; i<s.length(); i++) {
            if(!isLeadingUnit(s.charAt(i)) && !isMark(s.charAt(i))) count++;
        }
        return count;
    }

    private static boolean isMark(char ch) {
        int type = Character.getType(ch);
        return (type == Character.NON_SPACING_MARK ||
               type == Character.ENCLOSING_MARK ||
               type == Character.COMBINING_SPACING_MARK);
    }

    private static boolean isLeadingUnit(char ch) {
        return Character.isHighSurrogate(ch);
    }
}

Your "5 character" string actually consists of 7 Unicode code points:

  • U+0E2D THAI CHARACTER O ANG
  • U+0E20 THAI CHARACTER PHO SAMPHAO
  • U+0E34 THAI CHARACTER SARA I
  • U+0E0A THAI CHARACTER CHO CHANG
  • U+0E32 THAI CHARACTER SARA AA
  • U+0E15 THAI CHARACTER TO TAO
  • U+0E34 THAI CHARACTER SARA I

All of them are in the U+0800 to U+FFFF range, which requires 3 bytes per code point in UTF-8, hence a total length of 7 × 3 = 21 bytes.
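The arithmetic above can be checked with a short sketch. The class name Utf8Lengths and its two helper methods are made up for illustration; only String.codePoints(), String.codePointCount() and String.getBytes(Charset) are standard library calls. The string is written with \u escapes so the result does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

class Utf8Lengths {

    // Bytes UTF-8 needs for a single code point, using the standard ranges:
    // up to U+007F -> 1, up to U+07FF -> 2, up to U+FFFF -> 3, above -> 4.
    static int bytesForCodePoint(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        if (codePoint < 0x10000) return 3;
        return 4;
    }

    // Sum of the per-code-point lengths for a whole (well-formed) string.
    static int utf8Length(String s) {
        return s.codePoints().map(Utf8Lengths::bytesForCodePoint).sum();
    }

    public static void main(String[] args) {
        // The string from the question: U+0E2D U+0E20 U+0E34 U+0E0A U+0E32 U+0E15 U+0E34
        String s = "\u0E2D\u0E20\u0E34\u0E0A\u0E32\u0E15\u0E34";
        System.out.println("Code points " + s.codePointCount(0, s.length())); // 7
        System.out.println("Computed    " + utf8Length(s));                   // 21
        System.out.println("Encoded     " + s.getBytes(StandardCharsets.UTF_8).length); // 21
    }
}
```

The computed total matches the length of the array the encoder actually returns, since each of the 7 code points falls in the 3-byte range.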

There are 7 characters (code points) in the string:

  'อ' (0x0e2d) encoded as {0xe0, 0xb8, 0xad}
  'ภ' (0x0e20) encoded as {0xe0, 0xb8, 0xa0}
  ' ิ' (0x0e34) encoded as {0xe0, 0xb8, 0xb4}
  'ช' (0x0e0a) encoded as {0xe0, 0xb8, 0x8a}
  'า' (0x0e32) encoded as {0xe0, 0xb8, 0xb2}
  'ต' (0x0e15) encoded as {0xe0, 0xb8, 0x95}
  ' ิ' (0x0e34) encoded as {0xe0, 0xb8, 0xb4}

Each symbol is encoded as three bytes in UTF-8, so you have 7 * 3 == 21 bytes altogether.
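A table like the one above can be reproduced with a small dump routine. This is only a sketch: Utf8Dump and hexBytes are hypothetical names, and the output format is chosen to resemble the listing rather than to match it exactly.

```java
import java.nio.charset.StandardCharsets;

class Utf8Dump {

    // Encode one code point as UTF-8 and format its bytes, e.g. "0xe0 0xb8 0xad".
    static String hexBytes(int codePoint) {
        String single = new String(Character.toChars(codePoint));
        byte[] bytes = single.getBytes(StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            // Mask with 0xff so the signed byte prints as an unsigned value.
            sb.append(String.format("0x%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The string from the question, written with escapes to avoid
        // depending on the source file's encoding.
        String s = "\u0E2D\u0E20\u0E34\u0E0A\u0E32\u0E15\u0E34";
        // One line per code point: the code point in U+ notation, then its bytes.
        s.codePoints().forEach(cp ->
            System.out.println(String.format("U+%04X  %s", cp, hexBytes(cp))));
    }
}
```

Iterating with codePoints() rather than charAt() keeps the dump correct for strings that contain surrogate pairs, although this particular string has none.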
