
Check if a String is valid UTF-8 encoded in Java

How do I check whether a string is in valid UTF-8 format?

Only byte data can be checked. If you have constructed a String, it is already UTF-16 internally.

Likewise, only byte arrays can be UTF-8 encoded.
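
Since the check has to happen on bytes, one possible approach is to decode the byte array strictly and see whether the decoder complains. The following is a minimal sketch, not part of the original answer: the class name Utf8Check and the method name isValidUtf8 are made up for illustration, and it assumes the whole input fits in memory.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; the names are illustrative only.
class Utf8Check {

    // Returns true if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw instead of silently
        // substituting U+FFFD for malformed input.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] good = "Hello World".getBytes(StandardCharsets.UTF_8);
        // 0xC3 starts a two-byte sequence, but 0x28 is not a continuation byte
        byte[] bad = {(byte) 0xC3, (byte) 0x28};
        System.out.println(isValidUtf8(good));
        System.out.println(isValidUtf8(bad));
    }
}
```

Note that new String(bytes, StandardCharsets.UTF_8) would not work for this purpose, because it replaces malformed sequences rather than reporting them.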

Here is a common case of UTF-8 conversion.

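import java.io.UnsupportedEncodingException;

String myString = "\u0048\u0065\u006C\u006C\u006F World"; // "Hello World"
System.out.println(myString);
byte[] myBytes = null;

try {
    // Encode the String's UTF-16 contents as UTF-8 bytes
    myBytes = myString.getBytes("UTF-8");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
    System.exit(-1);
}

// Print each byte as a signed decimal value
for (int i = 0; i < myBytes.length; i++) {
    System.out.println(myBytes[i]);
}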

If you don't know the encoding of your byte array, juniversalchardet is a library that can help you detect it.

The following is taken from the official Java tutorials, available at: https://docs.oracle.com/javase/tutorial/i18n/text/string.html

The StringConverter program starts by creating a String containing Unicode characters:

String original = new String("A" + "\u00ea" + "\u00f1" + "\u00fc" + "C");

When printed, the String named original appears as:

 AêñüC 

To convert the String object to UTF-8, invoke the getBytes method and specify the appropriate encoding identifier as a parameter. The getBytes method returns an array of bytes in UTF-8 format. To create a String object from an array of non-Unicode bytes, invoke the String constructor with the encoding parameter. The code that makes these calls is enclosed in a try block, in case the specified encoding is unsupported:

try {
    byte[] utf8Bytes = original.getBytes("UTF8");
    byte[] defaultBytes = original.getBytes();

    String roundTrip = new String(utf8Bytes, "UTF8");
    System.out.println("roundTrip = " + roundTrip);
    System.out.println();
    printBytes(utf8Bytes, "utf8Bytes");
    System.out.println();
    printBytes(defaultBytes, "defaultBytes");
} catch (UnsupportedEncodingException e) {
    e.printStackTrace();
}

The StringConverter program prints out the values in the utf8Bytes and defaultBytes arrays to demonstrate an important point: the length of the converted text might not be the same as the length of the source text. Some Unicode characters translate into single bytes, others into pairs or triplets of bytes. The printBytes method displays the byte arrays by invoking the byteToHex method, which is defined in the source file UnicodeFormatter.java. Here is the printBytes method:

public static void printBytes(byte[] array, String name) {
    for (int k = 0; k < array.length; k++) {
        System.out.println(name + "[" + k + "] = " + "0x" +
            UnicodeFormatter.byteToHex(array[k]));
    }
}
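
The source of UnicodeFormatter.byteToHex is not reproduced here. To run printBytes without that file, a stand-in along the following lines might do. This is only a sketch: the class name HexSketch is hypothetical, and it assumes byteToHex returns two lowercase hex digits, which is what the printed output suggests.

```java
// Hypothetical stand-in for UnicodeFormatter; not the tutorial's actual source.
class HexSketch {

    // Formats one byte as two lowercase hex digits, e.g. (byte) 0xc3 -> "c3".
    static String byteToHex(byte b) {
        // Mask to the range 0..255 so negative bytes are not sign-extended
        return String.format("%02x", b & 0xff);
    }

    public static void main(String[] args) {
        System.out.println("0x" + byteToHex((byte) 0x41));
        System.out.println("0x" + byteToHex((byte) 0xc3));
    }
}
```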

The output of the printBytes method follows. Note that only the first and last bytes, the A and C characters, are the same in both arrays:

utf8Bytes[0] = 0x41
utf8Bytes[1] = 0xc3
utf8Bytes[2] = 0xaa
utf8Bytes[3] = 0xc3
utf8Bytes[4] = 0xb1
utf8Bytes[5] = 0xc3
utf8Bytes[6] = 0xbc
utf8Bytes[7] = 0x43

defaultBytes[0] = 0x41
defaultBytes[1] = 0xea
defaultBytes[2] = 0xf1
defaultBytes[3] = 0xfc
defaultBytes[4] = 0x43
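
The point about differing lengths can be checked directly by comparing String.length() with the number of UTF-8 bytes. The sketch below is an illustration only: the class name Utf8Lengths and the method utf8Length are hypothetical, and the euro sign is an extra sample not taken from the tutorial.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names are illustrative only.
class Utf8Lengths {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source independent of the file encoding
        String[] samples = {
            "A",                       // ASCII: one byte
            "\u00ea",                  // e with circumflex: two bytes
            "\u20ac",                  // euro sign: three bytes
            "A\u00ea\u00f1\u00fcC"     // the tutorial's string
        };
        for (String s : samples) {
            System.out.println(s.length() + " chars, " + utf8Length(s) + " UTF-8 bytes");
        }
    }
}
```

For the tutorial's string this gives 5 chars but 8 UTF-8 bytes, matching the eight utf8Bytes entries above.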


 