
Verifying a string is UTF-8 encoded in Java

There are plenty of examples of how to check whether a string is UTF-8 encoded, for example:

public static boolean isUTF8(String s){
    try{
        byte[]bytes = s.getBytes("UTF-8");
    }catch(UnsupportedEncodingException e){
        e.printStackTrace();
        System.exit(-1);
    }
    return true;
}

The documentation for java.lang.String#getBytes(java.nio.charset.Charset) says:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement byte array.

  1. Is it correct that it always returns correct UTF-8 bytes?
  2. Does it make sense to perform such checks on String objects at all? Won't it always return true, since a String object is already encoded?
  3. As far as I understand, such checks should be performed on bytes, not on String objects:
public static final boolean isUTF8(final byte[] inputBytes) {
    final String converted = new String(inputBytes, StandardCharsets.UTF_8);
    final byte[] outputBytes = converted.getBytes(StandardCharsets.UTF_8);
    return Arrays.equals(inputBytes, outputBytes);
}

But in this case I'm not sure where I should take those bytes from, since getting them straight from the String object would not be correct.

Is it correct that it always returns correct UTF-8 bytes?

Yes.

Does it make sense to perform such checks on String objects at all? Won't it always return true, since a String object is already encoded?

Java strings use Unicode characters encoded in UTF-16. Since UTF-16 uses surrogate pairs, any unpaired surrogate is invalid, so Java strings can contain invalid char sequences.

Java strings can also contain characters that are unassigned in Unicode.

This means that performing validation on a Java String makes sense, though it is very rarely done.
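As a rough illustration of what validating a String could look like, here is a minimal sketch that scans for unpaired surrogates. The class name SurrogateCheck and the method isWellFormedUtf16 are hypothetical names chosen for this example. It only covers the surrogate-pairing issue described above and does not try to detect unassigned code points.

```java
public class SurrogateCheck {

    /**
     * Hypothetical helper: returns true if every surrogate in s is part of a
     * correctly ordered high/low pair. Unassigned code points are not checked.
     */
    static boolean isWellFormedUtf16(String s) {
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // A high surrogate must be followed immediately by a low surrogate.
                if (i + 1 >= s.length() || !Character.isLowSurrogate(s.charAt(i + 1))) {
                    return false;
                }
                i++; // skip the low surrogate that completes the pair
            } else if (Character.isLowSurrogate(c)) {
                // A low surrogate without a preceding high surrogate is unpaired.
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isWellFormedUtf16("abc"));          // plain BMP text
        System.out.println(isWellFormedUtf16("\uD83D\uDE00")); // a valid surrogate pair
        System.out.println(isWellFormedUtf16("\uD800"));       // lone high surrogate
        System.out.println(isWellFormedUtf16("\uDE00\uD83D")); // pair in the wrong order
    }
}
```

The first two calls should report true and the last two false.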

As far as I understand, such checks should be performed on bytes, not on String objects.

Depending on the character set of the bytes, there may be nothing to validate. For example, character set CP437 maps all 256 byte values, so a CP437 byte sequence cannot be invalid.

UTF-8 byte sequences can be invalid, so you're correct that validating bytes is useful.
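To make that contrast concrete, the sketch below decodes the same bytes strictly under two charsets. It assumes ISO-8859-1 as a stand-in for CP437, because ISO-8859-1 also maps all 256 byte values and is guaranteed to be present on every Java platform, whereas CP437 (IBM437) may not be installed. The class name DecodeCheck and the method decodesCleanly are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeCheck {

    /**
     * Hypothetical helper: returns true if bytes decode without error under
     * the given charset. A freshly created decoder reports errors by default.
     */
    static boolean decodesCleanly(byte[] bytes, Charset charset) {
        try {
            charset.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // 0xC3 starts a two-byte UTF-8 sequence, but 0x28 is not a continuation byte.
        byte[] bad = {(byte) 0xC3, (byte) 0x28};
        // 0xC3 0xA9 is the UTF-8 encoding of U+00E9.
        byte[] good = {(byte) 0xC3, (byte) 0xA9};

        System.out.println(decodesCleanly(good, StandardCharsets.UTF_8));     // valid UTF-8
        System.out.println(decodesCleanly(bad, StandardCharsets.UTF_8));      // invalid UTF-8
        System.out.println(decodesCleanly(bad, StandardCharsets.ISO_8859_1)); // every byte is mapped
    }
}
```

Under these assumptions, only the second call should report false: the single-byte charset accepts any input, while UTF-8 rejects the malformed sequence.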


As the javadoc says, getBytes(Charset) always replaces malformed-input and unmappable-character sequences with the charset's default replacement byte array.

That is because it effectively does this:

CharsetEncoder encoder = charset.newEncoder()
        .onMalformedInput(CodingErrorAction.REPLACE)
        .onUnmappableCharacter(CodingErrorAction.REPLACE);

If you want to get the bytes but fail on malformed-input and unmappable-character sequences, use CodingErrorAction.REPORT instead. Since that's actually the default, simply don't call the two onXxx() methods.

Example

String s = "\uD800"; // unpaired surrogate
System.out.println(Arrays.toString(s.getBytes(StandardCharsets.UTF_8)));

That prints [63], which is a ?, i.e. the unpaired surrogate is malformed input, so it was replaced with the replacement byte.

String s = "\uD800"; // unpaired surrogate

CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
ByteBuffer encoded = encoder.encode(CharBuffer.wrap(s.toCharArray()));
byte[] bytes = new byte[encoded.remaining()];
encoded.get(bytes);

System.out.println(Arrays.toString(bytes));

That throws MalformedInputException: Input length = 1, since the default malformed-input action is REPORT.
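The two failure modes named in the javadoc, malformed input and unmappable characters, surface as different exception types when the encoder is left in its default REPORT mode. The following is a minimal sketch of that distinction. The class name StrictEncodeDemo and the method classify are hypothetical. US-ASCII is used for the unmappable case only because UTF-8 can encode every well-formed string.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnmappableCharacterException;

public class StrictEncodeDemo {

    /** Hypothetical helper: encodes s strictly and names the kind of failure, if any. */
    static String classify(String s, Charset charset) {
        try {
            // A new encoder reports errors instead of replacing them.
            charset.newEncoder().encode(CharBuffer.wrap(s));
            return "ok";
        } catch (MalformedInputException e) {
            return "malformed";   // e.g. an unpaired surrogate
        } catch (UnmappableCharacterException e) {
            return "unmappable";  // valid character with no mapping in the target charset
        } catch (CharacterCodingException e) {
            return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("abc", StandardCharsets.UTF_8));
        System.out.println(classify("\uD800", StandardCharsets.UTF_8));
        System.out.println(classify("\u00E9", StandardCharsets.US_ASCII));
    }
}
```

The three calls should report ok, malformed, and unmappable respectively.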

Your function as shown makes no sense. As the documentation says:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.

A String is composed of UTF-16 code units, not UTF-8. A String will NEVER be encoded in UTF-8, but it can ALWAYS be converted to UTF-8 (with any malformed sequences replaced), so your function will ALWAYS return true. "UTF-8" is a standard encoding supported by all Java implementations, so getBytes("UTF-8") will NEVER throw UnsupportedEncodingException, which is raised only when an unsupported charset is requested.

Your function would make more sense only if it took a byte[] as input instead. But even then, decoding, re-encoding, and comparing the results is not efficient. As the documentation says:

The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.

For example:

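import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public static boolean isUTF8(byte[] bytes) {
    try {
        // A strict decoder throws on the first invalid byte sequence.
        StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT)
            .decode(ByteBuffer.wrap(bytes));
    } catch (CharacterCodingException e) {
        return false;
    }
    return true;
}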
