
How come java.lang.String does not validate encoding?

I ran into something that surprised me a little. When trying to build a string from bytes that are not valid UTF-8, the String constructor still gives me a result. No exception is thrown. Example:

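import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

// Two bytes that start a four-byte UTF-8 sequence but stop short
byte[] x = { (byte) 0xf0, (byte) 0xab };
new String(x, "UTF-8"); // This works, or at least gives a result

// This, however, throws java.nio.charset.MalformedInputException: Input length = 3
ByteBuffer wrapped = ByteBuffer.wrap(x);
CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder();
decoder.decode(wrapped);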

Trying the same thing in, for example, Python also gives an error, with a somewhat clearer error message:

   >>> '\xf0\xab'.decode('utf-8')
   Traceback (most recent call last):
     File "<input>", line 1, in <module>
     File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
       return codecs.utf_8_decode(input, errors, True)
   UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: unexpected end of data

So why does the Java String constructor seem to ignore errors in the input?

Update: I should be a little clearer. The Javadoc points out that this is unspecified. But what could the reason be for implementing it like this? It seems to me you would never want this sort of behavior, and any time you can't be 100% sure of the source you would need to use the CharsetDecoder to be safe.

The Java documentation for String(byte[], String) says:

The behavior of this constructor when the given bytes are not valid in the given charset is unspecified. The CharsetDecoder class should be used when more control over the decoding process is required.
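As an illustration of the CharsetDecoder route the documentation recommends, here is a minimal sketch of a strict decoder. The class name StrictUtf8 and the methods decodeStrict and isValidUtf8 are hypothetical names chosen for this example, not part of the JDK. The error actions are set to REPORT explicitly to make the intent visible, although a freshly created decoder already reports errors by default.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StrictUtf8 {

    // Hypothetical helper: decodes bytes as UTF-8 and throws on invalid input
    // instead of silently substituting replacement characters.
    public static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Hypothetical helper: true only if the whole array is well-formed UTF-8.
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            decodeStrict(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The truncated sequence from the question
        byte[] truncated = { (byte) 0xf0, (byte) 0xab };
        System.out.println(isValidUtf8(truncated)); // prints false

        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(valid)); // prints true
    }
}
```

A new decoder is created on every call because CharsetDecoder instances hold state and are not safe to share between threads.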

The constructor String(byte[], Charset) has yet another behavior:

This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The CharsetDecoder class should be used when more control over the decoding process is required.
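To make the replacement behavior concrete, here is a small sketch. The class name LenientUtf8 and the method decodeLenient are hypothetical, introduced only for this example. It uses a single 0xff byte, which can never appear in UTF-8, rather than the truncated sequence from the question, since I am not certain that every JDK version produces the same number of replacement characters for a truncated multi-byte sequence.

```java
import java.nio.charset.StandardCharsets;

public class LenientUtf8 {

    // Hypothetical helper: wraps the String(byte[], Charset) constructor, which
    // substitutes the charset's replacement string for invalid input.
    public static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 'a', an invalid byte, 'b'
        byte[] bad = { (byte) 0x61, (byte) 0xff, (byte) 0x62 };
        String s = decodeLenient(bad);

        // The invalid byte becomes U+FFFD, the replacement for UTF-8
        System.out.println(s.equals("a\uFFFDb")); // prints true
    }
}
```

Checking the result for U+FFFD afterwards is not a reliable validity test, because that character can also appear legitimately in well-formed input.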

I like Python's behavior better. But you can't expect Java to be exactly like Python.
