
Why does Java's String.getBytes() use "ISO-8859-1"?

From java.lang.StringCoding:

String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;

This is what String.getBytes() ends up using on Linux with JDK 7. I was always under the impression that UTF-8 is the default charset?

Thanks

It is a bit complicated ...

Java tries to use the default character encoding when String.getBytes() returns bytes.

  • The default charset is provided by the file.encoding system property.
  • The value is cached, so changing it via System.setProperty(..) after the JVM has started has no effect.
  • If the file.encoding property does not map to a known charset, then UTF-8 is used.
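The caching described in the list above can be checked with a small sketch. The class and method names here are made up for illustration; the only assumption is that Charset.defaultCharset() has been called once before the property is changed, which the method itself guarantees.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, not part of the JDK.
class DefaultCharsetDemo {

    /** Returns true if changing file.encoding at runtime leaves the default charset untouched. */
    static boolean defaultSurvivesPropertyChange() {
        Charset before = Charset.defaultCharset();           // resolved once, then cached
        String original = System.getProperty("file.encoding");
        // Pick a value that differs from the current default.
        String other = "UTF-8".equalsIgnoreCase(before.name()) ? "ISO-8859-1" : "UTF-8";
        System.setProperty("file.encoding", other);
        try {
            return before.equals(Charset.defaultCharset());  // still the cached value
        } finally {
            // Restore the property so the demo has no lasting side effect.
            if (original != null) {
                System.setProperty("file.encoding", original);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("unchanged after setProperty: " + defaultSurvivesPropertyChange());
    }
}
```

If you really need a different default, it has to be supplied at startup (for example with -Dfile.encoding=... on the command line), not from inside the running program.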

.... Here is the tricky part (which is probably never going to come into play) ....

If the system cannot decode or encode strings using the default charset (UTF-8 or another one), then there will be a fallback to ISO-8859-1. If the fallback does not work ... the system will fail!

.... Really ... (gasp!) ... Could it crash if my specified charset cannot be used, and UTF-8 or ISO-8859-1 are also unusable?

Yes. The Java source comments in the StringCoding.encode(...) method state:

// If we can not find ISO-8859-1 (a required encoding) then things are seriously wrong with the installation.

... and then it calls System.exit(1)
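To make the fallback order easier to follow, here is a rough sketch of the same idea written against the public String API. It is not the JDK's code: the class name, the method name and the exception thrown at the end are my own choices (the real code calls System.exit(1) at that point, as quoted above).

```java
import java.io.UnsupportedEncodingException;

// Hypothetical illustration of "try the requested charset, then ISO-8859-1, then give up".
class FallbackSketch {

    static byte[] encodeWithFallback(String s, String charsetName) {
        // Same shape as the line quoted in the question.
        String csn = (charsetName == null) ? "ISO-8859-1" : charsetName;
        try {
            return s.getBytes(csn);
        } catch (UnsupportedEncodingException e) {
            // The requested charset is unknown to this JVM; fall through to the fallback.
        }
        try {
            return s.getBytes("ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is a required charset, so this would mean a broken installation.
            throw new IllegalStateException("ISO-8859-1 is not available", e);
        }
    }

    public static void main(String[] args) {
        // "no-such-charset" is a deliberately unsupported name, so the fallback is used.
        byte[] bytes = encodeWithFallback("\u00e9", "no-such-charset");
        System.out.println("bytes produced via fallback: " + bytes.length);
    }
}
```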


So, why is there an intentional fallback to ISO-8859-1 in the getBytes() method?

It is possible, although not probable, that the user's JVM does not support decoding and encoding in UTF-8 or in the charset specified at JVM startup.

Then, is the default charset used properly in the String class during getBytes()?

No. However, the better question is ...


Does String.getBytes() deliver what it promises?

The contract as defined in the Javadoc is honored:

The behavior of this method when this string cannot be encoded in the default charset is unspecified. The CharsetEncoder class should be used when more control over the encoding process is required.
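As a sketch of what "more control" can look like, the snippet below uses CharsetEncoder with CodingErrorAction.REPORT, so an unmappable character raises an exception instead of being silently replaced. The class and method names are hypothetical; the euro sign is used because ISO-8859-1 has no mapping for it.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
class StrictEncode {

    /** Encodes s in cs and throws if any character cannot be represented. */
    static byte[] encodeStrict(String s, Charset cs) throws CharacterCodingException {
        CharsetEncoder encoder = cs.newEncoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        ByteBuffer buf = encoder.encode(CharBuffer.wrap(s));
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    /** Convenience wrapper: true if s can be encoded in cs without loss. */
    static boolean canEncode(String s, Charset cs) {
        try {
            encodeStrict(s, cs);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String euro = "\u20ac";
        System.out.println("ISO-8859-1 can encode it: " + canEncode(euro, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-8 can encode it: " + canEncode(euro, StandardCharsets.UTF_8));
        // getBytes(Charset) substitutes the charset's replacement byte instead of failing.
        System.out.println("getBytes substitute: " + (char) euro.getBytes(StandardCharsets.ISO_8859_1)[0]);
    }
}
```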


The good news (and a better way of doing things)

It is always advisable to explicitly specify "ISO-8859-1", "US-ASCII", "UTF-8" or whatever character set you want when converting bytes into Strings or vice versa, unless you have previously obtained the default charset and made 100% sure it is the one you need.

Use this method instead:

public byte[] getBytes(String charsetName)
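A short sketch of the explicit style. It uses the getBytes(Charset) overload together with the StandardCharsets constants (available since Java 7), which are not mentioned above but avoid the checked UnsupportedEncodingException of the String-named variant. The class and method names are made up for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the JDK.
class ExplicitCharsetDemo {

    /** Encodes and decodes with the same explicitly named charset. */
    static String roundTrip(String s, Charset cs) {
        byte[] bytes = s.getBytes(cs);
        return new String(bytes, cs);
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // "hello" with an e-acute, written as an escape
        System.out.println("UTF-8 bytes:      " + text.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("ISO-8859-1 bytes: " + text.getBytes(StandardCharsets.ISO_8859_1).length);
        // US-ASCII cannot represent the accented letter, so the round trip is lossy.
        System.out.println("US-ASCII round trip: " + roundTrip(text, StandardCharsets.US_ASCII));
    }
}
```

The result no longer depends on how the machine running the code happens to be configured.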

To find the default for your system, just use:

Charset.defaultCharset()

Hope that helps.

The parameterless String.getBytes() method doesn't use ISO-8859-1 by default. It will use the default platform encoding, if that can be determined. If, however, that's either missing or is an unrecognized encoding, it falls back to ISO-8859-1 as a "default default".

You should very rarely see this in practice. Normally the platform default encoding will be detected correctly.

However, I'd strongly suggest that you specify an explicit character encoding every time you perform an encode or decode operation. Even if you want the platform default, specify that explicitly.
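For example, asking for the platform default explicitly might look like the sketch below (the class and method names are hypothetical). The bytes are the same as with the parameterless call, but the intent is visible to whoever reads the code later.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

// Hypothetical demo class, not part of the JDK.
class ExplicitDefaultDemo {

    /** Encodes with the platform default, but says so in the code. */
    static byte[] encodeWithPlatformDefault(String s) {
        return s.getBytes(Charset.defaultCharset());
    }

    public static void main(String[] args) {
        String text = "plain ascii";
        boolean same = Arrays.equals(encodeWithPlatformDefault(text), text.getBytes());
        System.out.println("same bytes as parameterless getBytes(): " + same);
    }
}
```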

That's for compatibility reasons.

Historically, all Java methods on Windows and Unix that did not specify a charset used the one that was common at the time, that is, "ISO-8859-1".

As mentioned by Isaac and the Javadoc, the default platform encoding is used (see Charset.java):

    public static Charset defaultCharset() {
        if (defaultCharset == null) {
            synchronized (Charset.class) {
                String csn = AccessController.doPrivileged(
                    new GetPropertyAction("file.encoding"));
                Charset cs = lookup(csn);
                if (cs != null)
                    defaultCharset = cs;
                else
                    defaultCharset = forName("UTF-8");
            }
        }
        return defaultCharset;
    }

Always specify the charset when doing string-to-bytes or bytes-to-string conversion.

Do this even when, as is the case for String.getBytes(), you still find a non-deprecated method that takes no charset (most of them were deprecated when Java 1.1 appeared). Just as with endianness, the platform's format is irrelevant; what is relevant is the format defined by the storage standard.

To elaborate on Skeet's answer (which is of course the correct one):

In java.lang.String's source, getBytes() calls StringCoding.encode(char[] ca, int off, int len), which has on its first line:

String csn = Charset.defaultCharset().name();

Then (not immediately, but inevitably) it calls static byte[] StringCoding.encode(String charsetName, char[] ca, int off, int len), which is where the line you quoted comes from. It passes csn as the charsetName, so in that line charsetName will be the name of the default charset if one exists.
