Single UTF-8 char to byte

Question

If I am converting a UTF-8 char to a byte, will there ever be a difference in the result of these 3 implementations based on locale, environment, etc.?

byte a = "1".getBytes()[0];
byte b = "1".getBytes(Charset.forName("UTF-8"))[0];
byte c = '1';

Answer 1

Your first line depends on the environment, because it encodes the string using the default character encoding of your system, which may or may not be UTF-8.

Your second line will always produce the same result, no matter what the locale or the default character encoding of your system is. It will always use UTF-8 to encode the string.
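To see how much the first line can vary, here is a minimal sketch that passes the charset explicitly instead of changing the system default. The class name FirstByteDemo and the helper firstByte are made up for illustration. As far as I know, JDK 18 and later default to UTF-8 unless overridden, so the no-argument getBytes() differs mainly on older JVMs or non-standard setups.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class FirstByteDemo {

    // Returns the first byte of s when encoded with the given charset.
    static byte firstByte(String s, Charset charset) {
        return s.getBytes(charset)[0];
    }

    public static void main(String[] args) {
        // This is the charset that "1".getBytes() would use on this JVM.
        System.out.println("default charset: " + Charset.defaultCharset());

        System.out.println(firstByte("1", StandardCharsets.UTF_8));      // 49
        System.out.println(firstByte("1", StandardCharsets.ISO_8859_1)); // 49
        // Java's "UTF-16" encoder writes a big-endian byte order mark (FE FF) first.
        System.out.println(firstByte("1", StandardCharsets.UTF_16));     // -2
    }
}
```

If the default charset happened to be UTF-16, line a in the question would yield -2 while line b would still yield 49.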

Note that UTF-8 is a variable-length character encoding. Only the first 128 characters (U+0000 to U+007F, the ASCII range) are encoded in one byte; all other characters take up between 2 and 4 bytes. (The original UTF-8 design allowed sequences of up to 6 bytes, but the current definition stops at 4.)
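The variable length is easy to check. This is a small sketch; Utf8LengthDemo and utf8Length are names invented for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class showing how many bytes UTF-8 needs per character.
class Utf8LengthDemo {

    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1 (U+0041, ASCII)
        System.out.println(utf8Length("\u00e9"));       // 2 (U+00E9, e with acute accent)
        System.out.println(utf8Length("\u20ac"));       // 3 (U+20AC, euro sign)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4 (U+1F600, a surrogate pair in Java)
    }
}
```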

Your third line assigns a char constant to a byte. This is an implicit narrowing conversion, and it compiles only because the value of '1' fits in a byte. The byte ends up holding the UTF-16 code unit of the character, since a Java char stores characters as UTF-16 code units. For ASCII characters the UTF-16 code unit has the same numeric value as the single UTF-8 byte, so the result is the same as for the second line, but this is not true for characters in general.
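A short sketch of where the char conversion and the UTF-8 encoding part ways. The names CharNarrowingDemo, narrowed and utf8First are made up for this illustration; an explicit cast is used because non-ASCII char values do not fit in a byte without one.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class comparing a narrowing cast with real UTF-8 encoding.
class CharNarrowingDemo {

    // Keeps only the low 8 bits of the UTF-16 code unit.
    static byte narrowed(char c) {
        return (byte) c;
    }

    // First byte of the character's UTF-8 encoding.
    static byte utf8First(char c) {
        return String.valueOf(c).getBytes(StandardCharsets.UTF_8)[0];
    }

    public static void main(String[] args) {
        // ASCII: both approaches agree.
        System.out.println(narrowed('1') + " " + utf8First('1'));           // 49 49
        // U+00E9: low byte is 0xE9, but UTF-8 is C3 A9.
        System.out.println(narrowed('\u00e9') + " " + utf8First('\u00e9')); // -23 -61
        // U+20AC: low byte is 0xAC, but UTF-8 is E2 82 AC.
        System.out.println(narrowed('\u20ac') + " " + utf8First('\u20ac')); // -84 -30
    }
}
```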

Answer 2

In principle the question is already answered, but I cannot resist posting a little scribble for those who like to play around with code:

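import java.nio.charset.StandardCharsets;

public class EncodingTest {

    private static void checkCharacterConversion(String c) {
        // Explicit charset: independent of the environment.
        byte asUtf8 = c.getBytes(StandardCharsets.UTF_8)[0];
        // No charset argument: uses the JVM's default charset.
        byte asDefaultEncoding = c.getBytes()[0];
        // Narrowing cast: keeps the low 8 bits of the UTF-16 code unit.
        byte directConversion = (byte) c.charAt(0);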
        if (asUtf8 != asDefaultEncoding) {
            System.out.println(String.format(
                "First char of %s has different result in UTF-8 %d and default encoding %d",
                c, asUtf8, asDefaultEncoding));
        }
        if (asUtf8 != directConversion) {
            System.out.println(String.format(
                "First char of %s has different result in UTF-8 %d and direct as byte %d",
                c, asUtf8, directConversion));
        }
    }

    public static void main(String[] argv) {

        // btw: first time I ever wrote a for loop with a char - feels weird to me
        for (char c = '\0'; c <= '\u007f'; c++) {
            String cc = new String(new char[] {c});
            checkCharacterConversion(cc);
        }
    }
}

If you run this with, for example:

java -Dfile.encoding="UTF-16LE"  EncodingTest

you will get no output. But of course every single byte (OK, except for the first) will be wrong if you try:

java -Dfile.encoding="UTF-16BE"  EncodingTest

because in big-endian order the first byte is always zero for ASCII characters. In UTF-16, an ASCII character '\u00xy' is represented by two bytes: [xy, 0] in UTF-16LE and [0, xy] in UTF-16BE.
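The two byte orders can be printed side by side without touching file.encoding. This is only a sketch; Utf16LayoutDemo and hex are names chosen for this example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class printing the UTF-16 byte layout of an ASCII character.
class Utf16LayoutDemo {

    // Formats a byte array as space-separated two-digit hex values.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Little endian: the low byte (the ASCII value) comes first.
        System.out.println(hex("1".getBytes(StandardCharsets.UTF_16LE))); // 31 00
        // Big endian: the zero high byte comes first.
        System.out.println(hex("1".getBytes(StandardCharsets.UTF_16BE))); // 00 31
    }
}
```

This is why taking index [0] of the encoded bytes happens to work under UTF-16LE and fails under UTF-16BE.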

However, only the first if statement produces any output, so b and c are indeed the same for the 128 ASCII characters, because UTF-8 encodes each of them as a single byte. This does not hold for any other characters, however; they all have multi-byte representations in UTF-8.
