Java String encoding (UTF-8)

Question

I have come across this line of legacy code, which I am trying to figure out:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");

As far as I can understand, it is encoding and decoding using the same charset.

How is this different from the following?

String newString = oldString;

Is there any scenario in which the two lines will have different outputs?

PS: Just to clarify, yes, I am aware of Joel Spolsky's excellent article on encoding!

Answer 1

This could be a complicated way of doing

String newString = new String(oldString);

This shortens the String if the underlying char[] it uses is much longer than the String itself (as could happen on older JVMs, where a substring shared its parent's array).

More specifically, though, it forces every character through UTF-8 encoding, so anything that cannot be encoded is replaced rather than preserved.

There are some "characters" you can have in a String which cannot be encoded, and these are turned into '?'.

Any unpaired surrogate, that is, a lone char between \uD800 and \uDFFF, cannot be encoded and will be turned into '?'. A correctly paired high and low surrogate encodes without loss.

String oldString = "\uD800";
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
System.out.println(newString.equals(oldString));

prints

false
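To make the surrogate case concrete, here is a minimal sketch that compares a lone surrogate with a valid surrogate pair. The class and helper names (SurrogateRoundTrip, roundTrip, isEncodable) are made up for illustration and are not from the original code. It uses StandardCharsets.UTF_8 rather than the "UTF-8" charset name, which avoids the checked UnsupportedEncodingException, and CharsetEncoder.canEncode as one possible way to detect the problem before any data is lost.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, for illustration only.
class SurrogateRoundTrip {

    // Encode to UTF-8 bytes and decode again, as the legacy line does.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    // True if the whole string can be encoded as UTF-8 without replacement.
    static boolean isEncodable(String s) {
        return StandardCharsets.UTF_8.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String lone = "\uD800";        // unpaired high surrogate
        String pair = "\uD83D\uDE00";  // valid pair (U+1F600)

        System.out.println(roundTrip(lone).equals(lone)); // false
        System.out.println(roundTrip(lone).equals("?"));  // true
        System.out.println(roundTrip(pair).equals(pair)); // true
        System.out.println(isEncodable(lone));            // false
        System.out.println(isEncodable(pair));            // true
    }
}
```

If the legacy line was meant as a validity check, calling something like isEncodable up front is arguably clearer than silently turning bad input into '?'.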

Answer 2

How is this different from the following?

This line of code here:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");

constructs a new String object (i.e. a copy of oldString), while this line of code:

String newString = oldString;

declares a new variable of type java.lang.String and initializes it to refer to the same String object as the variable oldString.

Is there any scenario in which the two lines will have different outputs?

Absolutely:

String newString = oldString;
boolean isSameInstance = newString == oldString; // isSameInstance == true

vs.

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
boolean isSameInstance = newString == oldString; // isSameInstance == false (new always creates a distinct object)

a_horse_with_no_name (see comment) is right, of course. The equivalent of

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");

is

String newString = new String(oldString);

minus the subtle difference with respect to the encoding that Peter Lawrey explains in his answer.
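The three variants discussed here can be put side by side in a short sketch. The class name CopyVersusAlias, the roundTrip helper and the sample string are illustrative assumptions, not part of the original code. The sketch shows that plain assignment shares the instance, while both new String(oldString) and the UTF-8 round trip produce a distinct object that is equal in content for well-formed input.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
class CopyVersusAlias {

    // The legacy line, written with StandardCharsets to avoid a checked exception.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String oldString = "h\u00E9llo";           // well-formed text with a non-ASCII char

        String alias = oldString;                  // same object, new variable
        String copy = new String(oldString);       // distinct object, same content
        String viaBytes = roundTrip(oldString);    // distinct object, same content here

        System.out.println(alias == oldString);          // true
        System.out.println(copy == oldString);           // false
        System.out.println(viaBytes == oldString);       // false
        System.out.println(copy.equals(oldString));      // true
        System.out.println(viaBytes.equals(oldString));  // true
    }
}
```

The only case where viaBytes.equals(oldString) would be false is the unpaired-surrogate input described in Answer 1.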


Answer 1
22, accepted, 2012-01-13 17:09:37

Answer 2
4, 2012-01-13 16:55:14
