简体   繁体   中英

Java String encoding (UTF-8)

I have come across this line of legacy code, which I am trying to figure out:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

As far as I can understand, it is encoding & decoding using the same charSet.

How is this different from the following?

String newString = oldString;

Is there any scenario in which the two lines will have different outputs?

ps: Just to clarify, yes I am aware of the excellent article on encoding by Joel Spolsky !

This could be complicated way of doing

String newString = new String(oldString);

This shortens the String is the underlying char[] used is much longer.

However more specifically it will be checking that every character can be UTF-8 encoded.

There are some "characters" you can have in a String which cannot be encoded and these would be turned into ?

Any character between \? and \? cannot be encoded and will be turned into '?'

String oldString = "\uD800";
String newString = new String(oldString.getBytes("UTF-8"), "UTF-8");
System.out.println(newString.equals(oldString));

prints

false

How is this different from the following?

This line of code here:

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

constructs a new String object (ie a copy of oldString ), while this line of code:

String newString = oldString;

declares a new variable of type java.lang.String and initializes it to refer to the same String object as the variable oldString .

Is there any scenario in which the two lines will have different outputs?

Absolutely:

String newString = oldString;
boolean isSameInstance = newString == oldString; // isSameInstance == true

vs.

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));
 // isSameInstance == false (in most cases)    
boolean isSameInstance = newString == oldString;

a_horse_with_no_name (see comment) is right of course. The equivalent of

String newString = new String(oldString.getBytes("UTF-8"), "UTF-8"));

is

String newString = new String(oldString);

minus the subtle difference wrt the encoding that Peter Lawrey explains in his answer.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM