Java getBytes UTF-8 encoding

Question

I am trying to deal with an encoding problem (I want to transform the special characters from a string into correct UTF-8 characters...):

When I execute this simple code:

System.out.println(new String("&eacute;".getBytes("UTF-8"), "UTF-8"));

In the console I expect: 'é' but I get

&eacute;

Answer 1

&eacute; is the HTML entity reference for the é character, not a UTF-8 encoded string. To decode it, you can use Commons Lang's org.apache.commons.lang.StringEscapeUtils (in Commons Lang 3 the equivalent method is unescapeHtml4):

String decodedStr = StringEscapeUtils.unescapeHtml("&eacute;");
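
If adding Commons Lang is not an option, the idea can be sketched with the JDK alone. The class TinyEntityDecoder, its decode method, and the small entity table below are hypothetical and deliberately incomplete; this is an illustration of what unescaping does, not a replacement for the library, which knows the full HTML entity list.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyEntityDecoder {
    // Hypothetical, hand-picked table; the real HTML entity list is far longer.
    private static final Map<String, String> NAMED = new HashMap<>();
    static {
        NAMED.put("eacute", "\u00e9");
        NAMED.put("egrave", "\u00e8");
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
    }

    // Matches decimal (&#233;), hexadecimal (&#xE9;) and named (&eacute;) references.
    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]{1,7}|#[xX][0-9a-fA-F]{1,6}|[A-Za-z][A-Za-z0-9]*);");

    // Returns the replacement text, or null when the reference is unknown or invalid.
    private static String resolve(String body) {
        if (body.charAt(0) == '#') {
            boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : null;
        }
        return NAMED.get(body);
    }

    // Replaces every recognised reference; anything unrecognised is left untouched.
    public static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String replacement = resolve(m.group(1));
            if (replacement == null) {
                replacement = m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("&eacute;");
        System.out.println(decoded.length());        // 1
        System.out.println((int) decoded.charAt(0)); // 233, the code point of e acute
    }
}
```

The input is scanned in a single pass, so "&amp;eacute;" decodes to the literal text "&eacute;" rather than being unescaped twice.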

Answer 2

Java Strings know nothing of SGML / XML / HTML5 entities. &eacute; is such an entity. It works in web browsers inside HTML because one of the DTDs, or the HTML5 spec, defines &eacute; as the letter e with acute accent by mapping it to the corresponding Unicode character é (U+00E9).

new String(someString.getBytes("UTF-8"), "UTF-8"); is a meaningless operation: it converts a String into bytes, using an encoding that can represent all meaningful characters, and then converts those bytes back into a String. It's the same thing as using someString directly, except that you get a new object.
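
As a quick check of this point, the following sketch round-trips a string through UTF-8 and compares the result with the input. The class and method names are illustrative and not from the question; it uses StandardCharsets.UTF_8 (available since Java 7) instead of the "UTF-8" string to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class RoundTripDemo {
    // Encodes to UTF-8 bytes and decodes them again with the same charset.
    public static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String entity = "&eacute;";

        // The round trip returns an equal string: the entity is still an entity.
        System.out.println(roundTrip(entity).equals(entity)); // true

        // The entity is eight plain ASCII characters, so eight bytes in UTF-8.
        System.out.println(entity.getBytes(StandardCharsets.UTF_8).length); // 8

        // The actual character U+00E9 encodes to the two bytes 0xC3 0xA9.
        System.out.println(Arrays.toString("\u00e9".getBytes(StandardCharsets.UTF_8))); // [-61, -87]
    }
}
```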

In order to get e with accent acute, you can do one of the following things:

Directly type it, like System.out.println("é"); . This requires that your text editor and your Java compiler agree on the encoding of the source code file. If you're working in a project, it requires that everybody understands and agrees on a particular encoding. Recommended encoding these days certainly is UTF-8.
Use the Unicode escape for the character number. In the case of e acute it would be \u00e9, as in System.out.println("\u00e9"); .
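
A minimal sketch of the escape approach follows; the class name is illustrative. Because the escape is pure ASCII, the source file compiles the same way regardless of the editor's or compiler's encoding. Wrapping System.out in a UTF-8 PrintStream is an assumption about the console: whether the glyph displays correctly still depends on the terminal being set to UTF-8.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class UnicodeEscapeDemo {
    // ASCII-only spelling of e acute, independent of the source file encoding.
    public static final String E_ACUTE = "\u00e9";

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(E_ACUTE.length());        // 1
        System.out.println((int) E_ACUTE.charAt(0)); // 233

        // Write the character as UTF-8 bytes; display depends on the console encoding.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(E_ACUTE);
    }
}
```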

PS: SGML / XML / HTML5 entities have nothing to do with UTF-8.

2 answers

Answer 1: 7, ACCEPTED, 2015-01-07 21:09:13
Answer 2: 1, 2015-01-07 21:11:38
