
Java getBytes UTF-8 encoding

I am trying to deal with an encoding problem (I want to transform the special characters from a string into correct UTF-8 characters...):

When I execute this simple code:

System.out.println(new String("&eacute;".getBytes("UTF-8"), "UTF-8"));

In the console I expect: 'é' but I get

&eacute;

&eacute; is the HTML entity reference for the é character, not a UTF-8 encoded string. To decode it, you can use Commons Lang's org.apache.commons.lang.StringEscapeUtils (in Commons Lang 3 the corresponding method is named unescapeHtml4):

String decodedStr = StringEscapeUtils.unescapeHtml("&eacute;");
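
If pulling in a library is not an option, the idea behind entity decoding can be sketched by hand. This is only an illustration, not how Commons Lang implements it: the class name TinyEntityDecoder and its five-entry table are hypothetical, and real HTML defines far more named entities plus numeric references.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TinyEntityDecoder {
    // Hypothetical, deliberately tiny table: entity name -> replacement text.
    private static final Map<String, String> ENTITIES = new HashMap<>();
    static {
        ENTITIES.put("eacute", "\u00E9");
        ENTITIES.put("egrave", "\u00E8");
        ENTITIES.put("lt", "<");
        ENTITIES.put("gt", ">");
        ENTITIES.put("amp", "&");
    }

    // Matches named references such as &eacute; (numeric references are not handled).
    private static final Pattern NAMED = Pattern.compile("&([A-Za-z]+);");

    // Replaces known named entities in a single pass; unknown ones are left untouched.
    static String decode(String html) {
        Matcher m = NAMED.matcher(html);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            String replacement = ENTITIES.get(m.group(1));
            String text = (replacement != null) ? replacement : m.group();
            m.appendReplacement(sb, Matcher.quoteReplacement(text));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("&eacute;");
        System.out.println(decoded.length());        // 1
        System.out.println((int) decoded.charAt(0)); // 233
    }
}
```

Decoding in a single pass means that the output of one replacement is never scanned again, so "&amp;eacute;" becomes the literal text "&eacute;" rather than being decoded twice.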

Java Strings know nothing of SGML / XML / HTML5 entities, and &eacute; is such an entity. It works in web browsers inside HTML because one of the DTDs, or the HTML5 spec, defines &eacute; as the letter e with an acute accent by mapping it to the corresponding Unicode character, é (U+00E9).

new String(someString.getBytes("UTF-8"), "UTF-8"); is a meaningless operation: it converts a String into bytes, using an encoding that can represent all meaningful characters, and then converts those bytes back into a String. The result is the same as using someString directly, except that you get a new object.
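
To see that the round trip changes nothing, here is a small sketch. The class and method names are made up for illustration, and it uses StandardCharsets.UTF_8 instead of the "UTF-8" string only to avoid the checked exception.

```java
import java.nio.charset.StandardCharsets;

public class RoundTripDemo {
    // Encodes the string to UTF-8 bytes and immediately decodes them again.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String entity = "&eacute;";
        String result = roundTrip(entity);

        System.out.println(result);                // &eacute;
        System.out.println(result.equals(entity)); // true: same characters
        System.out.println(result == entity);      // false: a different object
    }
}
```

The entity text consists of eight plain ASCII characters, so it comes back exactly as it went in; nothing in the encode/decode step knows that it is supposed to stand for é.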

In order to get e with accent acute, you can do one of the following things:

  • Type it directly, as in System.out.println("é"); . This requires that your text editor and your Java compiler agree on the encoding of the source file. If you're working on a project, it requires that everybody understands and agrees on a particular encoding. The recommended encoding these days is UTF-8.
  • Use the Unicode character number in an escape sequence. In the case of e acute it would be "\u00E9".
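
A minimal sketch of the second option follows. The class name is hypothetical, and wrapping System.out in a UTF-8 PrintStream is an extra precaution that only helps if the console is itself configured for UTF-8.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class UnicodeEscapeDemo {
    // The compiler resolves the escape, so the source file can stay pure ASCII.
    static final String E_ACUTE = "\u00E9";

    public static void main(String[] args) throws UnsupportedEncodingException {
        System.out.println(E_ACUTE.length());        // 1
        System.out.println((int) E_ACUTE.charAt(0)); // 233

        // An explicitly UTF-8 stream avoids relying on the platform default charset.
        PrintStream out = new PrintStream(System.out, true, "UTF-8");
        out.println(E_ACUTE);
    }
}
```

With the escape, the source file encoding no longer matters for this character, which makes it the safer choice when the team has not settled on one encoding.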

PS: SGML / XML / HTML5 entities have nothing to do with UTF-8.
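
One way to convince yourself of this is to look at the bytes. The sketch below (class and helper names are made up for illustration) prints the UTF-8 encoding of the real character next to that of the entity text.

```java
import java.nio.charset.StandardCharsets;

public class Utf8BytesDemo {
    // Formats the UTF-8 bytes of a string as space-separated uppercase hex pairs.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex("\u00E9"));   // C3 A9
        System.out.println(utf8Hex("&eacute;")); // 26 65 61 63 75 74 65 3B
    }
}
```

The character é encodes to two bytes, while the entity is simply the eight ASCII characters &, e, a, c, u, t, e and ; written out one byte each.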
