Java getBytes UTF-8 encoding

Question

I am trying to deal with an encoding problem (I want to transform the special characters in a string into correct UTF-8 characters...):

When I execute this simple code:

System.out.println(new String("&eacute;".getBytes("UTF-8"), "UTF-8"));

In the console I expect 'é', but I get:

&eacute;

Answer 1

&eacute; is the HTML entity reference for the é character, not a UTF-8 encoded string. To decode it, you can use Commons Lang's org.apache.commons.lang.StringEscapeUtils:

String decodedStr = StringEscapeUtils.unescapeHtml("&eacute;");

Answer 2

Java Strings know nothing of SGML / XML / HTML5 entities. &eacute; is such an entity. It works in web browsers inside HTML because one of the DTDs, or the HTML5 spec, defines &eacute; as the letter e with an acute accent by mapping it to the corresponding Unicode character reference &#xE9;.
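To make the "entities are just a lookup table defined outside Java" point concrete, here is a minimal sketch of a decoder using only the JDK (Java 9+ assumed). The class name TinyEntityDecoder and its five-entry table are hypothetical and made up for illustration; the real HTML entity list is far longer, so for actual work a library such as the Commons Lang class from Answer 1 is the safer choice.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TinyEntityDecoder {
    // Hypothetical, deliberately tiny table: entity name -> Unicode code point.
    private static final Map<String, Integer> NAMED = Map.of(
            "eacute", 0xE9,
            "amp", 0x26,
            "lt", 0x3C,
            "gt", 0x3E,
            "quot", 0x22);

    // Group 1: hex numeric reference, group 2: decimal numeric reference, group 3: entity name.
    private static final Pattern ENTITY = Pattern.compile(
            "&(?:#[xX]([0-9a-fA-F]{1,6})|#([0-9]{1,7})|([A-Za-z][A-Za-z0-9]*));");

    // Replaces every reference it recognises with the character it stands for.
    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            int codePoint;
            if (m.group(1) != null) {
                codePoint = Integer.parseInt(m.group(1), 16);
            } else if (m.group(2) != null) {
                codePoint = Integer.parseInt(m.group(2));
            } else {
                codePoint = NAMED.getOrDefault(m.group(3), -1);
            }
            String replacement = Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : m.group(); // unknown name or out-of-range number: leave the text as it was
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("&eacute;");
        System.out.println(decoded.length());           // 1
        System.out.println((int) decoded.charAt(0));    // 233
    }
}
```

Nothing here involves UTF-8 at all: the decoding is a plain string-to-string substitution, and an encoding only comes into play when the resulting String is later written out as bytes.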

new String(someString.getBytes("UTF-8"), "UTF-8"); is a meaningless operation: it converts a String into bytes using an encoding that can represent all meaningful characters, then converts those bytes back into a String. The result is the same as using someString directly, except that you have a new object.
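A small sketch of that round trip, in case it helps to see it run. The class and method names (RoundTripDemo, roundTrip) are made up for this example, and it uses StandardCharsets.UTF_8 instead of the "UTF-8" string only to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encodes a string to UTF-8 bytes and decodes those bytes straight back.
    static String roundTrip(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String entity = "&eacute;";
        String copy = roundTrip(entity);

        System.out.println(copy.equals(entity)); // true: same characters as before
        System.out.println(copy == entity);      // false: but a different String object
        System.out.println(copy.length());       // 8: still the eight ASCII characters of the entity
    }
}
```

The eight characters & e a c u t e ; go in and the same eight characters come out, which is why the console still shows the entity text.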

To get e with an acute accent, you can do one of the following:

Type it directly, like System.out.println("é");. This requires that your text editor and your Java compiler agree on the encoding of the source file. If you're working in a project, everybody has to understand and agree on a particular encoding. The recommended encoding these days is certainly UTF-8.
Use the Unicode character number. In the case of e acute it would be \u00e9.
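As a rough illustration of the second option, the sketch below writes the character as a Unicode escape so the source file stays pure ASCII, then shows what the character looks like once it is encoded as UTF-8. The names EAcuteDemo and utf8Hex are invented for the example.

```java
import java.nio.charset.StandardCharsets;

class EAcuteDemo {
    // e with acute accent, written as a Unicode escape so the source file can stay ASCII.
    static final String E_ACUTE = "\u00e9";

    // Formats the UTF-8 encoding of a string as space-separated uppercase hex bytes.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(E_ACUTE.length());          // 1
        System.out.println((int) E_ACUTE.charAt(0));   // 233
        System.out.println(utf8Hex(E_ACUTE));          // C3 A9
    }
}
```

Note that printing E_ACUTE itself only shows é if the console's encoding can represent it, which is a separate setting from anything in the code.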

PS: SGML / XML / HTML5 entities have nothing to do with UTF-8.

2 answers

Answer 1: score 7, accepted, 2015-01-07 21:09:13

Answer 2: score 1, 2015-01-07 21:11:38