
Java getBytes UTF-8 encoding

I am trying to deal with an encoding problem: I want to transform the special characters in a string into the correct UTF-8 characters.

When I execute this simple code:

System.out.println(new String("&eacute;".getBytes("UTF-8"), "UTF-8"));

In the console I expect 'é', but I get:

&eacute;

&eacute; is the HTML entity reference for the é character, not a UTF-8 encoded string. To decode it, you can use Commons Lang's org.apache.commons.lang.StringEscapeUtils:

String decodedStr = StringEscapeUtils.unescapeHtml("&eacute;");

Java Strings know nothing of SGML / XML / HTML5 entities. &eacute; is such an entity. It works in web browsers inside HTML because one of the DTDs, or the HTML5 spec, defines &eacute; as the letter e with an acute accent by mapping it to the corresponding Unicode character, é (U+00E9).
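To make the mapping concrete, here is a minimal sketch of what an entity decoder does: it looks the name up in a table and substitutes the Unicode character. The class EntitySketch, its decode method, and the two-entry table are hypothetical and exist only for illustration (Map.of assumes Java 9 or later). For real input, a complete implementation such as StringEscapeUtils is the better choice.

```java
import java.util.Map;

final class EntitySketch {
    // Hypothetical, deliberately tiny table: entity name -> Unicode code point.
    // The real HTML tables contain far more entries.
    private static final Map<String, Integer> NAMED = Map.of(
            "eacute", 0xE9,
            "amp", 0x26);

    // Replaces known named entities; everything else is copied unchanged.
    static String decode(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            // Only look for a closing ';' when the current char starts an entity.
            int semi = s.charAt(i) == '&' ? s.indexOf(';', i) : -1;
            Integer codePoint = semi > i ? NAMED.get(s.substring(i + 1, semi)) : null;
            if (codePoint != null) {
                out.appendCodePoint(codePoint);
                i = semi + 1;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String decoded = decode("caf&eacute;");
        System.out.println(decoded.length());                          // 4
        System.out.println(Integer.toHexString(decoded.codePointAt(3))); // e9
    }
}
```

Nothing in this sketch involves UTF-8: the substitution happens on characters, before any bytes are produced.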

new String(someString.getBytes("UTF-8"), "UTF-8"); is a meaningless operation: it converts a String into bytes using an encoding that can represent all meaningful characters, and then converts those bytes back into a String. It is the same as using someString directly, except that you get a new object.
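The round trip can be checked with a short sketch. The class name Utf8RoundTrip and its roundTrip helper are made up for this example, and it uses StandardCharsets.UTF_8 instead of the "UTF-8" string so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Utf8RoundTrip {
    // Encodes the string as UTF-8 bytes and immediately decodes them again.
    static String roundTrip(String s) {
        byte[] bytes = s.getBytes(StandardCharsets.UTF_8);
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String entity = "&eacute;";
        String roundTripped = roundTrip(entity);

        // Same characters, so the entity text is still the entity text.
        System.out.println(roundTripped.equals(entity)); // true
        // But it is a distinct object.
        System.out.println(roundTripped != entity);      // true

        // For comparison, the real e-acute character (U+00E9) takes two
        // bytes in UTF-8: 0xC3 0xA9, shown here as signed Java bytes.
        byte[] eAcute = "\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(eAcute));     // [-61, -87]
    }
}
```

The eight ASCII characters of the entity come back as the same eight characters, which is why the console still shows &eacute;.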

In order to get e with an acute accent, you can do one of the following things:

  • Type it directly, like System.out.println("é");. This requires that your text editor and your Java compiler agree on the encoding of the source code file. If you're working in a project, it requires that everybody understands and agrees on a particular encoding. The recommended encoding these days is certainly UTF-8.
  • Use the Unicode character number. In the case of e acute that is U+00E9, which is written in Java source as System.out.println("\u00e9");.
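As a small sketch of the second option, the following builds the character from its Unicode number in two ways and checks that both give the same one-character string. The class name EAcute and its two helper methods are hypothetical. The source is pure ASCII, so it compiles the same way regardless of the source file encoding.

```java
final class EAcute {
    // Unicode escape in a string literal; the compiler turns it into U+00E9.
    static String fromEscape() {
        return "\u00e9";
    }

    // The same character built from its numeric code point at run time.
    static String fromCodePoint() {
        return new String(Character.toChars(0xE9));
    }

    public static void main(String[] args) {
        System.out.println(fromEscape().equals(fromCodePoint()));             // true
        System.out.println(fromEscape().length());                            // 1
        System.out.println(Integer.toHexString(fromEscape().codePointAt(0))); // e9
    }
}
```

Whether the console then displays the character correctly depends on the console's own encoding, which is a separate matter from how the String is built.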

PS: SGML / XML / HTML5 entities have nothing to do with UTF-8.


 