Java getBytes UTF-8 encoding
I am trying to solve an encoding problem: I want to convert the special characters in a string into the correct UTF-8 characters.
When I execute this simple code:
System.out.println(new String("&eacute;".getBytes("UTF-8"), "UTF-8"));
In the console I expect 'é', but I get:
&eacute;
&eacute; is the HTML entity reference for the é character, not a UTF-8 encoded string. To decode it, you can use Commons Lang's org.apache.commons.lang.StringEscapeUtils:
String decodedStr = StringEscapeUtils.unescapeHtml("&eacute;");
Java Strings know nothing of SGML / XML / HTML5 entities, and &eacute; is such an entity. It works in web browsers inside HTML because one of the DTDs, or the HTML5 spec, defines &eacute; as the letter e with acute accent by mapping it to the corresponding Unicode character entity &#233;.
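To make that mapping concrete, here is a minimal sketch that uses only the standard library. The class name EntityDemo, the decode helper, and the one-entry table are hypothetical and exist only for illustration; the sketch assumes Java 9 or later and handles only decimal references and the single name eacute. Real code would more likely rely on a maintained library such as the Commons Lang class mentioned above.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EntityDemo {
    // Hypothetical, deliberately tiny table; HTML defines many more named entities.
    private static final Map<String, Integer> NAMED = Map.of("eacute", 0xE9);

    // Matches a named reference (&eacute;) or a decimal reference (&#233;).
    private static final Pattern ENTITY = Pattern.compile("&(#\\d{1,7}|[A-Za-z]+);");

    static String decode(String input) {
        Matcher m = ENTITY.matcher(input);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            String body = m.group(1);
            Integer codePoint = body.startsWith("#")
                    ? Integer.valueOf(body.substring(1))
                    : NAMED.get(body);
            // Unknown names and invalid code points are left untouched.
            boolean known = codePoint != null && Character.isValidCodePoint(codePoint);
            String replacement = known ? new String(Character.toChars(codePoint)) : m.group();
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Compare against the Unicode escape so the result does not depend on console encoding.
        System.out.println(decode("&eacute;").equals("\u00E9")); // true
        System.out.println(decode("&#233;").equals("\u00E9"));   // true
    }
}
```

Both references decode to the same one-character String, which is the point: the entity is HTML-level notation, and Java only sees the character once something has decoded it.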
new String(someString.getBytes("UTF-8"), "UTF-8");
is a meaningless operation: it converts a String into bytes, using an encoding that can represent all meaningful characters, and then converts those bytes back into a String. It is the same as using someString directly, except that you get a new object.
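A small sketch of why the round trip changes nothing. The class name RoundTripDemo and the roundTrip helper are hypothetical; StandardCharsets.UTF_8 is used instead of the "UTF-8" string only to avoid the checked UnsupportedEncodingException.

```java
import java.nio.charset.StandardCharsets;

class RoundTripDemo {
    // Encode to UTF-8 bytes and decode straight back, as in the question.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String entity = "&eacute;";
        // The eight ASCII characters of the entity come back unchanged.
        System.out.println(roundTrip(entity).equals(entity)); // true

        // The actual character é (U+00E9) is two bytes in UTF-8.
        byte[] bytes = "\u00E9".getBytes(StandardCharsets.UTF_8);
        System.out.printf("%02X %02X%n", bytes[0] & 0xFF, bytes[1] & 0xFF); // C3 A9
    }
}
```

The entity text is plain ASCII, so encoding and decoding it as UTF-8 can only hand the same eight characters back.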
In order to get e with acute accent, you can do one of the following things:
Type it directly, i.e. System.out.println("é");. This requires that your text editor and your Java compiler agree on the encoding of the source code file. If you're working in a project, it requires that everybody understands and agrees on a particular encoding. The recommended encoding these days is certainly UTF-8.
Use its Unicode escape, which in the case of e acute is \u00E9.
PS: SGML / XML / HTML5 entities have nothing to do with UTF-8.
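The Unicode escape option can be checked with a short sketch like the following. The class name EscapeDemo and the eAcute helper are hypothetical. Because the source file contains only ASCII, the result does not depend on the editor and compiler agreeing on a source encoding.

```java
class EscapeDemo {
    // The compiler translates the escape into the single character U+00E9.
    static String eAcute() {
        return "\u00E9";
    }

    public static void main(String[] args) {
        String s = eAcute();
        System.out.println(s.length());                            // 1
        System.out.println(Integer.toHexString(s.codePointAt(0))); // e9
    }
}
```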