
How to convert octal char sequence to unicode in Java

Hi, I have the following string:

Let\\342\\200\\231s start with the most obvious question first. This is what an \\342\\200\\234unfurl\\342\\200\\235 is

It is supposed to be displayed as: Let’s start with the most obvious question first. This is what an “unfurl” is

The first three numbers ( \\342\\200\\231 ) actually represent an octal sequence http://graphemica.com/%E2%80%99 and its Unicode equivalent is ’ (U+2019).

Similarly, \\342\\200\\234 represents an octal sequence http://graphemica.com/%E2%80%9C and its Unicode equivalent is “ (U+201C).

Is there any library or function I can use to convert these octal sequences to their Unicode equivalents?

The bytes you show are (a representation of) the UTF-8 encoding, which is only one of several encoding forms of Unicode. Java is designed to handle such encodings as byte sequences (such as arrays, and also streams), not as chars and Strings. The somewhat cleaner way is to actually use bytes, but then you have to deal with the fact that Java bytes are signed (-128 .. +127) and every byte of a multibyte UTF-8 sequence is (by design) in the upper half of the 8-bit space:

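import java.nio.charset.StandardCharsets;

// values above 0177 do not fit a signed byte, so each octal literal needs a cast
byte[] a = {'L','e','t',(byte)0342,(byte)0200,(byte)0231,'s'};
System.out.println (new String (a,StandardCharsets.UTF_8));
// or arguably uglier: subtract 256 to bring each value into the signed range
byte[] b = {'L','e','t',0342-256,0200-256,0231-256,'s'};
System.out.println (new String (b,StandardCharsets.UTF_8));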

But if you want something closer to your original, you can cheat just a little: take a String (of unsigned chars) that actually contains the UTF-8 bytes and treat it as if it contained the 8-bit characters of Unicode range 0000-00FF, which is defined to be the same as ISO-8859-1. Encoding that String as ISO-8859-1 then gives back the original bytes, which can be decoded as UTF-8:

byte[] c = "Let\342\200\231s".getBytes(StandardCharsets.ISO_8859_1);
System.out.println (new String (c,StandardCharsets.UTF_8));
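
The two snippets above rely on the compiler turning octal escapes in a string literal into chars. If the text instead arrives at runtime with literal backslashes in it (as the doubled backslashes in the question suggest), the same idea can be applied by hand. The following is only a minimal sketch under that assumption: `OctalUnescape` and `decodeOctalEscapes` are hypothetical names, not part of any library, and only three-digit escapes in the range \000 to \377 are recognized.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class OctalUnescape {
    // Hypothetical helper (an assumption, not a library API): turns each literal
    // backslash followed by three octal digits into the byte it names, then
    // decodes the whole byte sequence as UTF-8.
    static String decodeOctalEscapes(String input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length()) {
            if (input.charAt(i) == '\\' && isOctalTriple(input, i + 1)) {
                // Parse the three digits in base 8; the result is 0..255.
                out.write(Integer.parseInt(input.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                // Anything else is re-encoded as UTF-8 so it survives the final decode.
                int cp = input.codePointAt(i);
                byte[] encoded = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                out.write(encoded, 0, encoded.length);
                i += Character.charCount(cp);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    // True if three octal digits start at index 'start' and fit in one byte.
    private static boolean isOctalTriple(String s, int start) {
        if (start + 3 > s.length()) {
            return false;
        }
        char first = s.charAt(start);
        if (first < '0' || first > '3') {
            return false;
        }
        for (int k = start + 1; k < start + 3; k++) {
            char d = s.charAt(k);
            if (d < '0' || d > '7') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The doubled backslashes here are Java source escapes for one literal backslash each.
        String raw = "Let\\342\\200\\231s start. This is what an \\342\\200\\234unfurl\\342\\200\\235 is";
        System.out.println(decodeOctalEscapes(raw));
    }
}
```

Collecting bytes first and decoding once at the end matters here, because a single character is spread over several escapes and cannot be decoded one escape at a time.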

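In a Java string literal you cannot write this character with an octal escape, because octal escapes only go up to \377 (U+00FF). It has to be written with a hexadecimal Unicode escape instead.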

This works fine:

System.out.println("\u2019");
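
To see how the hexadecimal escape relates to the octal sequence in the question, one can print the UTF-8 bytes of the character in octal. This is just an illustrative sketch; `UnicodeEscapeDemo` and `toOctalBytes` are made-up names for this example.

```java
import java.nio.charset.StandardCharsets;

class UnicodeEscapeDemo {
    // Renders the UTF-8 encoding of a string as backslash-octal text,
    // one three-digit group per byte.
    static String toOctalBytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask with 0xFF because Java bytes are signed.
            sb.append('\\').append(String.format("%03o", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String quote = "\u2019"; // RIGHT SINGLE QUOTATION MARK
        System.out.println(toOctalBytes(quote));
    }
}
```

For the right single quotation mark this yields the three groups 342, 200 and 231, the same bytes that appear in the question's string.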

It is probably for purely historical reasons that Java supports octal escape sequences at all. These escape sequences originated in C (or maybe in C's predecessors B and BCPL), in the days when computers like the PDP-7 ruled the Earth. Much programming was done in assembly or directly in machine code, octal was the preferred number base for writing instruction codes, and there was no Unicode, just ASCII, so three octal digits were sufficient to represent the entire character set.

By the time Unicode and Java came along, octal had pretty much given way to hexadecimal as the preferred number base when decimal just wouldn't do. So Java has its \u escape sequence that takes hexadecimal digits. The octal escape sequence was probably supported just to make C programmers comfortable, and to make it easy to copy'n'paste string constants from C programs into Java programs.
