
How to convert octal char sequence to unicode in Java

Hi, I have the following string:

Let\\342\\200\\231s start with the most obvious question first. This is what an \\342\\200\\234unfurl\\342\\200\\235 is

It is supposed to be displayed as: Let’s start with the most obvious question first. This is what an “unfurl” is

The first three numbers ( \\342\\200\\231 ) represent an octal byte sequence (http://graphemica.com/%E2%80%99), and the Unicode equivalent is ’ (U+2019).

Similarly, \\342\\200\\234 represents an octal byte sequence (http://graphemica.com/%E2%80%9C), and its Unicode equivalent is “ (U+201C).

Is there a library or function I can use to convert these octal sequences to their Unicode equivalents?

The bytes you show are (a representation of) the UTF-8 encoding, which is only one of several encoding forms of Unicode. Java is designed to handle such encodings as byte sequences (arrays, and also streams), not as chars and Strings. The somewhat cleaner way is to actually use bytes, but then you have to deal with the fact that Java bytes are signed (-128 .. +127) and all bytes of multibyte UTF-8 codes are (by design) in the upper half of the 8-bit space:

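import java.nio.charset.StandardCharsets;

// octal literals above 0177 do not fit in a signed byte, so cast each one
byte[] a = {'L','e','t',(byte)0342,(byte)0200,(byte)0231,'s'};
System.out.println(new String(a, StandardCharsets.UTF_8)); // decodes to Let’s
// or, arguably uglier, subtract 256 to get the signed value directly
byte[] b = {'L','e','t',0342-256,0200-256,0231-256,'s'};
System.out.println(new String(b, StandardCharsets.UTF_8));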

But if you want something closer to your original, you can cheat just a little: write the UTF-8 bytes as octal escapes in a String literal, so that each escape becomes one char in the range U+0000 to U+00FF, which Unicode defines to be the same as ISO-8859-1. Encoding that String as ISO-8859-1 gives back the original bytes, which can then be decoded as UTF-8:

byte[] c = "Let\342\200\231s".getBytes(StandardCharsets.ISO_8859_1);
System.out.println (new String (c,StandardCharsets.UTF_8));
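
If the text only arrives at run time, with the escapes spelled out literally as characters, the same idea can be applied by hand: collect the bytes, then decode them as UTF-8. The following is a minimal sketch, not a library API. It assumes each byte is written as exactly one backslash followed by three octal digits; the names OctalUtf8Decoder, decode and isOctalEscape are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not part of any library): decodes text in which UTF-8
// bytes are spelled out as a backslash followed by exactly three octal digits.
class OctalUtf8Decoder {

    static String decode(String input) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        int i = 0;
        while (i < input.length()) {
            if (isOctalEscape(input, i)) {
                // Parse the three digits after the backslash as one byte value (0..255).
                bytes.write(Integer.parseInt(input.substring(i + 1, i + 4), 8));
                i += 4;
            } else {
                // Anything else is copied through, re-encoded as UTF-8.
                int cp = input.codePointAt(i);
                byte[] encoded = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
                bytes.write(encoded, 0, encoded.length);
                i += Character.charCount(cp);
            }
        }
        // Decode the collected bytes in one go so multibyte sequences are joined.
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    // True if position i holds a backslash plus three octal digits that fit in one byte.
    private static boolean isOctalEscape(String s, int i) {
        if (s.charAt(i) != '\\' || i + 3 >= s.length()) {
            return false;
        }
        char d1 = s.charAt(i + 1);
        char d2 = s.charAt(i + 2);
        char d3 = s.charAt(i + 3);
        return d1 >= '0' && d1 <= '3'
            && d2 >= '0' && d2 <= '7'
            && d3 >= '0' && d3 <= '7';
    }

    public static void main(String[] args) {
        // In Java source, a doubled backslash denotes one literal backslash character.
        String raw = "Let\\342\\200\\231s start";
        System.out.println(decode(raw).equals("Let\u2019s start")); // true
    }
}
```

If the data really contains two backslashes before each group of digits, the check in isOctalEscape would need adjusting. Byte sequences that are not valid UTF-8 are replaced with U+FFFD by the String constructor rather than raising an error.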

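In Java, an octal escape in a string literal only covers \0 to \377 (U+0000 to U+00FF), so a character such as U+2019 cannot be written directly with an octal escape. It can be written with a hexadecimal \u escape.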

This works fine:

System.out.println("\u2019");
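
To see how the two notations relate, here is a small sketch (EscapeComparison and toOctalBytes are hypothetical names, not from the original answer). It shows that an octal escape produces a single char no larger than 255, and that the UTF-8 bytes of U+2019 are exactly the octal values from the question.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: compares an octal escape with a Unicode escape.
class EscapeComparison {

    // Formats each UTF-8 byte of s as a backslash plus three octal digits.
    static String toOctalBytes(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            // Mask with 0xFF because Java bytes are signed.
            sb.append(String.format("\\%03o", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // An octal escape yields one char in the range 0..255.
        System.out.println((int) "\342".charAt(0));   // 226
        // The Unicode escape yields the character itself; its UTF-8 bytes
        // are the three octal values seen in the question.
        System.out.println(toOctalBytes("\u2019"));   // \342\200\231
    }
}
```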

It is probably for purely historical reasons that Java supports octal escape sequences at all. These escape sequences originated in C (or maybe in C's predecessors B and BCPL), in the days when computers like the PDP-7 ruled the Earth. Much programming was done in assembly or directly in machine code, octal was the preferred number base for writing instruction codes, and there was no Unicode, just ASCII, so three octal digits were sufficient to represent the entire character set.

By the time Unicode and Java came along, octal had pretty much given way to hexadecimal as the preferred number base when decimal just wouldn't do. So Java has its \u escape sequence, which takes hexadecimal digits. The octal escape sequence was probably supported just to make C programmers comfortable, and to make it easy to copy'n'paste string constants from C programs into Java programs.
