简体   繁体   中英

Java Unicode Confusion

HEy all, I have only just started attempting to learn Java and have run into something that is really confusing!

I was typing out an example from the book I am using. It is to demonstrate the char data type.

The code is as follows :

public class CharDemo
{
public static void main(String [] args)
{
char a = 'A';
char b = (char) (a + 1);
System.out.println(a + b);
System.out.println("a + b is " + a + b);
int x = 75;
char y = (char) x;
char half = '\u00AB';
System.out.println("y is " + y + " and half is " + half);
}
}

The bit that is confusing me is the statement, char half = '\«'. The book states that \« is the code for the symbol '1/2'. As described, when I compile and run the program from cmd the symbol that is produced on this line is in fact a '1/2'.

So everything appears to be working as it should. I decided to play around with the code and try some different unicodes. I googled multiple unicode tables and found none of them to be consistent with the above result.

In every one I found it stated that the code /u00AB was not for '1/2' and was in fact for this:

http://www.fileformat.info/info/unic...r/ab/index.htm So what character set is Java using, I thought UNicode was supposed to be just that, Uni, only one. I have searched for hours and nowhere can I find a character set that states /u00AB is equal to a 1/2, yet this is what my java compiler interprets it as.

I must be missing something obvious here! Thanks for any help!

It's a well-known problem with console encoding mismatch on Windows platforms.

Java Runtime expects that encoding used by the system console is the same as the system default encoding. However, Windows uses two separate encodings: ANSI code page (system default encoding) and OEM code page (console encoding) .

So, when you try to write Unicode character U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK to the console, Java runtime expects that console encoding is the ANSI encoding (that is Windows-1252 in your case), where this Unicode character is represented as 0xAB . However, the actual console encoding is the OEM encoding ( CP437 in your case), where 0xAB means ½ .

Therefore printing data to Windows console with System.out.println() produces wrong results.

To get correct results you can use System.console().writer().println() instead.

The character is not the 1/2 character; see this definitive code page from the Unicode.org website.

What you are seeing is (I think) a consequence of using the System.out PrintStream on a platform where the default character encoding is not UTF-8 or Latin-1. Maybe it is some Windows character set as suggested by @axtavt's answer? (It also has a plausible explanation of why is displayed as 1/2 ... and not some "splat" character.)

(In Unicode and Latin-1, \\00BD is the codepoint for the 1/2 character.)

0xAB is 1/2 in good old Codepage 437 , which is what Windows terminals will use by default, no matter what codepage you actually set .

So, in fact, the char value represents the "«" character to a Java program, and if you render that char in a GUI or run it on a sane operating system, you will get that character. If you want to see proper output in Windows as well, switch your Font settings in CMD away from "Raster Fonts" (click top-left icon, Properties, Font tab). For example, with Lucida Console, I can do this:

C:\Users\Documents>java CharDemo
131
a + b is AB
y is K and half is ½    

C:\Users\Documents>chcp 1252
Active code page: 1252

C:\Users\Documents>java CharDemo
131
a + b is AB
y is K and half is «

C:\Users\Documents>chcp 437
Active code page: 437

One thing great about Java is that it is unicode based. That means, you can use characters from writing systems that are not english alphabets (eg Chinese or math symbols), not just in data strings, but in function and variable names too.

Here's a example code using unicode characters in class names and variable names.

class 方 {
    String 北 = "north";
    double π = 3.14159;
}

class UnicodeTest {
    public static void main(String[] arg) {
        方 x1 = new 方();
        System.out.println( x1.北 );
        System.out.println( x1.π );
    }
}

Java was created around the time when the Unicode standard had values defined for a much smaller set of characters. Back then it was felt that 16-bits would be more than enough to encode all the characters that would ever be needed. With that in mind Java was designed to use UTF-16. In fact, the char data type was originally used to be able to represent a 16-bit Unicode code point.

The UTF-8 charset is specified by RFC 2279;

The UTF-16 charsets are specified by RFC 2781

The UTF-16 charsets use sixteen-bit quantities and are therefore sensitive to byte order. In these encodings the byte order of a stream may be indicated by an initial byte-order mark represented by the Unicode character '\'. Byte-order marks are handled as follows:

When decoding, the UTF-16BE and UTF-16LE charsets ignore byte-order marks; when encoding, they do not write byte-order marks.

When decoding, the UTF-16 charset interprets a byte-order mark to indicate the byte order of the stream but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.

Also see this

Well, when I use that code I get the << as I should and 1/2 for as it should be.

http://www.unicode.org/charts/

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM