Java: replace missing Unicode symbols in a string?

I have a rather straightforward question. When I read a string from a stream, all of the letters are fine except symbols. For example, if I tried to read a username that has the ™ or the © symbol in it, the symbols print out as: â„¢ and Â©, respectively. I thought that Java supported all of the Unicode characters. How can I get the symbols to be printed out correctly?

Is there a special type of string that I could use, or perhaps another solution to this problem?

When you read from a stream using, e.g.,

InputStreamReader reader = new InputStreamReader(stream);

you tell Java to use the platform's default encoding, which may not be a Unicode encoding. Given how common Windows PCs are, that is likely the case at least half the time.

You need to specify the encoding of the byte stream, e.g.

InputStreamReader reader = new InputStreamReader(stream, charset);

Or, if you pass the charset name rather than a Charset instance:

InputStreamReader reader = new InputStreamReader(stream, "UTF-8");
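To make the difference concrete, here is a minimal sketch that decodes the same bytes twice, once with the right charset and once with the wrong one. The class name ExplicitCharsetRead, the helper readAll, and the in-memory stream standing in for your real stream are all assumptions made for illustration. The symbols are written as \u escapes and only lengths are printed, so the result does not depend on the source file or console encoding.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original question or answers.
class ExplicitCharsetRead {

    // Reads the whole stream as text, decoding its bytes with the given charset.
    static String readAll(InputStream stream, Charset charset) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(stream, charset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the real stream: the UTF-8 bytes of "user" + TM sign + copyright sign.
        byte[] utf8 = "user\u2122\u00A9".getBytes(StandardCharsets.UTF_8);

        String right = readAll(new ByteArrayInputStream(utf8), StandardCharsets.UTF_8);
        String wrong = readAll(new ByteArrayInputStream(utf8), StandardCharsets.ISO_8859_1);

        System.out.println(right.length()); // 6: four letters plus two symbols
        System.out.println(wrong.length()); // 9: one char per byte, so the symbols are mangled
    }
}
```

Passing a Charset constant such as StandardCharsets.UTF_8 instead of the name "UTF-8" also avoids the checked UnsupportedEncodingException.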

Based on the character examples you are giving, I believe you are reading in the characters correctly. For example, the copyright character is U+00A9 in Unicode. When you write it out in UTF-8, however, it is serialized as two bytes: C2 followed by A9. See http://www.fileformat.info/info/unicode/char/a9/index.htm

If your output device expects data in UTF-8 format, all will be well. However, since you are seeing Â©, I believe your output device expects data in ISO-8859-1 (see http://en.wikipedia.org/wiki/ISO/IEC_8859-1 ), so you have a mismatch. The output device interprets the C2 as Â and the A9 as ©.
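The mismatch can be reproduced in a few lines. This is only a sketch: the class MojibakeDemo and its helpers are made-up names, and using windows-1252 for the ™ case is my assumption about which code page produces the â„¢ you reported. The output is shown as code points so that it does not depend on your console.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the original question or answers.
class MojibakeDemo {

    // Encodes text as UTF-8, then decodes those bytes with a different charset,
    // which is effectively what a mismatched output device does.
    static String misread(String text, Charset assumed) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, assumed);
    }

    // Renders a string as U+XXXX code points, independent of the console encoding.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // The copyright sign is U+00A9; its UTF-8 bytes are C2 A9.
        System.out.println(codePoints(misread("\u00A9", StandardCharsets.ISO_8859_1)));
        // prints: U+00C2 U+00A9 (the characters Â and ©)

        // The trademark sign is U+2122; its UTF-8 bytes are E2 84 A2.
        System.out.println(codePoints(misread("\u2122", Charset.forName("windows-1252"))));
        // prints: U+00E2 U+201E U+00A2 (the characters â, „ and ¢)
    }
}
```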

To fix this in code (without changing your output device) you need to create a print stream that uses the ISO-8859-1 character encoding when it converts your Unicode characters to a byte stream. For example:

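import java.io.PrintStream;

// EncodingDemo is just a placeholder class name so the example compiles
public class EncodingDemo
{
    public static void main(String[] args) throws Exception
    {
        // printed using the default character encoding
        String s = "copyright is ©";
        System.out.println(s);

        // wrap System.out in a new stream that encodes characters as ISO-8859-1
        PrintStream out = new PrintStream(System.out, true, "ISO-8859-1");
        out.println(s);
    }
}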

In my case the first println looks good, because the IDE console window uses UTF-8, and the second one looks bogus. In your case the first line should be bad (showing two characters where the copyright symbol should be) and the second one should show the correct copyright character.
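If you would rather check the bytes than trust what a console shows, a sketch like the following captures the PrintStream output in memory. The class PrintStreamBytes and the helper encodedHex are hypothetical names, and the copyright sign is written as \u00A9 to sidestep source file encoding issues.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

// Hypothetical demo class; not part of the original question or answers.
class PrintStreamBytes {

    // Prints text through a PrintStream that uses the named encoding
    // and returns the resulting bytes as space-separated hex.
    static String encodedHex(String text, String encodingName) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (PrintStream out = new PrintStream(sink, true, encodingName)) {
            out.print(text);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown encoding: " + encodingName, e);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : sink.toByteArray()) {
            hex.append(String.format("%02X ", b & 0xFF));
        }
        return hex.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(encodedHex("\u00A9", "UTF-8"));      // prints: C2 A9
        System.out.println(encodedHex("\u00A9", "ISO-8859-1")); // prints: A9
    }
}
```

A device that expects ISO-8859-1 shows the single byte A9 as ©, while the two-byte UTF-8 form is what it renders as two characters.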
