
How can I make the System.in InputStream read UTF-8 characters?

This is my code:

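import java.io.InputStream;
import java.util.Scanner;

public class MyTestClass {
    public static void main(String[] args) throws Exception {
        // First input line: read and decoded by Scanner
        Scanner scanner = new Scanner(System.in);
        String s = scanner.nextLine();
        // Second input line: only a single raw byte is read
        InputStream inputStream = System.in;
        int read = inputStream.read();
        System.out.println(read);
        System.out.println((char)read);
        System.out.println(s);
    }
}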

When I run the program, I enter the letter ğ twice. The console then shows:

ğ
ğ
196
Ä
ğ

How can I see the correct letter instead of Ä? Scanner seems to do the right thing.

And why doesn't this approach work? What is wrong here?

The javadoc for InputStream#read() states:

Reads the next byte of data from the input stream.

But as it turns out, the character ğ requires 2 bytes for its representation in UTF-8, so you need to read two bytes. You can use InputStream#read(byte[]) for that.

byte[] buffer = new byte[2];
// read(byte[]) returns how many bytes were actually read, which may be fewer than 2
int count = inputStream.read(buffer);

Once the byte array contains the appropriate bytes, you need to decode them as UTF-8. You can do that with

char val = StandardCharsets.UTF_8.decode(ByteBuffer.wrap(buffer)).get();

The variable val will now contain the decoded character.

Note that some UTF-8 encoded characters need only one byte (and others need three or four), so you should only do what we just did if you know how many bytes you need. Otherwise, read everything and pass it to the decoder.
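The "read everything and pass it to the decoder" route could look roughly like the sketch below. The class name Utf8ReadAll and the method readAllAsUtf8 are made up for illustration, and an in-memory stream stands in for System.in so the example does not wait for console input.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original question or answers.
class Utf8ReadAll {
    // Collects every byte until end of stream, then decodes the whole array as UTF-8.
    static String readAllAsUtf8(InputStream in) throws IOException {
        ByteArrayOutputStream collected = new ByteArrayOutputStream();
        byte[] chunk = new byte[1024];
        int n;
        while ((n = in.read(chunk)) != -1) {
            collected.write(chunk, 0, n);
        }
        return new String(collected.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // 0xC4 0x9F is the UTF-8 encoding of U+011F (the letter from the question); 0x61 is 'a'.
        byte[] sample = {(byte) 0xC4, (byte) 0x9F, 0x61};
        String text = readAllAsUtf8(new ByteArrayInputStream(sample));
        System.out.println(text.length());        // 2
        System.out.println((int) text.charAt(0)); // 287
    }
}
```

Decoding only after all bytes are collected avoids having to guess where one multi-byte character ends and the next begins.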

InputStream.read() returns the next byte of data, which is a number between 0 and 255 (or -1 at the end of the stream).

Here, you are simply casting that byte value to char. In your case the value is 196 (0xC4), which is only the first of the two UTF-8 bytes of ğ, and the char with code 196 is Ä.
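To see where the 196 comes from, the following sketch lists the UTF-8 bytes of ğ as unsigned values, the way InputStream.read() would report them. The class FirstByteDemo and its method are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or answers.
class FirstByteDemo {
    // Returns the UTF-8 bytes of s as unsigned values in the range 0..255.
    static int[] unsignedUtf8Bytes(String s) {
        byte[] raw = s.getBytes(StandardCharsets.UTF_8);
        int[] out = new int[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = raw[i] & 0xFF; // mask off the sign so 0xC4 becomes 196, not -60
        }
        return out;
    }

    public static void main(String[] args) {
        int[] bytes = unsignedUtf8Bytes("\u011F"); // the letter from the question
        System.out.println(bytes.length);          // 2
        System.out.println(bytes[0]);              // 196
        System.out.println(bytes[1]);              // 159
        // Casting the first byte alone yields U+00C4, which is the wrong letter seen in the output.
        System.out.println((char) bytes[0] == '\u00C4'); // true
    }
}
```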

Scanner, on the other hand, decodes the incoming bytes into characters before handing you the string, which is why you see the correct output. I suggest using Scanner rather than a plain InputStream, since it offers convenient methods for reading text.
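If the default charset cannot be relied on, Scanner also has a constructor that takes a charset name. A minimal sketch, assuming the input really is UTF-8; the class ScannerUtf8 and the method firstLine are illustrative names, and an in-memory stream replaces System.in:

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.Scanner;

// Hypothetical demo class, not part of the original question or answers.
class ScannerUtf8 {
    // Reads the first line from the stream, decoding the bytes as UTF-8.
    static String firstLine(InputStream in) {
        Scanner scanner = new Scanner(in, "UTF-8");
        return scanner.hasNextLine() ? scanner.nextLine() : "";
    }

    public static void main(String[] args) {
        // 0xC4 0x9F (U+011F), a newline, then 'a' on a second line.
        byte[] sample = {(byte) 0xC4, (byte) 0x9F, '\n', 'a'};
        String line = firstLine(new ByteArrayInputStream(sample));
        System.out.println(line.length());        // 1
        System.out.println((int) line.charAt(0)); // 287
    }
}
```

For console input, System.in would be passed in place of the ByteArrayInputStream.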

Wrap the InputStream in an InputStreamReader.

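// read() on a Reader returns one decoded character, not one byte
int read = new InputStreamReader(System.in).read();
System.out.println((char) read); // prints 'ğ' if the default charset matches the console's encoding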

If necessary, you can pass a specific Charset to the reader's constructor. Without one, the reader uses the platform's default charset, which is probably correct for console input.
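Passing the charset explicitly could look like the sketch below. It assumes the bytes arriving on the stream are UTF-8; Utf8Reader and readFirstChar are illustrative names, and an in-memory stream is used so the example does not block on console input.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question or answers.
class Utf8Reader {
    // Returns the first decoded character as an int, or -1 at end of stream.
    static int readFirstChar(InputStream in) throws IOException {
        Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8);
        return reader.read();
    }

    public static void main(String[] args) throws IOException {
        // 0xC4 0x9F is the UTF-8 encoding of U+011F (the letter from the question).
        byte[] sample = {(byte) 0xC4, (byte) 0x9F};
        int read = readFirstChar(new ByteArrayInputStream(sample));
        System.out.println(read);                    // 287
        System.out.println((char) read == '\u011F'); // true
    }
}
```

With real console input, readFirstChar(System.in) would be the call; whether the result is correct still depends on the terminal actually sending UTF-8.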
