
Read and Write UTF-8 characters from System.in stream

If I print a Unicode string such as ελληνικά to the console using the print method of the System.out stream, it is printed as expected (my output console uses the Ubuntu Mono font, which supports these characters).

But if I try to read UTF-8 encoded Unicode characters from the console using the System.in stream, they are not read properly. I have tried many different reader classes on top of System.in, but none of them works. Does anyone know how I could do that?

Here is a code sample:

BufferedReader keyboard = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
BufferedWriter console = new BufferedWriter(new OutputStreamWriter(System.out, "UTF-8"));

console.write("p1: Γίνεται πάντως\n");
console.flush();
System.out.println("p2: Γίνεται πάντως");

byte dataBytes[] = keyboard.readLine().getBytes(Charset.forName("UTF-8"));
System.out.println("p3: " + new String(dataBytes));
console.write("p4: " + new String(dataBytes, "UTF-8") + "\n");
console.flush();
Scanner scan = new Scanner(System.in, "UTF-8");

System.out.println("p5: " + (char) System.in.read());
System.out.println("p6: " + scan.nextLine());
System.out.println("p7: " + keyboard.readLine());

And the output on my console:

p1: Γίνεται πάντως
p2: Γίνεται πάντως
Δέν
p3: ���
p4: ���
Δέν
p5: Ä
p6: ��
Δέν
p7: ���

My IDE is NetBeans.

System.in is an InputStream, which is a stream of bytes. You need a Reader to read characters; the reader does the decoding for you.

In this case, you can wrap System.in in an InputStreamReader, passing "UTF-8" as the second constructor parameter.

Scanner console = new Scanner(new InputStreamReader(System.in, "UTF-8"));
while (console.hasNextLine())
    System.out.println(console.nextLine());
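To check that the reader side works independently of the console, the same decoding can be run against an in-memory stream whose bytes are known to be UTF-8. This is only a sketch: the class name ReadLineDemo and the helper readFirstLine are made up for illustration, and the string is written with Unicode escapes so the result does not depend on the source file encoding.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class ReadLineDemo {

    // Hypothetical helper: decodes the stream as UTF-8 and returns its first line.
    static String readFirstLine(InputStream in) {
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The UTF-8 bytes of "Δέν" followed by a newline.
        byte[] utf8 = {(byte) 0xCE, (byte) 0x94, (byte) 0xCE, (byte) 0xAD,
                       (byte) 0xCE, (byte) 0xBD, '\n'};
        String line = readFirstLine(new ByteArrayInputStream(utf8));
        // Compare instead of printing the text, so console encoding cannot interfere.
        System.out.println(line.equals("\u0394\u03AD\u03BD")); // prints true
    }
}
```

If this prints true while the same reader on System.in produces garbage, the reader is fine and the bytes delivered on stdin are the more likely culprit.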

Update:

It's likely the encoding of your stdin is wrong. To verify, you can compare the byte array you get from System.in with the expected one.

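byte[] expected = "Δέν".getBytes("UTF-8"); // [-50, -108, -50, -83, -50, -67]

byte[] fromStdin = new byte[1024];
int c = System.in.read(fromStdin); // the count includes the trailing line terminator
// Compare only the positions both arrays have, so longer input cannot overrun expected.
for (int i = 0; i < Math.min(c, expected.length); i++) {
    if (expected[i] != fromStdin[i]) {
        System.out.println(i + ", " + fromStdin[i]);
    }
}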

Type Δέν (without quotes) and press Enter. If the program prints anything, the bytes arriving on System.in are not UTF-8.
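When the bytes do differ, printing them in hex can make it easier to guess which encoding the console is really using. The sketch below is an illustration only; the class name HexDump and the helper toHex are hypothetical. As a hint, the question's p5 line shows Ä (U+00C4) for the first byte, and 0xC4 is Δ in a single-byte Greek encoding such as ISO-8859-7, whereas UTF-8 would start with 0xCE. That is a guess from the output, not something confirmed by the asker.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class HexDump {

    // Hypothetical helper: formats the first len bytes as space-separated hex pairs.
    static String toHex(byte[] bytes, int len) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < len; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "Δέν" written with escapes so the source file encoding does not matter.
        byte[] expected = "\u0394\u03AD\u03BD".getBytes(StandardCharsets.UTF_8);
        System.out.println("expected: " + toHex(expected, expected.length));

        byte[] fromStdin = new byte[1024];
        int n = System.in.read(fromStdin); // -1 if stdin is already at end of stream
        System.out.println("stdin:    " + toHex(fromStdin, Math.max(n, 0)));
    }
}
```

For UTF-8 input the second line should begin with CE 94 CE AD CE BD, followed by the line terminator bytes.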

Shouldn't System.in have the same encoding as defaultCharset or some system property?

Not necessarily. It's a byte stream, not a character stream. It cannot be a character stream, because you should be able to feed it binary data: an image, audio, video, whatever you want. It must support those, which is why it's just an InputStream. What arrives on it depends on what the environment gives your program, and I know very little about your environment. You need to find out how to change your environment, or figure out what encoding it's actually giving your program.

For example, suppose we have a UTF-16 text file utf16.txt and we feed its content to a program that expects STDIN to be UTF-8 encoded text:
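As a starting point for investigating the environment, the JVM can report which encodings it assumes. This is a rough sketch with a made-up class name, EncodingInfo. Note the limits: these values describe the JVM's own defaults, not the bytes the console actually sends; sun.jnu.encoding is implementation-specific, and stdout.encoding is only set on newer JDKs, so either may show as not set.

```java
import java.nio.charset.Charset;

public class EncodingInfo {

    // Hypothetical helper: renders a system property, marking absent ones explicitly.
    static String describe(String property) {
        String value = System.getProperty(property);
        return property + " = " + (value == null ? "(not set)" : value);
    }

    public static void main(String[] args) {
        System.out.println("defaultCharset = " + Charset.defaultCharset());
        System.out.println(describe("file.encoding"));
        // Implementation-specific or version-dependent properties; may be absent.
        System.out.println(describe("sun.jnu.encoding"));
        System.out.println(describe("stdout.encoding"));
    }
}
```

Running it both from the IDE and from a plain terminal may show whether the two environments start the JVM with different settings.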

java -cp ... our.utf8.Program < utf16.txt

It's going to read gibberish.
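The same effect can be reproduced without any files by encoding a string one way and decoding it another. A minimal sketch, assuming nothing beyond the standard library; MismatchDemo and transcode are hypothetical names chosen for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MismatchDemo {

    // Hypothetical helper: encodes text with one charset, then decodes with another.
    static String transcode(String text, Charset writtenAs, Charset readAs) {
        return new String(text.getBytes(writtenAs), readAs);
    }

    public static void main(String[] args) {
        String original = "\u0394\u03AD\u03BD"; // "Δέν"

        // Matching charsets round-trip cleanly.
        String same = transcode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        System.out.println(original.equals(same)); // prints true

        // UTF-16 bytes read as UTF-8 decode to different characters.
        String garbled = transcode(original, StandardCharsets.UTF_16, StandardCharsets.UTF_8);
        System.out.println(original.equals(garbled)); // prints false
    }
}
```

The comparison is printed instead of the garbled text itself, since the console's own encoding would otherwise muddy the result.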

Try using java.io.Console.readLine() or java.io.Console.readLine(String, Object...). A Console instance is returned by the System.console() method, which returns null when no console is attached (for example, when input is redirected). For example:

package package01;

import java.io.Console;

public class Example {

    public static void main(String[] args) {
        Console console = System.console();
        if (console == null) {
            System.err.println("No console");
            System.exit(1);
        }
        String s = console.readLine("Enter string: ");
        System.out.println(s);
    }

}
