Java Scanner reading UTF-8 lambda character as 0

I am trying to read lambda expressions with a Java Scanner, but the lambda character (λ) is read as byte 0 and prints as an empty string. I have tried constructing the scanner with UTF-8 explicitly and changing the terminal encoding, but nothing changed. I am using VS Code.

import java.util.*;
public class App {
    public static void main (String[] args) throws Exception {
            Scanner in = new Scanner(System.in, "UTF-8");

            System.out.print("> ");
            //input (λa.a)
            String cmd = in.nextLine();

            byte[] cmdBytes = cmd.getBytes("UTF-8");

            for (int i = 0; i < cmdBytes.length; i++) {
                System.out.println((int)cmdBytes[i] + "\"" + cmd.charAt(i) + "\"");
            }
            /*outputs
            40"("
            0" "
            97"a"
            46"."
            97"a"
            41")"
            */
    }
}

You can get Lambda to be printed out to the console, but you will need a couple of changes to your Java code.

Here is my Java code:

import java.util.Scanner;

public class ScannerLambda {

    public static void main(String[] args) throws Exception {

        Scanner in = new Scanner(System.in, "UTF-8");

        System.out.print("> ");
        //input (λa.a)
        String cmd = in.nextLine();

        System.out.println(cmd);

        // Use chars and not bytes, because lambda has 2 bytes in UTF-8
        char[] cmdchars = cmd.toCharArray();

        for (int i = 0; i < cmdchars.length; i++) {
            System.out.println((int) cmdchars[i] + "\"" + cmd.charAt(i) + "\"");
        }
    }
}

Then you will need to start the programme with this JVM option:

-Dfile.encoding=UTF-8

This sets the JVM's default character encoding to UTF-8, so that characters such as λ are written to the console correctly. This is especially important on Windows, where the default character set is not UTF-8.
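To see whether the default charset the JVM picked up can represent λ at all, a small check like the following may help. It is only a sketch: the class name EncodingCheck and the helper canEncode are made up for illustration, and it inspects the JVM's default charset, not what the terminal itself can display.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class EncodingCheck {

    // Hypothetical helper: true if the given charset can encode the character.
    static boolean canEncode(Charset charset, char c) {
        return charset.newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        Charset def = Charset.defaultCharset();
        // The file.encoding property is what -Dfile.encoding=UTF-8 sets.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + def);
        // '\u03bb' is the lambda character, written as an escape so the
        // source file's own encoding does not matter.
        System.out.println("default charset can encode lambda: " + canEncode(def, '\u03bb'));
        System.out.println("US-ASCII can encode lambda: "
                + canEncode(StandardCharsets.US_ASCII, '\u03bb'));
    }
}
```

Running it once with and once without -Dfile.encoding=UTF-8 shows whether the option changes anything on your setup.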

This is the output that I am getting with the solution presented here:

> λa.a
λa.a
955"λ"
97"a"
46"."
97"a"

That happens because your input terminal does not support UTF-8 or the input format is not UTF-8, so the lambda gets mapped to 0. Use a terminal that supports UTF-8.
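One way to confirm that the terminal, and not Scanner, is at fault is to feed Scanner known bytes instead of System.in. The sketch below does that with a ByteArrayInputStream; the class name ScannerBytes and the helper readLine are invented for this example, and the second input simply assumes the terminal delivered a 0 byte where the λ should be, as the question's output suggests.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ScannerBytes {

    // Hypothetical helper: decode the first line of raw input bytes
    // with a Scanner that uses the named charset.
    static String readLine(byte[] input, String charsetName) {
        try (Scanner in = new Scanner(new ByteArrayInputStream(input), charsetName)) {
            return in.nextLine();
        }
    }

    public static void main(String[] args) {
        // Real UTF-8 bytes for "(λa.a)": Scanner decodes them correctly.
        byte[] good = "(\u03bba.a)\n".getBytes(StandardCharsets.UTF_8);
        String ok = readLine(good, "UTF-8");
        System.out.println((int) ok.charAt(1)); // 955

        // Assumed terminal output: a single 0 byte in place of the lambda.
        byte[] bad = {40, 0, 97, 46, 97, 41, 10};
        String broken = readLine(bad, "UTF-8");
        System.out.println((int) broken.charAt(1)); // 0
    }
}
```

If Scanner gets proper UTF-8 bytes it returns the λ, so a 0 in the result means the 0 was already in the bytes the terminal sent.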

Even with a UTF-8 terminal, keep in mind that some characters, such as λ, take two bytes in UTF-8, so your for loop breaks from that point on. It prints the second byte of the lambda next to the "a", the byte of the "a" next to the ".", and so on. In the end you get an exception, because cmdBytes has length 7 while cmd has only 6 characters, so the loop calls cmd.charAt(6):

> (λa.a)
0: 40 "("
1: -50 "λ"
2: -69 "a"
3: 97 "."
4: 46 "a"
5: 97 ")"
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 6
    at java.lang.String.charAt(String.java:658)
    at App.main(App.java:14)
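The length mismatch behind that exception can be checked directly. A minimal sketch, with the class name LengthMismatch and both helpers made up for illustration:

```java
import java.nio.charset.StandardCharsets;

public class LengthMismatch {

    // Number of chars in the string; this is the range charAt() accepts.
    static int charCount(String s) {
        return s.length();
    }

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8ByteCount(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String cmd = "(\u03bba.a)"; // "(λa.a)", lambda written as an escape
        System.out.println(charCount(cmd));     // 6
        System.out.println(utf8ByteCount(cmd)); // 7
    }
}
```

Any loop bounded by the byte count but indexing with charAt() runs one step too far for this input.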

Replacing byte[] cmdBytes = cmd.getBytes("UTF-8"); with char[] cmdBytes = cmd.toCharArray(); should do the job. Just keep in mind that λ is then a single char, but it still occupies two bytes when encoded as UTF-8.

> (λa.a)
0: 40 "("
1: 955 "λ"
2: 97 "a"
3: 46 "."
4: 97 "a"
5: 41 ")"
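If you want both views at once, you can print each character together with its own UTF-8 bytes. This is a sketch rather than part of the original answer: the class name CodePointDump and the helper describe are hypothetical, and it walks code points instead of chars so that characters outside the Basic Multilingual Plane would also stay intact.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CodePointDump {

    // Hypothetical helper: one line per character, in the form
    // <code point> "<character>" [<signed UTF-8 bytes>]
    static List<String> describe(String s) {
        List<String> lines = new ArrayList<>();
        s.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            byte[] utf8 = ch.getBytes(StandardCharsets.UTF_8);
            lines.add(cp + " \"" + ch + "\" " + Arrays.toString(utf8));
        });
        return lines;
    }

    public static void main(String[] args) {
        // "(λa.a)" with the lambda written as an escape sequence.
        for (String line : describe("(\u03bba.a)")) {
            System.out.println(line);
        }
    }
}
```

For the λ, the line shows code point 955 with the two bytes -50 and -69, the same two values that appeared on separate rows in the byte-based loop.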
