How can I read Chinese characters correctly using Scanner in Java?

Question

Programming language: Java
Task: designing a hash function that maps Chinese Strings to numbers
Problem: correct reading and displaying of Chinese characters

This is a homework question, but I'm not asking how to do it, just having trouble implementing the reading of Chinese characters.

A short description of my task: to design a hash function to map (Chinese) students' names in our class to their student IDs, and other satellite data (gender, phone and the like).

I'm still thinking about it, but as in other languages, I believe the idea is to feed each character's numeric encoding value into the hash function to come up with a unique value, if I'm not mistaken.

Here's what I have to test the validity of this train of thought:

// test whether console can read Chinese characters
// (requires: import java.util.Scanner;)
Scanner s = new Scanner(System.in);

System.out.print("Please enter a Chinese character: ");
// take the first char of the entered token and widen it to its numeric value
int chi = (int)s.next().toCharArray()[0];

System.out.println("\nThe string entered is " + chi);

If I use a simple System.out.println("character") statement, the correct character is displayed.

But as seen above, when I use Scanner to read input, I convert the String into a char array and then to its int Unicode equivalent, and it comes up with a ridiculous number that I can't display correctly.

I realize I can just use this erroneous value to design a hash function, but for the sake of not creating possible collisions (I don't know if these produce UNIQUE erroneous values), and for the sake of learning, could you point out how I might unify input of Chinese characters across different machines?

Always grateful for your thoughts. :D

Baggio.

Answer 1

When you create a Scanner, you can also tell it which character encoding to use, for example with the Scanner(InputStream source, String charsetName) constructor. See the Scanner documentation.
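As a rough illustration of that suggestion, here is a minimal sketch that passes an explicit charset name to Scanner. The class and method names (ScannerCharsetSketch, readFirstToken) are made up for this example, and the input is simulated with a byte array so the result does not depend on the terminal. Which charset a real console uses is an assumption you would have to check on each machine.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

public class ScannerCharsetSketch {

    // Hypothetical helper: reads the first whitespace-delimited token,
    // decoding the incoming bytes with the named charset.
    public static String readFirstToken(InputStream in, String charsetName) {
        Scanner scanner = new Scanner(in, charsetName);
        return scanner.next();
    }

    public static void main(String[] args) {
        // Simulated console input: the character U+4E2D encoded as UTF-8 bytes.
        byte[] utf8Input = "\u4E2D\n".getBytes(StandardCharsets.UTF_8);

        String token = readFirstToken(new ByteArrayInputStream(utf8Input), "UTF-8");
        System.out.println((int) token.charAt(0)); // 20013, i.e. U+4E2D

        // With a real console you would pass System.in instead. The charset
        // name must match what the terminal actually sends (an assumption;
        // it might be "UTF-8" on one machine and "GBK" on another):
        // Scanner console = new Scanner(System.in, "UTF-8");
    }
}
```

If the charset name does not match the bytes the terminal sends, the decoded char values will not be the expected Unicode values, which is one plausible cause of the "ridiculous number" in the question.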

Answer 2

When you are not using basic ASCII characters, you need to consider which character set you are using. Most often it will be UTF-8 but other character sets can be used as well.

One thing to keep in mind is that a non-ASCII character can take more than 1 byte when encoded. This is true of Chinese characters, which typically take 3 bytes each in UTF-8.
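A small sketch of the size difference, assuming UTF-8 as the encoding. The class and method names here are invented for illustration, and the Chinese character is written as a Unicode escape so the example does not depend on the source file's encoding.

```java
import java.nio.charset.StandardCharsets;

public class ByteLengthSketch {

    // Hypothetical helper: number of bytes the string occupies in UTF-8.
    public static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));      // 1
        System.out.println(utf8Length("\u4E2D")); // 3 (U+4E2D, a common Chinese character)
    }
}
```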

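When dealing with multibyte characters, you will need to think in terms of code points (the integer that Unicode assigns to each character, independent of the encoding used to store it) instead of single-byte characters. Note that a Java char is a 16-bit UTF-16 code unit, so a character outside the Basic Multilingual Plane, which includes some rare Chinese characters, takes two chars.

Java 5 and later let you work with code points in a String (for example with codePointAt), and Java 8 added String.codePoints() for iterating over them. Look at the Java API for String.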
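Here is a minimal sketch of iterating over code points, assuming Java 8 or later for String.codePoints(). The class and method names are hypothetical. The second string is U+20000, a rare CJK character written as a surrogate pair, to show where char count and code point count differ.

```java
public class CodePointSketch {

    // Hypothetical helper: the Unicode code points of a string, in order.
    public static int[] codePointsOf(String s) {
        return s.codePoints().toArray();
    }

    public static void main(String[] args) {
        String common = "\u4E2D\u6587"; // two characters in the Basic Multilingual Plane
        for (int cp : codePointsOf(common)) {
            System.out.printf("U+%04X%n", cp); // U+4E2D then U+6587
        }

        String rare = "\uD840\uDC00"; // U+20000, stored as two UTF-16 code units
        System.out.println(rare.length());                         // 2 chars
        System.out.println(rare.codePointCount(0, rare.length())); // 1 code point
    }
}
```

Reading only toCharArray()[0], as in the question, would return just the first half of such a surrogate pair.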

Answer 3

You are over-thinking this. Every String is already (conceptually) a sequence of characters, including Chinese characters. Encoding only comes into it when you need to convert the String into bytes, which you don't need to do for your assignment. Just use the String's hashCode(). In fact, when you create a HashMap<String,YourObject>, that's exactly what will happen behind the scenes.
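A minimal sketch of that approach, under the assumption that the satellite data fits in a small class. Student, its fields, the sample names, and the IDs are all hypothetical placeholders for illustration, not part of the assignment.

```java
import java.util.HashMap;
import java.util.Map;

public class NameLookupSketch {

    // Hypothetical record of the satellite data mentioned in the question.
    public static final class Student {
        public final String id;
        public final String gender;

        public Student(String id, String gender) {
            this.id = id;
            this.gender = gender;
        }
    }

    // Builds a name -> student index; HashMap calls String.hashCode() internally.
    public static Map<String, Student> buildIndex() {
        Map<String, Student> byName = new HashMap<>();
        // Hypothetical sample names, written as Unicode escapes.
        byName.put("\u5F20\u4F1F", new Student("S001", "M"));
        byName.put("\u674E\u5A1C", new Student("S002", "F"));
        return byName;
    }

    public static void main(String[] args) {
        Map<String, Student> byName = buildIndex();
        System.out.println(byName.get("\u5F20\u4F1F").id); // S001

        // hashCode() is computed from the chars of the String, not from bytes,
        // so no charset is involved once the String has been read correctly.
        System.out.println("\u4E2D".hashCode()); // 20013
    }
}
```

Note that hashCode() values are not guaranteed to be unique across different Strings; HashMap handles collisions itself by also comparing keys with equals().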

3 answers

solution1
3 2012-10-15 14:52:08

solution2
3 2012-10-15 15:13:48

solution3
1 ACCEPTED 2012-10-15 15:46:14
