
Reverse Parse Multi-Byte

I want to determine whether the last character in a buffer, defined as the bytes between begin and end, is English or Japanese. I read about UTF-8, where Japanese characters are two bytes long and always have a 1 in the high bit of the high byte, whereas the low byte can have either 1 or 0 in its high bit.

I am trying to return the integer 2 for Japanese (2 bytes), 1 for English, and 0 if the data in the buffer is malformed.

The signature is public static int NumChars(byte begin, byte end). Can you point me in the right direction? I am confused about how to approach this. I was thinking about using XOR to check whether the high bit of the high byte is 1 and then return 2, but I am not sure I have understood the problem correctly.

Jeevan, a UTF-8 character can be anywhere from 1 to 4 bytes long.

So if you want to get 2 for Japanese characters, use the Shift_JIS encoding (SJIS in Java) rather than UTF-8, where kana and common kanji take 3 bytes each.

Example:

String j = "大";     
System.out.println(j.getBytes("SJIS").length);
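
The two-byte model described in the question can also be parsed backward without decoding the buffer. The following is a minimal sketch, assuming the simplified rule from the question (a lead byte always has its high bit set, a trail byte may or may not) rather than the full Shift_JIS byte ranges. The class name, the method name lastCharWidth, and the (byte[], int, int) parameters are hypothetical stand-ins for the asker's NumChars. The idea is to count the unbroken run of high-bit bytes just before the last byte: an odd run means the last byte is the trail byte of a two-byte character.

```java
public class LastCharWidth {
    // Hypothetical helper: true when the byte's high bit is set.
    static boolean highBit(byte b) {
        return (b & 0x80) != 0;
    }

    // Returns 2 if buf[begin, end) ends with a two-byte character, 1 if it ends
    // with a single-byte character, and 0 if the range is empty/invalid or ends
    // with a lead byte that has no trail byte.
    // Assumption: the simplified model from the question, not real Shift_JIS ranges.
    static int lastCharWidth(byte[] buf, int begin, int end) {
        if (buf == null || begin < 0 || end > buf.length || begin >= end) {
            return 0;
        }
        // Count consecutive high-bit bytes immediately before the last byte.
        int run = 0;
        for (int i = end - 2; i >= begin && highBit(buf[i]); i--) {
            run++;
        }
        // The first byte with a clear high bit (or the buffer start) is a
        // character boundary, so the bytes in the run pair up from there.
        if (run % 2 == 1) {
            return 2; // the last byte is the trail byte of the final pair
        }
        // Even run: the last byte starts a new character on its own.
        return highBit(buf[end - 1]) ? 0 : 1;
    }

    public static void main(String[] args) {
        byte[] japaneseLast = {0x61, 0x62, 0x63, (byte) 0x91, (byte) 0xE5}; // "abc大" in Shift_JIS
        byte[] englishLast = {(byte) 0x91, (byte) 0xE5, 0x61};              // "大a" in Shift_JIS
        byte[] dangling = {0x61, (byte) 0x91};                              // lead byte with no trail byte
        System.out.println(lastCharWidth(japaneseLast, 0, japaneseLast.length)); // prints 2
        System.out.println(lastCharWidth(englishLast, 0, englishLast.length));   // prints 1
        System.out.println(lastCharWidth(dangling, 0, dangling.length));         // prints 0
    }
}
```

Looking only at the last two bytes is not enough under this model, because a high-bit byte may be either a lead byte or a trail byte; the parity of the run is what resolves the ambiguity.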

There is a discussion about this in the thread "guessing-the-encoding-of-text-represented-as-byte-in-java".

If you can get the buffer, or part of it, in string form, then you can use regular expressions to match the character sets like this:

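   import java.nio.charset.StandardCharsets;
   import java.util.Arrays;

   public class LastCharCheck {
       public static void main(String[] args) {
           // Each pattern matches when the LAST character of the string is in the range.
           String english = ".*[\\x{20}-\\x{7E}]$";      // printable ASCII
           String hiragana = ".*[\\x{3041}-\\x{3096}]$"; // hiragana letters

           byte[] buffer = {97, 98, 99, -29, -127, -126}; // "abcあ" encoded as UTF-8
           System.out.println("buffer: " + Arrays.toString(buffer));
           String s = new String(buffer, StandardCharsets.UTF_8);

           System.out.println(s + " is hiragana=" + s.matches(hiragana));
           System.out.println(s + " is english=" + s.matches(english));

           s = "abcd";
           System.out.println(s + " is hiragana=" + s.matches(hiragana));
           System.out.println(s + " is english=" + s.matches(english));
       }
   }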

Output:

buffer: [97, 98, 99, -29, -127, -126]
abcあ is hiragana=true
abcあ is english=false
abcd is hiragana=false
abcd is english=true
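
If the buffer really is UTF-8, the byte length of its last character can also be read directly from the bytes, since continuation bytes always have the form 10xxxxxx. Below is a rough sketch of that backward scan; the class and method names are hypothetical, and it only checks the lead/continuation structure (it does not reject overlong encodings or surrogate code points). Note that it reports 3, not 2, for the hiragana in the buffer above.

```java
public class Utf8LastChar {
    // Returns the byte length (1 to 4) of the last UTF-8 character in
    // buf[begin, end), or 0 if the range is empty/invalid or the tail is malformed.
    static int lastUtf8CharLength(byte[] buf, int begin, int end) {
        if (buf == null || begin < 0 || end > buf.length || begin >= end) {
            return 0;
        }
        int i = end - 1;
        int continuation = 0;
        // Walk backward over continuation bytes (bit pattern 10xxxxxx).
        while (i >= begin && (buf[i] & 0xC0) == 0x80) {
            continuation++;
            if (continuation > 3) {
                return 0; // no UTF-8 character has more than 3 continuation bytes
            }
            i--;
        }
        if (i < begin) {
            return 0; // only continuation bytes, no lead byte in range
        }
        int lead = buf[i] & 0xFF;
        int expected;
        if (lead < 0x80) {
            expected = 1;            // 0xxxxxxx: ASCII
        } else if ((lead & 0xE0) == 0xC0) {
            expected = 2;            // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            expected = 3;            // 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            expected = 4;            // 11110xxx
        } else {
            return 0;                // not a valid lead byte
        }
        // The lead byte must announce exactly the bytes that follow it.
        return expected == continuation + 1 ? expected : 0;
    }

    public static void main(String[] args) {
        byte[] buffer = {97, 98, 99, -29, -127, -126}; // "abcあ" encoded as UTF-8
        System.out.println(lastUtf8CharLength(buffer, 0, buffer.length)); // prints 3
        System.out.println(lastUtf8CharLength(buffer, 0, 3));             // prints 1
        System.out.println(lastUtf8CharLength(buffer, 0, 5));             // prints 0 (truncated character)
    }
}
```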

You will have to find out which Japanese character sets your program uses, such as Kanji, Hiragana, or Katakana. For more information, read the article "regular-expressions-for-japanese-text".
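
As an alternative to hand-written ranges, the script of the last character can be looked up with Java's built-in Character.UnicodeScript. This is only a sketch: the class and method names are hypothetical, and treating HAN as Japanese is an assumption, since that script is shared with Chinese. Some Japanese punctuation and marks are classified as COMMON and would not be matched here.

```java
public class LastCharScript {
    // Returns the Unicode script of the last code point in s.
    // Throws IllegalArgumentException for an empty string.
    static Character.UnicodeScript lastScript(String s) {
        if (s.isEmpty()) {
            throw new IllegalArgumentException("empty string");
        }
        // codePointBefore handles surrogate pairs at the end of the string.
        int codePoint = s.codePointBefore(s.length());
        return Character.UnicodeScript.of(codePoint);
    }

    // Assumption: "Japanese" means Hiragana, Katakana, or Han (kanji).
    static boolean endsWithJapanese(String s) {
        if (s.isEmpty()) {
            return false;
        }
        Character.UnicodeScript script = lastScript(s);
        return script == Character.UnicodeScript.HIRAGANA
                || script == Character.UnicodeScript.KATAKANA
                || script == Character.UnicodeScript.HAN;
    }

    public static void main(String[] args) {
        System.out.println(lastScript("abc\u3042"));       // "abcあ", prints HIRAGANA
        System.out.println(lastScript("abc\u5927"));       // "abc大", prints HAN
        System.out.println(lastScript("abcd"));            // prints LATIN
        System.out.println(endsWithJapanese("abc\u30AB")); // "abcカ", prints true
        System.out.println(endsWithJapanese("abcd"));      // prints false
    }
}
```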
