
Reverse Parse Multi-Byte

I want to determine whether the last character in a buffer, defined as the bytes between begin and end, is English or Japanese. I read about UTF-8, where Japanese characters are two bytes long and always have 1 in the high bit of the high byte, whereas the low byte can have either 1 or 0 in the high bit.

I am trying to return the integer 2 for Japanese (2 bytes), 1 for English, and 0 if the data in the buffer is malformed.

The signature is public static int NumChars(byte begin, byte end). Can you point me in the right direction? I am confused about how to approach this. I was thinking about using XOR to check whether the MSB of the high byte is 1 and then return 2, but I am not sure I have understood this correctly.
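One way to approach the encoding exactly as the question describes it (which is not how real UTF-8 works) is to scan backwards from the end and count how many consecutive high-bit bytes sit immediately before the last byte. The sketch below assumes the buffer starts on a character boundary and is well formed up to the last character; the class name, the method name lastCharWidth, and the byte[]/int parameters are hypothetical stand-ins for the signature in the question.

```java
final class LastCharWidth {
    /**
     * Width of the last character in buf[begin, end) for the two-byte scheme
     * described in the question (an assumption, not real UTF-8):
     * returns 2 for a two-byte character, 1 for a one-byte character, 0 if malformed.
     */
    static int lastCharWidth(byte[] buf, int begin, int end) {
        if (buf == null || begin < 0 || end > buf.length || begin >= end) {
            return 0; // empty or invalid range
        }
        int last = end - 1;
        // Count the consecutive high-bit bytes immediately before the last byte.
        int run = 0;
        for (int i = last - 1; i >= begin && (buf[i] & 0x80) != 0; i--) {
            run++;
        }
        if (run % 2 == 1) {
            // An odd run leaves one unpaired lead byte, so the last byte is its trail byte.
            return 2;
        }
        // An even run pairs up completely, so the last byte starts a new character.
        // A clear high bit is a one-byte character; a set high bit is a lead byte with no trail.
        return (buf[last] & 0x80) == 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        byte[] ascii = {0x41, 0x42};
        byte[] twoByte = {0x41, (byte) 0x81, 0x42};
        byte[] dangling = {0x41, (byte) 0x81};
        System.out.println(lastCharWidth(ascii, 0, ascii.length));       // 1
        System.out.println(lastCharWidth(twoByte, 0, twoByte.length));   // 2
        System.out.println(lastCharWidth(dangling, 0, dangling.length)); // 0
    }
}
```

The high bit is tested with a bitwise AND against 0x80 rather than XOR. Because Java bytes are signed, checking buf[i] < 0 would be an equivalent test.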

Jeevan, a UTF-8 character can be anywhere from 1 to 4 bytes long.
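For real UTF-8, the length of the last character can be found by stepping back over continuation bytes (bit pattern 10xxxxxx) until a lead byte is reached, then checking that the lead byte announces the same length. The following is a minimal sketch of that idea; the class and method names are hypothetical, and it does not reject overlong encodings or surrogate code points.

```java
import java.nio.charset.StandardCharsets;

final class Utf8Tail {
    /** Byte length (1 to 4) of the last UTF-8 character in buf[begin, end), or 0 if malformed. */
    static int lastUtf8CharLength(byte[] buf, int begin, int end) {
        if (buf == null || begin < 0 || end > buf.length || begin >= end) {
            return 0; // empty or invalid range
        }
        int i = end - 1;
        int continuation = 0;
        // Step back over at most three continuation bytes (10xxxxxx).
        while (i >= begin && (buf[i] & 0xC0) == 0x80 && continuation < 3) {
            i--;
            continuation++;
        }
        if (i < begin) {
            return 0; // only continuation bytes, no lead byte
        }
        int lead = buf[i] & 0xFF;
        int expected;
        if (lead < 0x80) {
            expected = 1;                 // 0xxxxxxx
        } else if ((lead & 0xE0) == 0xC0) {
            expected = 2;                 // 110xxxxx
        } else if ((lead & 0xF0) == 0xE0) {
            expected = 3;                 // 1110xxxx
        } else if ((lead & 0xF8) == 0xF0) {
            expected = 4;                 // 11110xxx
        } else {
            return 0;                     // stray continuation byte or invalid lead
        }
        // The lead byte must announce exactly the number of bytes we walked over.
        return expected == continuation + 1 ? expected : 0;
    }

    public static void main(String[] args) {
        byte[] hiragana = "abc\u3042".getBytes(StandardCharsets.UTF_8); // "abc" + HIRAGANA LETTER A
        byte[] latin = "abc".getBytes(StandardCharsets.UTF_8);
        System.out.println(lastUtf8CharLength(hiragana, 0, hiragana.length)); // 3
        System.out.println(lastUtf8CharLength(latin, 0, latin.length));       // 1
    }
}
```

Note that the hiragana character takes three bytes in UTF-8, not two, which is why a two-byte assumption does not hold for this encoding.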

So if you want to print 2 for Japanese characters, use this encoding instead (it is a legacy Japanese encoding, not a Unicode encoding):

SJIS

Example:

String j = "大";
// getBytes(String) throws UnsupportedEncodingException if the charset is unavailable
System.out.println(j.getBytes("SJIS").length); // prints 2

There is a discussion about this in this thread: guessing-the-encoding-of-text-represented-as-byte-in-java

If you can get the buffer, or part of it, in string form, then you can use regular expressions to match the character sets like this:

   String english = ".*[\\x{20}-\\x{7E}]$";
   String hiragana = ".*[\\x{3041}-\\x{3096}]$";
   
   byte[] buffer = {97, 98, 99, -29, -127, -126}; //"abcあ"
   System.out.println("buffer: "+Arrays.toString(buffer));
   String s = new String(buffer,"utf-8") ;

   System.out.println(s + " is hiragana=" + s.matches(hiragana));
   System.out.println(s + " is english=" + s.matches(english));

   s = "abcd";
   System.out.println(s + " is hiragana=" + s.matches(hiragana));
   System.out.println(s + " is english=" + s.matches(english));

Output:

buffer: [97, 98, 99, -29, -127, -126]
abcあ is hiragana=true
abcあ is english=false
abcd is hiragana=false
abcd is english=true

You will have to find out which Japanese character sets your program is using, such as Kanji, Hiragana, Katakana, etc. For more information, read this article: regular-expressions-for-japanese-text
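As an alternative to hand-written ranges, the last code point of a decoded string can be classified with the JDK's Character.UnicodeScript. This is only a sketch: the class name, the method classifyLast, and its three return labels are hypothetical. It treats printable ASCII as "english" (mirroring the regular expression above) and the Hiragana, Katakana and Han scripts as "japanese". Han also covers Chinese text, so that mapping is an assumption about the input.

```java
final class LastCharScript {
    /** Classifies the last code point of s as "english", "japanese" or "other". */
    static String classifyLast(String s) {
        if (s == null || s.isEmpty()) {
            return "other";
        }
        // codePointBefore handles surrogate pairs at the end of the string.
        int cp = s.codePointBefore(s.length());
        if (cp >= 0x20 && cp <= 0x7E) {
            return "english"; // printable ASCII, same range as the regex above
        }
        switch (Character.UnicodeScript.of(cp)) {
            case HIRAGANA:
            case KATAKANA:
            case HAN: // Kanji; shared with Chinese, so this is an assumption
                return "japanese";
            default:
                return "other";
        }
    }

    public static void main(String[] args) {
        System.out.println(classifyLast("abc\u3042")); // "abc" + HIRAGANA LETTER A -> japanese
        System.out.println(classifyLast("abcd"));      // english
    }
}
```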
