Character strings to binary string - why are some characters multi-byte?
This code is supposed to convert character strings to binary strings, but for some strings it returns a String with 16 binary digits rather than the 8 I expected.
public class aaa {
    public static void main(String argv[]){
        String nux="ª";
        String nux2="Ø";
        String nux3="(";
        byte []bites = nux.getBytes();
        byte []bites2 = nux2.getBytes();
        byte []bites3 = nux3.getBytes();
        System.out.println(AsciiToBinary(nux));
        System.out.println(AsciiToBinary(nux2));
        System.out.println(AsciiToBinary(nux3));
        System.out.println("number of bytes :"+bites.length);
        System.out.println("number of bytes :"+bites2.length);
        System.out.println("number of bytes :"+bites3.length);
    }
    public static String AsciiToBinary(String asciiString){
        byte[] bytes = asciiString.getBytes();
        StringBuilder binary = new StringBuilder();
        for (byte b : bytes)
        {
            int val = b;
            for (int i = 0; i < 8; i++)
            {
                binary.append((val & 128) == 0 ? 0 : 1);
                val <<= 1;
            }
            binary.append(' ');
        }
        return binary.toString();
    }
}
For the first two strings, I don't understand why getBytes() returns 2 bytes, since they are single-character strings.
Compiled here: https://ideone.com/AbxBZ9
This returns:
11000010 10101010
11000011 10011000
00101000
number of bytes :2
number of bytes :2
number of bytes :1
I am using the code from: Convert A String (like testing123) To Binary In Java
NetBeans IDE 8.1
A character is not always 1 byte long. Think about it: a byte can hold only 256 distinct values, while many languages, such as Chinese or Japanese, have thousands of characters. How would you map all of those characters to single bytes?
You are using UTF-8 (one of the many, many ways of mapping characters to bytes). Looking up a character table for UTF-8 and searching for the sequence 11000010 10101010, I arrive at
U+00AA ª 11000010 10101010
which is the UTF-8 encoding for ª.
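To see where those two bytes come from, here is a minimal sketch of the two-byte UTF-8 layout (110xxxxx 10xxxxxx), which covers code points U+0080 to U+07FF. The class and method names (TwoByteUtf8, encodeTwoByte) are made up for illustration; in real code you would let getBytes() do the encoding.

```java
public class TwoByteUtf8 {
    // Encode a code point in the range U+0080..U+07FF as two UTF-8 byte values.
    // Hypothetical helper for illustration only.
    static int[] encodeTwoByte(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not a two-byte UTF-8 code point");
        }
        int first = 0b11000000 | (codePoint >> 6);    // 110xxxxx: the upper 5 bits
        int second = 0b10000000 | (codePoint & 0x3F); // 10xxxxxx: the lower 6 bits
        return new int[] { first, second };
    }

    public static void main(String[] args) {
        int[] bytes = encodeTwoByte(0xAA); // U+00AA is the character ª
        System.out.println(Integer.toBinaryString(bytes[0]) + " "
                + Integer.toBinaryString(bytes[1]));
        // prints: 11000010 10101010
    }
}
```

The same helper applied to 0xD8 (Ø) gives 11000011 10011000, which matches the second line of the output in the question.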
UTF-8 is often the default character encoding (charset) for Java, but you cannot rely on this. That is why you should always specify a charset when converting strings to bytes or vice versa.
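As a rough sketch of what specifying the charset looks like, the following passes an explicit Charset to getBytes() and prints the same character under two encodings. The class and method names (CharsetDemo, toBinary) are my own, not from the question; the character is written as a Unicode escape so that the source file encoding does not affect the result.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetDemo {
    // Render each byte of the encoded string as 8 binary digits, separated by spaces.
    static String toBinary(String s, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes are not sign-extended to 32 bits.
            String bits = Integer.toBinaryString(b & 0xFF);
            for (int i = bits.length(); i < 8; i++) {
                sb.append('0'); // left-pad to 8 digits
            }
            sb.append(bits);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u00AA"; // the character ª
        System.out.println(toBinary(s, StandardCharsets.UTF_8));      // 11000010 10101010
        System.out.println(toBinary(s, StandardCharsets.ISO_8859_1)); // 10101010
    }
}
```

With ISO-8859-1 the same character is a single byte, so the "16 digits instead of 8" result in the question is a property of the encoding, not of the string.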
You can see why some characters need two bytes by running this simple code:
// integer - binary
System.out.println(Byte.MIN_VALUE);
// -128 - 0b11111111111111111111111110000000
System.out.println(Byte.MAX_VALUE);
// 127 - 0b1111111
System.out.println((int) Character.MIN_VALUE);
// 0 - 0b0
System.out.println((int) Character.MAX_VALUE);
// 65535 - 0b1111111111111111
As you can see, Byte.MAX_VALUE can be written with just 7 bits, so it fits in 1 byte (01111111).
If you cast Character.MIN_VALUE to an integer, it is 0. Its binary form needs only one bit, so it also fits in 1 byte (00000000).
But what about Character.MAX_VALUE? In binary it is 1111111111111111, which is 65535 in decimal, and it needs 2 bytes (11111111 11111111).
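So a Java char, whose numeric value lies between 0 and 65535, fits in 1 or 2 bytes when stored as a raw 16-bit value. How many bytes getBytes() actually returns depends on the charset: in UTF-8, values up to 127 take 1 byte, values from 128 to 2047 take 2 bytes, and larger values such as 中 (U+4E2D) take 3 bytes.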
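A small sketch to check those byte counts yourself, assuming UTF-8 is the encoding of interest. The class and method names (Utf8Lengths, utf8Length) are illustrative, and the characters are written as Unicode escapes so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("("));      // 1 (U+0028, ASCII range)
        System.out.println(utf8Length("\u00AA")); // 2 (the character ª)
        System.out.println(utf8Length("\u00D8")); // 2 (the character Ø)
        System.out.println(utf8Length("\u4E2D")); // 3 (the character 中)
        System.out.println(utf8Length(String.valueOf(Character.MAX_VALUE))); // 3
    }
}
```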
Hope you understand.