
Java Byte to Char conversion

I read from a TCP/IP socket s:

byte[] bbuf = new byte[30];
s.getInputStream().read(bbuf);
for (int i = 0; i < bbuf.length; i++)
{
     System.out.println(Integer.toHexString( (int) (bbuf[i] & 0xff)));
}

This outputs CA 68 9F 75, which is what I would expect. Now I want to use chars instead:

char[] cbuf = new char[30];
BufferedReader input = new BufferedReader(new InputStreamReader(s.getInputStream()));
input.read(cbuf);  // fill cbuf from the reader (uses the platform default charset)
for (int i = 0; i < cbuf.length; i++)
{
     System.out.println(Integer.toHexString( (int) (cbuf[i] )));
}

Now the output is CA 68 178 75, so the third byte (and only the third byte) makes the difference. I assume it has to do with character sets and that I have to specify one in the InputStreamReader, but I have no idea how to find out which character set to use. I am also surprised, if character sets are the cause, that exactly one character gets mangled. I tried all kinds of other characters, and this is the only one I found that behaves this way.

Who can solve the mystery?

Your problem is that you are comparing apples with oranges; bytes are not the same as characters. In your code, the character Ÿ is represented in the following ways:

  • 9F (a byte, encoded using Windows-1252)
  • 178 (a char, encoded using UTF-16, which is the encoding Java always uses for chars internally)

As proof of what I'm saying, check this:

String myString = "Caña";
byte[] bbuf = myString.getBytes();     // [ 43, 61, C3, B1, 61 ]   (UTF-8 on my machine)
char[] cbuf = myString.toCharArray();  // [ 43, 61, F1, 61 ]  (Java uses UTF-16 internally)
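The same comparison can be made for the byte from the question. The sketch below is only an illustration (the class and method names are made up, and it assumes the JVM ships the windows-1252 charset, which is not one of the charsets Java guarantees). It also suggests why only one of the four bytes changed: Windows-1252 and ISO-8859-1 agree everywhere except in the range 0x80 to 0x9F, so CA, 68 and 75 decode to chars with the same numeric value, while 9F does not.

```java
import java.nio.charset.Charset;

final class ByteVsChar {
    // Decodes one byte with the named charset and returns the resulting
    // UTF-16 code unit as an int. (Hypothetical helper for illustration.)
    static int decodeOne(byte b, String charsetName) {
        String s = new String(new byte[] { b }, Charset.forName(charsetName));
        return s.charAt(0);
    }

    public static void main(String[] args) {
        byte b = (byte) 0x9F;
        // Windows-1252 maps 0x9F to U+0178 (Ÿ).
        System.out.println(Integer.toHexString(decodeOne(b, "windows-1252"))); // 178
        // ISO-8859-1 maps every byte to the code point with the same value.
        System.out.println(Integer.toHexString(decodeOne(b, "ISO-8859-1")));   // 9f
    }
}
```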

Now an analysis of your problem:

  • You took a byte array from a String, I guess by calling myString.getBytes(). As you didn't specify an encoding, the system used the default on your machine (Windows-1252).

  • Reading those bytes into a String using InputStreamReader, etc. is not actually a problem, because you are reading from another (or the same) Windows machine. The problem arises when you take the array of chars (instead of an array of bytes) and expect the same values. Use myString.getBytes() instead of myString.toCharArray() and you'll see your bytes correctly.

Finally, some advice:

  • Always declare the encoding explicitly when you convert between Strings and byte arrays:

     byte[] bbuf = myString.getBytes(Charset.forName("UTF-8"));     // String to bytes
     String decoded = new String(bbuf, Charset.forName("UTF-8"));   // bytes back to String
  • Never mix chars and bytes; they are not the same.

InputStreamReader converts the bytes from the input stream to characters using a character encoding. Since you didn't specify explicitly which character encoding should be used, it uses the default character encoding of your system.

How the bytes are converted depends on what character encoding is being used.
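To make the dependence on the encoding concrete, here is a minimal sketch that passes an explicit Charset to InputStreamReader. A ByteArrayInputStream stands in for the socket stream, and the class and method names are invented for the example. It reads until end of stream, which on a real socket would block until the peer closes the connection. It also assumes windows-1252 is available in the JVM.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class ExplicitCharsetDemo {
    // Decodes everything in the stream with the given charset and returns
    // the resulting chars as space-separated lowercase hex.
    static String decodeToHex(InputStream in, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try {
            Reader reader = new InputStreamReader(in, cs); // explicit charset, no platform default
            int c;
            while ((c = reader.read()) != -1) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(Integer.toHexString(c));
            }
        } catch (IOException e) {
            // Wrapped only to keep the example short.
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] wire = { (byte) 0xCA, 0x68, (byte) 0x9F, 0x75 }; // the bytes from the question
        System.out.println(decodeToHex(new ByteArrayInputStream(wire),
                Charset.forName("windows-1252")));   // ca 68 178 75
        System.out.println(decodeToHex(new ByteArrayInputStream(wire),
                StandardCharsets.ISO_8859_1));       // ca 68 9f 75
    }
}
```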

If the data is binary data and does not represent text encoded with some character encoding, then using InputStreamReader is the wrong way to read this data.
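For binary data, one option is to stay with bytes the whole way. The sketch below uses DataInputStream.readFully, which blocks until the requested number of bytes has arrived (or throws EOFException if the stream ends first), unlike a single read call, which may return fewer bytes. The class, the method names and the fixed length of 4 are assumptions for illustration. A ByteArrayInputStream again stands in for s.getInputStream().

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

final class BinaryRead {
    // Reads exactly n bytes from the stream without any charset decoding.
    static byte[] readExactly(InputStream in, int n) {
        byte[] buf = new byte[n];
        try {
            new DataInputStream(in).readFully(buf);
        } catch (IOException e) {
            // Wrapped only to keep the example short.
            throw new UncheckedIOException(e);
        }
        return buf;
    }

    // Formats bytes as space-separated two-digit uppercase hex.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder();
        for (byte b : data) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF)); // mask to avoid sign extension
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                new byte[] { (byte) 0xCA, 0x68, (byte) 0x9F, 0x75 });
        System.out.println(toHex(readExactly(in, 4))); // CA 68 9F 75
    }
}
```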

See also: Streams and readers/writers

I don't know if there are any side effects here, but I do this:

buf = new String(buffer, StandardCharsets.ISO_8859_1).toCharArray();

Where "buffer" is a byte array I get from reading from a GZIPInputStream. This is just an expansion on Morgano's explanation above.
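On the question of side effects: ISO-8859-1 maps each byte value 0x00 to 0xFF to the char with the same numeric value, so the conversion itself loses nothing. The check below is a small sketch of that claim (class and method names are made up). The caveat is that if the bytes are really text in another encoding such as UTF-8, the resulting chars will not be the intended characters, only a one-to-one copy of the byte values.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class Latin1RoundTrip {
    // Returns true if all 256 byte values survive bytes -> chars -> bytes
    // through ISO-8859-1, with each char equal to its unsigned byte value.
    static boolean roundTripsAllBytes() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        char[] chars = new String(all, StandardCharsets.ISO_8859_1).toCharArray();
        for (int i = 0; i < 256; i++) {
            if (chars[i] != i) {
                return false;
            }
        }
        byte[] back = new String(chars).getBytes(StandardCharsets.ISO_8859_1);
        return Arrays.equals(all, back);
    }

    public static void main(String[] args) {
        System.out.println(roundTripsAllBytes()); // true
    }
}
```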


 