Understanding encoding in character streams

Question

I'm trying to understand how encodings are applied by character streams in Java. For the discussion let's use the following code example:

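import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class Main {
    public static void main(String[] args) throws Exception {
        byte[] utf8Input = new byte[] { (byte) 0xc3, (byte) 0xb6 }; // 'ö'
        ByteArrayOutputStream utf160Out = new ByteArrayOutputStream();

        // Decode the raw bytes as UTF-8; encode whatever is written as UTF-16.
        InputStreamReader is = new InputStreamReader(new ByteArrayInputStream(utf8Input), StandardCharsets.UTF_8);
        OutputStreamWriter os = new OutputStreamWriter(utf160Out, StandardCharsets.UTF_16);

        int len; // holds one decoded char per read(), or -1 at end of stream
        while ((len = is.read()) != -1) {
            os.write(len);
        }
        os.close();
    }
}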

The program reads the UTF-8 encoded character 'ö' from the byte array utf8Input and writes it UTF-16 encoded to utf160Out. In particular, the ByteArrayInputStream on utf8Input just streams the bytes as-is, and the InputStreamReader then decodes them with a UTF-8 decoder. Dumping the len variable yields 0xf6, which is the Unicode code point for 'ö'. The OutputStreamWriter writes using UTF-16 encoding without any knowledge of the input encoding.
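To make the behavior described above easier to check, here is a minimal sketch that runs the same read/write loop and dumps the bytes that end up in the output stream. The class name TranscodeDump and the helpers transcode and hex are hypothetical names introduced only for this illustration; the stream classes and charsets are the ones from the snippet above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class TranscodeDump {
    // Hypothetical helper: decode the input bytes with `from`, re-encode with `to`.
    public static byte[] transcode(byte[] input, Charset from, Charset to) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(input), from);
             Writer writer = new OutputStreamWriter(out, to)) {
            int c;
            while ((c = reader.read()) != -1) {
                writer.write(c); // writes the char held in the low 16 bits of c
            }
        } // closing the writer flushes the encoded bytes into `out`
        return out.toByteArray();
    }

    // Hypothetical helper: space-separated lowercase hex dump.
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8Input = { (byte) 0xc3, (byte) 0xb6 }; // 'ö' in UTF-8
        byte[] utf16 = transcode(utf8Input, StandardCharsets.UTF_8, StandardCharsets.UTF_16);
        System.out.println(hex(utf16)); // prints "fe ff 00 f6"
    }
}
```

The leading fe ff is the big-endian byte-order mark that Java's UTF_16 charset writes when encoding; 00 f6 is the single UTF-16 code unit for 'ö'.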

How does the OutputStreamWriter know the input encoding (here: UTF-8)? Is there an assumed internal representation that an InputStreamReader also maps to? So basically, we are saying: read this input, it is UTF-8 encoded, and decode it to our internal encoding X. An OutputStreamWriter is given the target encoding and expects its input to be encoded with X. Is this correct? If so, what is the internal encoding? UTF-16, as mentioned in "What is the Java's internal represention for String? Modified UTF-8? UTF-16?"?

Answer 1

The read() method returns a Java char value, which is an unsigned 2-byte binary number (0-65535).

The actual return type is int (signed 4-byte binary number) to allow for a special -1 value meaning end-of-stream.
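As a small sketch of why the wider return type matters, the following collects the raw values that read() returns, including the final -1. The class name ReadReturnValue and the method rawReads are hypothetical names used only for this illustration; StringReader is used so that no byte decoding is involved.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class ReadReturnValue {
    // Hypothetical helper: returns every value read() produced, ending with -1.
    public static List<Integer> rawReads(String text) throws IOException {
        List<Integer> values = new ArrayList<>();
        try (Reader reader = new StringReader(text)) {
            int value;
            do {
                value = reader.read(); // a char value 0..65535, or -1 at end of stream
                values.add(value);
            } while (value != -1);
        }
        return values;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(rawReads("\u00f6")); // prints "[246, -1]"
        // The largest char value stays distinguishable from the end-of-stream marker.
        System.out.println(rawReads("\uffff")); // prints "[65535, -1]"
    }
}
```

If read() returned char directly, the value 65535 and an end-of-stream marker could not both be represented, which is why the result is an int that you cast to char after checking for -1.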

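A Java char is a UTF-16 code unit. This means that every character from the Basic Multilingual Plane appears unencoded, i.e. the char value is the Unicode code point, while characters outside the BMP take two chars (a surrogate pair). So the internal encoding X you are asking about is UTF-16: the InputStreamReader decodes bytes into chars, and the OutputStreamWriter encodes chars into the target charset.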
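To illustrate the difference between a BMP character and one outside it, here is a short sketch. The class name CharVsCodePoint and the method utf16Units are hypothetical; Character.toChars is the standard library call that converts a code point to its UTF-16 representation. U+1F600 is picked only as an example of a supplementary code point.

```java
public class CharVsCodePoint {
    // Hypothetical helper: the UTF-16 code units Java uses for one code point.
    public static char[] utf16Units(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        // U+00F6 ('ö') is in the BMP: one char, equal to the code point.
        char[] bmp = utf16Units(0x00F6);
        System.out.println(bmp.length);                    // prints "1"
        System.out.println(Integer.toHexString(bmp[0]));   // prints "f6"

        // U+1F600 is outside the BMP: two chars forming a surrogate pair.
        char[] supplementary = utf16Units(0x1F600);
        System.out.println(supplementary.length);                    // prints "2"
        System.out.println(Integer.toHexString(supplementary[0]));   // prints "d83d"
        System.out.println(Integer.toHexString(supplementary[1]));   // prints "de00"
    }
}
```

For such a character, a loop like the one in the question would see two read() results, one per surrogate, and writing both in order still produces the correct output.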

1 answer

solution1
1 ACCEPTED 2019-08-10 11:10:08
