
Understanding encoding in character streams

I'm trying to understand how encodings are applied by character streams in Java. For the discussion let's use the following code example:

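import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

public class Main {
  public static void main(String[] args) throws Exception {
    byte[] utf8Input = new byte[] { (byte) 0xc3, (byte) 0xb6 }; // 'ö' encoded as UTF-8
    ByteArrayOutputStream utf160Out = new ByteArrayOutputStream();

    // Decode the incoming bytes as UTF-8; encode the outgoing chars as UTF-16.
    InputStreamReader is = new InputStreamReader(new ByteArrayInputStream(utf8Input), StandardCharsets.UTF_8);
    OutputStreamWriter os = new OutputStreamWriter(utf160Out, StandardCharsets.UTF_16);

    int len; // despite the name, holds one decoded char per read(), or -1 at end of stream
    while ((len = is.read()) != -1) {
      os.write(len);
    }
    os.close(); // flushes the encoded bytes into utf160Out
  }
}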

The program reads the UTF-8 encoded character 'ö' from the byte array utf8Input and writes it, UTF-16 encoded, to utf160Out. In particular, the ByteArrayInputStream on utf8Input just streams the bytes as-is, and the InputStreamReader then decodes them with a UTF-8 decoder. Dumping the len variable yields 0xf6, which is the Unicode code point of 'ö'. The OutputStreamWriter writes using UTF-16 encoding without any knowledge of the input encoding.
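To make the result concrete, here is a minimal sketch of the same transcoding step that also dumps the bytes produced. The class name TranscodeDemo and the helpers utf8ToUtf16 and hex are hypothetical names introduced only for this illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original question.
class TranscodeDemo {

    // Decodes the given UTF-8 bytes to chars and re-encodes those chars as UTF-16.
    static byte[] utf8ToUtf16(byte[] utf8Input) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Reader reader = new InputStreamReader(
                     new ByteArrayInputStream(utf8Input), StandardCharsets.UTF_8);
             Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_16)) {
            int c; // one char per read(), or -1 at end of stream
            while ((c = reader.read()) != -1) {
                writer.write(c);
            }
        } // closing the writer flushes the encoder into 'out'
        return out.toByteArray();
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x ", b & 0xff));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8Input = { (byte) 0xc3, (byte) 0xb6 }; // 'ö' in UTF-8
        System.out.println(hex(utf8ToUtf16(utf8Input))); // prints "fe ff 00 f6"
    }
}
```

The leading fe ff is the byte-order mark that the UTF_16 charset writes when encoding; 00 f6 is the big-endian UTF-16 form of 'ö'. Using StandardCharsets.UTF_16BE instead should omit the mark.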

How does the OutputStreamWriter know the input encoding (here: UTF-8)? Is there an assumed internal representation that the InputStreamReader maps to? In other words, are we saying: read this input, which is UTF-8 encoded, and decode it to our internal encoding X? An OutputStreamWriter is then given the target encoding and expects its input to be encoded with X. Is this correct? If so, what is the internal encoding? Is it UTF-16, as mentioned in "What is the Java's internal represention for String? Modified UTF-8? UTF-16?"

The read() method returns a Java char value, which is an unsigned 2-byte binary number (0-65535).

The actual return type is int (signed 4-byte binary number) to allow for a special -1 value meaning end-of-stream.

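A Java char is a UTF-16 code unit. This means that every character from the Basic Multilingual Plane appears unencoded, i.e. the char value is the Unicode code point, while characters outside that plane take two chars (a surrogate pair). So the internal encoding you call X is UTF-16: the InputStreamReader decodes the UTF-8 bytes into UTF-16 chars, and the OutputStreamWriter encodes those chars into its target charset without needing to know where they came from.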
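The following sketch illustrates the point about char values by collecting what read() returns for a character inside the Basic Multilingual Plane and for one outside it. The class name CodeUnitDemo and the helper readCodeUnits are hypothetical names chosen for this example, and U+1F600 is just an arbitrary supplementary character.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of the original answer.
class CodeUnitDemo {

    // Decodes UTF-8 bytes and returns every value read() yields before -1.
    static List<Integer> readCodeUnits(byte[] utf8) throws IOException {
        List<Integer> units = new ArrayList<>();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                units.add(c);
            }
        }
        return units;
    }

    public static void main(String[] args) throws IOException {
        // 'ö' (U+00F6) is in the BMP: one char, equal to the code point.
        for (int unit : readCodeUnits(new byte[] { (byte) 0xc3, (byte) 0xb6 })) {
            System.out.println(Integer.toHexString(unit)); // prints "f6"
        }
        // U+1F600 is outside the BMP: two chars, a surrogate pair.
        byte[] supplementary = { (byte) 0xf0, (byte) 0x9f, (byte) 0x98, (byte) 0x80 };
        for (int unit : readCodeUnits(supplementary)) {
            System.out.println(Integer.toHexString(unit)); // prints "d83d", then "de00"
        }
    }
}
```

In the second case neither returned value is the code point itself, which is why code that handles arbitrary text should not treat a single char as a whole character.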
