
The difference between InputStream and InputStreamReader when reading multi-byte characters

The difference between InputStream and InputStreamReader is that InputStream reads bytes, while InputStreamReader reads chars. For example, if the text in a file is abc, then both of them work fine. But if the text is a你们, which is composed of an a and two Chinese characters, then reading it byte by byte with InputStream does not give the correct characters.

So we should use InputStreamReader, but my question is:

How does InputStreamReader recognize characters?

a is one byte, but a Chinese character is two bytes. Does it read a as one byte and recognize each of the other characters as two bytes, or does the InputStreamReader read every character in this text as two bytes?

An InputStream reads raw octets (8-bit data). In Java, the byte type is roughly equivalent to the char type in C. In C, this type can be used to represent character data or binary data. In Java, the char type has more in common with the C wchar_t type.

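An InputStreamReader then decodes data from some encoding into UTF-16. If "a你们" is encoded as UTF-8 on disk, it will be the byte sequence 61 E4 BD A0 E4 BB AC. When you pass the InputStream to an InputStreamReader with the UTF-8 encoding, it will be read as the char sequence 0061 4F60 4EEC. In other words, the decoder consumes one byte for a and three bytes for each of the two Chinese characters; how many bytes make up a character is determined by the encoding, not fixed by the reader.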
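The byte and char sequences above can be checked with a small sketch. The class name DecodeDemo and its helper methods are made up for this illustration, and a ByteArrayInputStream stands in for the file on disk; the raw stream yields seven values, the reader yields three.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class DecodeDemo {
    // UTF-8 bytes of "a你们": 61 E4 BD A0 E4 BB AC
    static final byte[] UTF8_BYTES = {
        0x61, (byte) 0xE4, (byte) 0xBD, (byte) 0xA0, (byte) 0xE4, (byte) 0xBB, (byte) 0xAC
    };

    // Reads the raw stream: each read() returns one byte as an int in 0..255.
    static String hexBytes(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (InputStream in = new ByteArrayInputStream(data)) {
            int b;
            while ((b = in.read()) != -1) {
                sb.append(String.format("%02X ", b));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString().trim();
    }

    // Reads through an InputStreamReader: each read() returns one UTF-16 code unit.
    static String hexChars(byte[] data) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append(String.format("%04X ", c));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hexBytes(UTF8_BYTES)); // 61 E4 BD A0 E4 BB AC
        System.out.println(hexChars(UTF8_BYTES)); // 0061 4F60 4EEC
    }
}
```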

The character encoding API in Java contains the algorithms to perform this transformation. You can find a list of encodings supported by the Oracle JRE here . The ICU project is a good place to start if you want to understand the internals of how this works in practice.

As Alexander Pogrebnyak points out, you should almost always provide the encoding explicitly. Byte-to-char methods that do not specify an encoding rely on the JRE default, which is dependent on operating systems and user settings.
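To see why the choice of encoding matters, here is a rough sketch that decodes the same seven bytes with two different charsets. It uses the String(byte[], Charset) constructor instead of an InputStreamReader only to keep it short; the class name CharsetMismatchDemo and the decode helper are invented for this example. ISO-8859-1 is used as a stand-in for "the wrong encoding" because it maps every byte to exactly one char.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any library.
class CharsetMismatchDemo {
    // UTF-8 bytes of "a你们": 61 E4 BD A0 E4 BB AC
    static final byte[] UTF8_BYTES = {
        0x61, (byte) 0xE4, (byte) 0xBD, (byte) 0xA0, (byte) 0xE4, (byte) 0xBB, (byte) 0xAC
    };

    // Decodes bytes to a String with an explicitly chosen charset.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        // What byte-to-char methods fall back to when no encoding is given;
        // the value printed here varies between machines.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Correct encoding: 3 chars ('a' plus two Chinese characters).
        String good = decode(UTF8_BYTES, StandardCharsets.UTF_8);
        System.out.println(good.length()); // 3

        // Wrong encoding: every byte becomes its own char, giving 7 chars of mojibake.
        String bad = decode(UTF8_BYTES, StandardCharsets.ISO_8859_1);
        System.out.println(bad.length()); // 7
    }
}
```

If the platform default happens to be UTF-8, code that omits the encoding appears to work, which is why such bugs often only show up on another machine.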

You have to give the reader a hint by providing the character set that your file is written in, e.g.:

Reader reader =
   new InputStreamReader(
       new FileInputStream( "/path/to/file" ),
       "UTF-8" // most likely the encoding of the file
   );

Without a hint it will use your platform default encoding, which in many cases is not what you want.

This link has a nice explanation of encodings: http://www.joelonsoftware.com/articles/Unicode.html

