The difference between InputStream and InputStreamReader when reading multi-byte characters

The difference between InputStream and InputStreamReader is that InputStream reads bytes, while InputStreamReader reads chars. For example, if the text in a file is abc, both work fine. But if the text is a你们, which consists of an a and two Chinese characters, then InputStream does not work.

So we should use InputStreamReader, but my question is:

How does InputStreamReader recognize characters?

a is one byte, but a Chinese character is two bytes. Does it read a as one byte and recognize the other characters as two bytes each, or does the InputStreamReader read every character in this text as two bytes?

An InputStream reads raw octet (8-bit) data. In Java, the byte type is roughly equivalent to the char type in C (Java's byte is always signed, whereas the signedness of C's char is implementation-defined). In C, this type can be used to represent character data or binary data. In Java, the char type is closer to the C wchar_t type.

An InputStreamReader then transforms data from some encoding into UTF-16. If "a你们" is encoded as UTF-8 on disk, it will be the byte sequence 61 E4 BD A0 E4 BB AC. When you pass the InputStream to an InputStreamReader with the UTF-8 encoding, it will be read as the char sequence 0061 4F60 4EEC. In this encoding a takes one byte and each Chinese character takes three. The decoder works out how many bytes belong to each character from the rules of the encoding, so the reader does not consume a fixed number of bytes per character.
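The two sequences above can be reproduced with a small program. This is a minimal sketch, not part of the original answer: the class name DecodeDemo and its helper methods are made up for illustration, the bytes come from an in-memory ByteArrayInputStream rather than a file, and the Chinese characters are written as \u escapes so the result does not depend on the encoding of the source file.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Read every raw byte from the stream and format it as two hex digits.
    static String hexBytes(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int b;
        while ((b = in.read()) != -1) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    // Read every decoded char (a UTF-16 code unit) and format it as four hex digits.
    static String hexChars(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%04X", c));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // "a你们" encoded as UTF-8, standing in for the file contents.
        byte[] data = "a\u4F60\u4EEC".getBytes(StandardCharsets.UTF_8);

        // Prints: 61 E4 BD A0 E4 BB AC
        System.out.println(hexBytes(new ByteArrayInputStream(data)));

        // Prints: 0061 4F60 4EEC
        System.out.println(hexChars(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8)));
    }
}
```

Seven bytes go in and three chars come out, which is the byte-to-char transformation described above.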

The character encoding API in Java contains the algorithms to perform this transformation. You can find a list of encodings supported by the Oracle JRE here. The ICU project is a good place to start if you want to understand the internals of how this works in practice.
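To give a feel for how a decoder can tell where one character ends and the next begins, here is a simplified illustration for UTF-8 only. It is not the JDK's or ICU's actual implementation, and the class and method names (Utf8Length, sequenceLength, countSequences) are hypothetical. In UTF-8, the first byte of each sequence determines how many bytes the sequence occupies. This sketch checks only that lead byte and does not validate the continuation bytes.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Length {
    // Number of bytes in the UTF-8 sequence that starts with this lead byte,
    // or -1 if the byte cannot start a sequence (continuation or invalid byte).
    static int sequenceLength(byte lead) {
        int b = lead & 0xFF;                  // treat the signed byte as 0..255
        if (b < 0x80) return 1;               // 0xxxxxxx: ASCII
        if (b >= 0xC2 && b <= 0xDF) return 2; // 110xxxxx
        if (b >= 0xE0 && b <= 0xEF) return 3; // 1110xxxx
        if (b >= 0xF0 && b <= 0xF4) return 4; // 11110xxx
        return -1;
    }

    // Count encoded sequences by hopping from lead byte to lead byte.
    static int countSequences(byte[] utf8) {
        int count = 0;
        int i = 0;
        while (i < utf8.length) {
            int n = sequenceLength(utf8[i]);
            if (n < 0) throw new IllegalArgumentException("unexpected byte at index " + i);
            i += n;
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = "a\u4F60\u4EEC".getBytes(StandardCharsets.UTF_8);
        System.out.println(sequenceLength(data[0])); // 1 (the byte 61)
        System.out.println(sequenceLength(data[1])); // 3 (the byte E4)
        System.out.println(countSequences(data));    // 3
    }
}
```

Other encodings use different rules, which is why the reader needs to know which encoding the bytes are in.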

As Alexander Pogrebnyak points out, you should almost always provide the encoding explicitly. byte-to-char methods that do not specify an encoding rely on the JRE default, which depends on the operating system and user settings.
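A short sketch of why the choice of encoding matters: the same seven bytes decode to different text depending on the charset used. The class name CharsetMismatch and the decode helper are illustrative only, and ISO-8859-1 stands in for "some other encoding" that a platform default might happen to be.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatch {
    // Decode the bytes with an explicitly chosen charset.
    static String decode(byte[] data, Charset charset) {
        return new String(data, charset);
    }

    public static void main(String[] args) {
        byte[] data = "a\u4F60\u4EEC".getBytes(StandardCharsets.UTF_8);

        // What this JVM would use if no charset were given; varies by system.
        System.out.println("default charset: " + Charset.defaultCharset());

        // Correct charset: 3 chars, the original text.
        System.out.println(decode(data, StandardCharsets.UTF_8).length());      // 3

        // Wrong charset: every byte becomes its own char, so 7 chars of garbage.
        System.out.println(decode(data, StandardCharsets.ISO_8859_1).length()); // 7
    }
}
```

No exception is thrown in the mismatched case; the text is silently wrong.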

You have to give the reader a hint by providing the character set that your file is written in. E.g.:

import java.io.*;

// Decode the file's bytes as UTF-8 instead of the platform default.
Reader reader =
   new InputStreamReader(
       new FileInputStream( "/path/to/file" ),
       "UTF-8" // most likely the encoding of the file
   );

Without a hint it will use your platform default encoding, which in many cases is not what you want.
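For completeness, here is one way the snippet above might look as a full program. It is a sketch under assumptions: the class name FileReadDemo and the readAll helper are invented for this example, a temporary file stands in for /path/to/file, and the file is assumed to be UTF-8. It passes StandardCharsets.UTF_8 rather than the string "UTF-8", which avoids the checked UnsupportedEncodingException, and uses try-with-resources so the reader is closed.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FileReadDemo {
    // Read a whole file as text, decoding its bytes with the given charset.
    static String readAll(Path file, Charset charset) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new FileInputStream(file.toFile()), charset)) {
            char[] buffer = new char[1024];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Temporary file standing in for the real path.
        Path file = Files.createTempFile("demo", ".txt");
        try {
            Files.write(file, "a\u4F60\u4EEC".getBytes(StandardCharsets.UTF_8));
            String text = readAll(file, StandardCharsets.UTF_8);
            System.out.println(text.length()); // 3 chars read from 7 bytes
        } finally {
            Files.deleteIfExists(file);
        }
    }
}
```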

This link has a nice explanation of encodings: http://www.joelonsoftware.com/articles/Unicode.html
