The difference between InputStream and InputStreamReader when reading multi-byte characters
The difference between InputStream and InputStreamReader is that InputStream reads data as byte, while InputStreamReader reads it as char. For example, if the text in a file is abc, then both of them work fine. But if the text is a你们, which is composed of an a and two Chinese characters, then the InputStream does not work.
So we should use InputStreamReader, but my question is:
How does InputStreamReader recognize characters?
a is one byte, but a Chinese character is two bytes. Does it read a as one byte and recognize the other characters as two bytes, or does the InputStreamReader read every character in this text as two bytes?
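The failure described in the question can be reproduced with a short sketch. It assumes the text is stored as UTF-8 and uses an in-memory byte array in place of a real file; the class and method names are made up for illustration. Each value returned by InputStream.read() is a single byte, so casting it to char only gives the right text when every character fits in one byte.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class RawByteReadDemo {
    // Reads the stream one byte at a time and (wrongly) treats each byte as a char.
    static String readBytesAsChars(InputStream in) {
        StringBuilder sb = new StringBuilder();
        try {
            int b;
            while ((b = in.read()) != -1) {
                sb.append((char) b); // b is a raw byte value 0..255, not a decoded character
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumption: the file content is UTF-8. Unicode escapes keep this source ASCII-only;
        // U+4F60 and U+4EEC are the two Chinese characters from the question.
        String text = "a\u4F60\u4EEC";
        byte[] ascii = "abc".getBytes(StandardCharsets.UTF_8);
        byte[] mixed = text.getBytes(StandardCharsets.UTF_8);

        String asciiResult = readBytesAsChars(new ByteArrayInputStream(ascii));
        String mixedResult = readBytesAsChars(new ByteArrayInputStream(mixed));

        System.out.println(asciiResult.equals("abc")); // true: one byte per character
        System.out.println(mixedResult.equals(text));  // false: multi-byte characters are split
        System.out.println(mixedResult.length());      // 7: one char per byte, not per character
    }
}
```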
An InputStream reads raw octet (8-bit) data. In Java, the byte type is equivalent to the char type in C. In C, this type can be used to represent character data or binary data. In Java, the char type shares greater similarities with the C wchar_t type.
An InputStreamReader will then transform data from some encoding into UTF-16. If "a你们" is encoded as UTF-8 on disk, it will be the byte sequence 61 E4 BD A0 E4 BB AC. When you pass the InputStream to InputStreamReader with the UTF-8 encoding, it will be read as the char sequence 0061 4F60 4EEC.
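The transformation above can be checked with a small sketch. It feeds the exact byte sequence from this answer through an InputStreamReader and prints each resulting char as hex; the class and method names are hypothetical, and a byte array stands in for the file. In UTF-8 the leading byte of each sequence tells the decoder how many bytes belong to the character (61 starts a one-byte sequence, E4 starts a three-byte one), so the reader neither assumes one byte nor two bytes per character.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class Utf8DecodeDemo {
    // Decodes UTF-8 bytes through an InputStreamReader and returns the
    // resulting UTF-16 code units as space-separated 4-digit hex values.
    static String decodeToHex(byte[] utf8Bytes) {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(utf8Bytes), StandardCharsets.UTF_8)) {
            int c;
            while ((c = reader.read()) != -1) { // read() returns one char per call, or -1 at end
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(String.format("%04X", c));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The on-disk byte sequence quoted in the answer: 61 E4 BD A0 E4 BB AC
        byte[] onDisk = {
            0x61,
            (byte) 0xE4, (byte) 0xBD, (byte) 0xA0,
            (byte) 0xE4, (byte) 0xBB, (byte) 0xAC
        };
        System.out.println(decodeToHex(onDisk)); // 0061 4F60 4EEC
    }
}
```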
The character encoding API in Java contains the algorithms to perform this transformation. Oracle documents the list of encodings supported by its JRE. The ICU project is a good place to start if you want to understand the internals of how this works in practice.
As Alexander Pogrebnyak points out, you should almost always provide the encoding explicitly. byte-to-char methods that do not specify an encoding rely on the JRE default, which is dependent on operating systems and user settings.
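To see why the choice of encoding matters, here is a minimal sketch that decodes the same bytes under two different charsets. ISO-8859-1 is picked only as an example of a default a machine might have; the class and method names are hypothetical. Charset.defaultCharset() reports what the current JVM would use when no encoding is given.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DefaultCharsetDemo {
    // Decodes the given bytes with an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Unicode escapes for the text "a" + U+4F60 + U+4EEC, encoded as UTF-8 (7 bytes).
        String text = "a\u4F60\u4EEC";
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);

        // What this JVM would use if no charset were specified; varies by platform and settings.
        System.out.println("default charset: " + Charset.defaultCharset());

        String right = decode(utf8, StandardCharsets.UTF_8);
        String wrong = decode(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(right.equals(text)); // true
        System.out.println(right.length());     // 3 chars
        System.out.println(wrong.length());     // 7 chars: every byte became its own char
    }
}
```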
You have to give the reader a hint by providing the character set that your file is written in. E.g.:
Reader reader =
    new InputStreamReader(
        new FileInputStream("/path/to/file"),
        "UTF-8" // most likely the encoding of the file
    );
Without a hint it will use your platform default encoding, which in many cases is not what you want.
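On Java 7 and later, the same hint can be given with the StandardCharsets.UTF_8 constant instead of a string name, which avoids the checked UnsupportedEncodingException, and the reader can be closed automatically with try-with-resources. The sketch below is one way to do that, not the only one; it writes a temporary file so that it is self-contained, and the class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ExplicitCharsetRead {
    // Reads the first line of a file, decoding its bytes explicitly as UTF-8.
    static String readFirstLine(Path file) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes text to a temporary UTF-8 file, reads it back, then deletes the file.
    static String writeThenRead(String text) {
        try {
            Path tmp = Files.createTempFile("charset-demo", ".txt");
            try {
                Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
                return readFirstLine(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "a\u4F60\u4EEC"; // "a" followed by U+4F60 and U+4EEC
        System.out.println(writeThenRead(text).equals(text)); // true
    }
}
```

If the file was actually written in another encoding (GBK, for instance), the charset passed to InputStreamReader has to match that encoding instead.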
This link has a nice explanation of encodings: http://www.joelonsoftware.com/articles/Unicode.html