
Java's inconsistent behavior when handling a UTF-8 string with a BOM

I open Windows Notepad, type 18, and save the file with UTF-8 encoding. I know that the file will have a BOM header, so it is a UTF-8 encoded file with a BOM.

The problem is that when I print that string with the code below:

// str is the string read from the file using the StandardCharsets.UTF_8 encoding
System.out.println(str);

On Windows I get:

?18

But on Linux I get:

18

So why does Java behave differently? How should I understand this?

The BOM is the character U+FEFF (zero-width no-break space), so it is invisible in principle.

However, the Windows console by default does not use UTF-8 but one of the many single-byte encodings (code pages). When the String is converted for output, the BOM character, which does not exist in that charset, is turned into a question mark.

Notepad, on the other hand, will recognize the BOM and display the text as UTF-8.

Linux nowadays generally uses UTF-8, so there is no problem, including in the console.


Further explanation

On Windows, System.out writes to the console, and that console uses a charset/encoding such as Cp850 (IBM850), a single-byte charset of 256 characters. Characters such as ĉ or the BOM character may well be missing from it. If a Java String contains such characters, they cannot be encoded to one of the 256 available characters, so they are converted to a ?.
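The replacement described above can be reproduced on any platform. The following is only a rough sketch: the class and method names are made up for illustration, and ISO-8859-1 is used as an assumed stand-in for a single-byte console charset, because it is guaranteed to be available in every JVM (the actual console code page, such as Cp850, may differ).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class BomReplacementDemo {
    // Encodes the text with the given charset and decodes it again, which
    // mimics what a console using that charset would end up displaying.
    public static String roundTrip(String text, Charset charset) {
        // String.getBytes(Charset) replaces unmappable characters with the
        // charset's default replacement byte ('?' for ISO-8859-1).
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String str = "\uFEFF18"; // BOM followed by "18", as read from the file

        // ISO-8859-1 is an assumption standing in for the console code page.
        System.out.println(roundTrip(str, StandardCharsets.ISO_8859_1)); // ?18

        // UTF-8 can represent U+FEFF, so the string survives unchanged.
        System.out.println(roundTrip(str, StandardCharsets.UTF_8).equals(str)); // true
    }
}
```

With a single-byte charset the BOM comes back as a visible question mark, while with UTF-8 it stays an invisible character.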

You can check this with a CharsetEncoder:

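import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

String s = ...
// Encoder for the JVM's default charset
CharsetEncoder encoder = Charset.defaultCharset().newEncoder();
if (!encoder.canEncode(s)) {
    // At least one character of s has no representation in that charset
    System.out.println("A problem");
}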
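A self-contained variant of that check might look like the sketch below. The charset is passed in explicitly instead of relying on the default, so the result does not depend on the machine; the class name and the choice of ISO-8859-1 as the single-byte example are assumptions for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CanEncodeDemo {
    // Returns true if every character of s can be represented in the charset.
    public static boolean canEncode(String s, Charset charset) {
        // A fresh encoder per call, since CharsetEncoder instances are stateful.
        return charset.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        String withBom = "\uFEFF18";

        // ISO-8859-1 has no mapping for U+FEFF.
        System.out.println(canEncode(withBom, StandardCharsets.ISO_8859_1)); // false

        // UTF-8 can encode every Unicode code point, including the BOM.
        System.out.println(canEncode(withBom, StandardCharsets.UTF_8)); // true
    }
}
```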

Windows itself generally also runs on a single-byte encoding, such as Cp1252 (windows-1252). Again, that is 256 characters. However, editors may handle several encodings, and if the font can represent the character (Unicode code point), everything works.

Java's behavior is the same on both systems; FileInputStream simply does not handle the BOM.

On Windows, your file is file1; its hex content is EF BB BF 31 38

On Linux, your file is file2; its hex content is 31 38

When you read them, you get different strings.
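To illustrate, here is a small sketch that decodes the two byte sequences listed above as UTF-8. The byte arrays stand in for the contents of file1 and file2; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class BomDecodeDemo {
    // Decodes raw bytes as UTF-8, the same way the question reads the file.
    public static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] file1 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x31, 0x38}; // with BOM
        byte[] file2 = {0x31, 0x38};                                        // without BOM

        String s1 = decodeUtf8(file1);
        String s2 = decodeUtf8(file2);

        // Java's UTF-8 decoder keeps the BOM as the character U+FEFF.
        System.out.println(s1.length());              // 3
        System.out.println(s1.charAt(0) == '\uFEFF'); // true
        System.out.println(s2.length());              // 2
        System.out.println(s1.equals(s2));            // false
    }
}
```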

I recommend converting your file with a BOM to a file without a BOM using Notepad++.

Or you can use BOMInputStream (from Apache Commons IO).
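If adding a library is not an option, the BOM can also be removed by hand after decoding. This is only a minimal sketch in plain Java, not how BOMInputStream works internally; the class and method names are invented for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomStripDemo {
    // Removes a single leading U+FEFF if present; otherwise returns s unchanged.
    public static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    // Reads a whole file as UTF-8 and drops the BOM, if there is one.
    public static String readWithoutBom(Path path) throws IOException {
        byte[] bytes = Files.readAllBytes(path);
        return stripBom(new String(bytes, StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(stripBom("\uFEFF18")); // 18
        System.out.println(stripBom("18"));       // 18
    }
}
```

With this, the string read from the Notepad file prints as 18 on both systems, because the BOM never reaches System.out.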
