Java behaves inconsistently when handling a UTF-8 string with a BOM

Question

I open Windows Notepad, enter 18, and save the file with UTF-8 encoding. I know that the file will have a BOM header, so it is a UTF-8 encoded file with a BOM.

The problem is that when I print that string with the code below:

// str is the string read from the file using StandardCharsets.UTF_8
System.out.println(str);

On Windows I got:

?18

But on Linux I got:

So why does Java behave differently? How should I understand this?

Answer 1

A BOM is the character U+FEFF, a zero-width no-break space, so it is invisible in principle.

However, Windows does not use UTF-8 by default but one of the many single-byte encodings. The conversion from String to the output turns the BOM, which does not exist in that charset, into a question mark.

Still, Notepad will recognize the BOM and display the text as UTF-8.

Linux nowadays generally uses UTF-8, so it has no such problem, not even in the console.

Further explanation

On Windows, System.out writes to the console, and that console uses a charset/encoding such as Cp850, a single-byte charset of 256 characters. Characters such as ĉ or the BOM character may well be missing from it. If a Java String contains such characters, they cannot be encoded to one of the 256 available characters, so they are converted to a ?.
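The replacement can be reproduced without a Windows console. This is a minimal sketch, assuming ISO-8859-1 as a stand-in for the console's single-byte charset (every JVM must support it, whereas Cp850 may not be installed). The class and method names are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomReplacementDemo {
    // Encode the text into the given charset, then decode the bytes again.
    // This roughly mimics the path from System.out to what a console displays.
    static String throughCharset(String text, Charset charset) {
        // getBytes replaces unmappable characters with the charset's replacement byte
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String str = "\uFEFF18"; // what UTF-8 decoding of the Notepad file yields

        // ISO-8859-1 (assumed stand-in for Cp850/Cp1252) has no U+FEFF
        String viaSingleByte = throughCharset(str, StandardCharsets.ISO_8859_1);
        System.out.println(viaSingleByte.equals("?18")); // true

        // UTF-8 can encode U+FEFF, so the string survives unchanged
        String viaUtf8 = throughCharset(str, StandardCharsets.UTF_8);
        System.out.println(viaUtf8.equals(str)); // true
    }
}
```

With a UTF-8 console the BOM is written as is and simply takes up no visible width, which matches the Linux behavior described in the question.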

Using a CharsetEncoder:

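import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

String s = ...
// Encoder for the JVM's default charset
CharsetEncoder encoder = Charset.defaultCharset().newEncoder();
if (!encoder.canEncode(s)) {
    // s contains a character this charset cannot represent
    System.out.println("A problem");
}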
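For a concrete run of the same check, here is a small sketch that passes explicit charsets instead of the platform default, so the result does not depend on the machine. ISO-8859-1 is again an assumed stand-in for a Windows single-byte code page, and `CanEncodeDemo` and `encodable` are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CanEncodeDemo {
    // True if every character of text can be represented in the charset
    static boolean encodable(String text, Charset charset) {
        return charset.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        String s = "\uFEFF18"; // BOM followed by the digits 1 and 8

        System.out.println(encodable(s, StandardCharsets.ISO_8859_1));    // false
        System.out.println(encodable(s, StandardCharsets.UTF_8));         // true
        System.out.println(encodable("18", StandardCharsets.ISO_8859_1)); // true
    }
}
```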

Windows itself generally also runs on a single-byte encoding, such as Cp1252. Again 256 characters. However, editors may deal with several encodings, and if the font can represent the character (Unicode code point), then everything works.

Answer 2

Java's behavior is the same on both platforms; FileInputStream does not handle the BOM.

On Windows, your file is file1, and its hex content is EF BB BF 31 38.

On Linux, your file is file2, and its hex content is 31 38.

When you read them, you get different strings.
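The difference can be checked by decoding the two byte sequences directly. This is a sketch with the file contents hard-coded as byte arrays rather than read from disk; the class and method names are illustrative.

```java
import java.nio.charset.StandardCharsets;

class BomDecodeDemo {
    // Decode raw bytes as UTF-8; Java's decoder keeps a leading BOM as U+FEFF
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] file1 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x31, 0x38}; // with BOM
        byte[] file2 = {0x31, 0x38};                                        // without BOM

        String s1 = decodeUtf8(file1);
        String s2 = decodeUtf8(file2);

        System.out.println(s1.length());              // 3
        System.out.println(s2.length());              // 2
        System.out.println(s1.charAt(0) == '\uFEFF'); // true
    }
}
```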

I recommend converting your BOM file to a file without a BOM using Notepad++.

Or you can use BOMInputStream (from Apache Commons IO).
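If adding a library is not an option, a leading BOM can also be dropped after decoding. The following is a plain-Java sketch of that alternative, not the BOMInputStream approach itself; `stripBom` is a hypothetical helper name.

```java
class BomStripDemo {
    // Remove a single leading U+FEFF if present; otherwise return the text unchanged
    static String stripBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }

    public static void main(String[] args) {
        System.out.println(stripBom("\uFEFF18")); // 18
        System.out.println(stripBom("18"));       // 18
    }
}
```

Applied to the string read from the file, this gives the same result on both platforms regardless of the console charset.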


Answer 1: 2, accepted, 2019-03-21 05:09:04

Answer 2: 0, 2019-03-21 05:01:18
