Reading UTF-8 - BOM marker

Question

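I'm reading a file through a FileReader. The file is UTF-8 encoded (with BOM). My problem: I read the file and output a string, but unfortunately the BOM marker is output too. Why does this happen?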

fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text;
    text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}

Output after the first line:

?<style>

Answer 1

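In Java, you have to consume the UTF-8 BOM manually if it is present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now because it would break existing tools such as JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.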

Take a look at this solution: Handle UTF8 file with BOM
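Consuming the BOM by hand can also be done with the standard library alone. The following is a minimal sketch, assuming the file is known to be UTF-8; the class and method names (BomAwareReader, openSkippingBom, firstLine) are hypothetical and exist only for this illustration.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the JDK or of any answer in this thread.
class BomAwareReader {

    // Wraps the stream in a UTF-8 reader and consumes a leading U+FEFF if there is one.
    static BufferedReader openSkippingBom(InputStream in) throws IOException {
        BufferedReader reader =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        reader.mark(1);                  // remember the start of the stream
        if (reader.read() != '\uFEFF') {
            reader.reset();              // first char was real content (or EOF): rewind
        }
        return reader;
    }

    // Convenience for demos: decodes a byte array and returns its first line.
    static String firstLine(byte[] utf8Bytes) {
        try (BufferedReader reader = openSkippingBom(new ByteArrayInputStream(utf8Bytes))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '<', 's', '>'};
        System.out.println(firstLine(withBom)); // prints "<s>"
    }
}
```

For a real file, pass a FileInputStream to openSkippingBom instead of the in-memory stream used here.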

Answer 2

Strip the decoded BOM character (U+FEFF) from the string:

tmp = tmp.replace("\uFEFF", "");
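Note that replace removes every U+FEFF in the string, not only the one at the start. A narrower variant, sketched here under the assumption that it is applied to the first line read from the file, strips the character only when it is the first one. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration only.
class BomStrip {

    // Returns the line without a leading U+FEFF; any other occurrence is left untouched.
    static String stripLeadingBom(String line) {
        if (line != null && !line.isEmpty() && line.charAt(0) == '\uFEFF') {
            return line.substring(1);
        }
        return line;
    }

    public static void main(String[] args) {
        System.out.println(stripLeadingBom("\uFEFF<style>")); // prints "<style>"
    }
}
```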

Answer 3

Use the Apache Commons library.

Class: org.apache.commons.io.input.BOMInputStream

Example usage:

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    //use reader
} finally {
    inputStream.close();
}

Answer 4

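This is how I use the Apache BOMInputStream, with a try-with-resources block. The "false" argument tells the object to exclude the following BOMs from the stream (we use "BOM-less" text files for safety reasons, haha):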

// note: InputStreamReader is created without a charset, so the platform default is used
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(new BOMInputStream(new FileInputStream(file), false,
                ByteOrderMark.UTF_8,
                ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
                ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE)))) {
    // use br here
} catch (Exception e) {
    // handle the exception
}

Answer 5

Consider Google's UnicodeReader, which does all this work for you.

Charset utf8 = Charset.forName("UTF-8"); // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8)) {
    ....
}

Maven dependency:

<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>

Answer 6

Use Apache Commons IO.

For example, let's look at the code below (used for reading a text file containing both Latin and Cyrillic characters):

String defaultEncoding = "UTF-16";
List<String> ari = new ArrayList<>();
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));

BOMInputStream bomInputStream = new BOMInputStream(inputStream);

ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
int data = reader.read();
while (data != -1) {
    char theChar = (char) data;
    data = reader.read();
    ari.add(Character.toString(theChar));
}
reader.close();

As a result, we have an ArrayList named "ari" that contains all the characters from the file "1.txt" except the BOM.

Answer 7

In case somebody wants to do it with the standard library, this would be one way:

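import java.math.BigInteger;
import java.nio.charset.StandardCharsets;

public static String cutBOM(String value) {
    // too short to carry a UTF-8 BOM
    if (value.length() < 3)
        return value;
    // Assumes the text was decoded one byte per char (e.g. as ISO-8859-1),
    // so each BOM byte shows up as its own char.
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    String bom = String.format("%06x",
            new BigInteger(1, value.substring(0, 3).getBytes(StandardCharsets.ISO_8859_1)));
    if (bom.equals("efbbbf"))
        // UTF-8
        return value.substring(3);
    else if (bom.startsWith("feff") || bom.startsWith("fffe"))
        // UTF-16BE or UTF-16LE
        return value.substring(2);
    else
        return value;
}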
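Checking the raw bytes before decoding avoids depending on how the string happened to be decoded. The following is a minimal sketch of that idea using only the standard library; BomBytes and its methods are hypothetical names, and UTF-32 BOMs are deliberately left out.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not taken from any answer in this thread.
class BomBytes {

    // Decodes data, using a leading UTF-8 or UTF-16 BOM to pick the charset and
    // dropping the BOM bytes. Falls back to fallbackCharset when no BOM is found.
    // UTF-32 BOMs are not handled in this sketch.
    static String decode(byte[] data, String fallbackCharset) {
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) {
            return new String(data, 3, data.length - 3, StandardCharsets.UTF_8);
        }
        if (startsWith(data, 0xFE, 0xFF)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16BE);
        }
        if (startsWith(data, 0xFF, 0xFE)) {
            return new String(data, 2, data.length - 2, StandardCharsets.UTF_16LE);
        }
        return new String(data, Charset.forName(fallbackCharset));
    }

    // True when data begins with the given unsigned byte values.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        System.out.println(decode(utf8, "UTF-8")); // prints "hi"
    }
}
```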

Answer 8

As mentioned here, this is usually a problem with files on Windows.

One possible solution is to run the file through a tool like dos2unix first.

Answer 9

The easiest way I found to bypass the BOM:

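// note: fis is decoded with the platform default charset here
BufferedReader br = new BufferedReader(new InputStreamReader(fis));
while ((currentLine = br.readLine()) != null) {
    // remove the UTF-8 BOM if present; this literal only matches when the bytes
    // were decoded with a single-byte charset such as windows-1252
    currentLine = currentLine.replace("ï»¿", "");
}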
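Whether the BOM shows up as the three characters "ï»¿" or as a single U+FEFF depends on the charset used for decoding. The sketch below illustrates this with the three UTF-8 BOM bytes; BomMojibake and decodeBom are hypothetical names used only here.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, for illustration only.
class BomMojibake {

    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    // Decodes the three UTF-8 BOM bytes with the named charset.
    static String decodeBom(String charsetName) {
        return new String(UTF8_BOM, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // Single-byte charset: each BOM byte becomes its own character.
        System.out.println(decodeBom("ISO-8859-1").length()); // prints 3
        // UTF-8: the three bytes decode to the single character U+FEFF.
        System.out.println(decodeBom("UTF-8").length());      // prints 1
    }
}
```

So a replace on "ï»¿" matches nothing when the reader decodes as UTF-8; in that case the character to remove is "\uFEFF".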

9 answers

Answer 1  89  accepted  2011-02-04 12:32:03
Answer 2  45  2011-02-04 12:32:16
Answer 3  39  2012-12-21 10:26:54
Answer 4  9  2016-05-25 19:25:58
Answer 5  8  2018-02-12 15:03:46
Answer 6  6  2017-07-01 15:22:13
Answer 7  3  2019-03-20 16:03:28
Answer 8  2  2017-02-26 22:43:50
Answer 9  1  2017-10-26 06:25:49
