
Reading UTF-8 - BOM marker

I'm reading a file through a FileReader. The file is UTF-8 encoded (with BOM), and my problem is that when I read the file and output a string, the BOM marker is output too. Why does this occur?

fr = new FileReader(file);
br = new BufferedReader(fr);
String tmp = null;
while ((tmp = br.readLine()) != null) {
    String text = new String(tmp.getBytes(), "UTF-8");
    content += text + System.getProperty("line.separator");
}

Output after the first line:

?<style>

In Java, you have to consume the UTF-8 BOM manually if it is present. This behaviour is documented in the Java bug database, here and here. There will be no fix for now, because a change would break existing tools like JavaDoc or XML parsers. Apache Commons IO provides a BOMInputStream to handle this situation.
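Consuming the BOM manually can also be done with only the JDK. The sketch below (the helper name `skipUtf8Bom` is illustrative, not a standard API) peeks at the first three bytes through a PushbackInputStream and pushes them back if they are not a UTF-8 BOM:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;

public class BomSkipper {
    // Wraps the stream so that a leading EF BB BF (UTF-8 BOM) is consumed if present.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pb.read(head, 0, 3);
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pb.unread(head, 0, n); // no BOM: put the bytes back for the caller
        }
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};
        InputStream in = skipUtf8Bom(new ByteArrayInputStream(withBom));
        System.out.println(new String(in.readAllBytes(), "UTF-8")); // prints "hi"
    }
}
```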

Take a look at this solution: Handle UTF8 file with BOM

["

The easiest fix is probably just to remove the resulting \uFEFF from the string, since it is extremely unlikely to appear for any other reason.

tmp = tmp.replace("\uFEFF", "");
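Applied to a line-by-line read loop, that one-liner looks like this in context; a minimal self-contained sketch (the file is simulated with a StringReader so it runs as-is):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class StripBomDemo {
    // U+FEFF is what a UTF-8 BOM decodes to; stripping it from text is harmless.
    static String stripBom(String line) {
        return line.replace("\uFEFF", "");
    }

    public static void main(String[] args) throws IOException {
        // Simulate a file whose first line carries a decoded BOM.
        BufferedReader br = new BufferedReader(new StringReader("\uFEFF<style>\nbody"));
        StringBuilder content = new StringBuilder();
        String line;
        while ((line = br.readLine()) != null) {
            content.append(stripBom(line)).append('\n');
        }
        System.out.print(content); // first line is now just "<style>"
    }
}
```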

Use the Apache Commons IO library.

Class: org.apache.commons.io.input.BOMInputStream

Example usage:

String defaultEncoding = "UTF-8";
InputStream inputStream = new FileInputStream(someFileWithPossibleUtf8Bom);
try {
    BOMInputStream bOMInputStream = new BOMInputStream(inputStream);
    ByteOrderMark bom = bOMInputStream.getBOM();
    String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
    InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bOMInputStream), charsetName);
    //use reader
} finally {
    inputStream.close();
}

Here's how I use the Apache BOMInputStream; it uses a try-with-resources block. The "false" argument tells the object to exclude the listed BOMs from the stream (we use "BOM-less" text files for safety reasons, haha):

try( BufferedReader br = new BufferedReader(
    new InputStreamReader( new BOMInputStream( new FileInputStream(
        file), false, ByteOrderMark.UTF_8,
        ByteOrderMark.UTF_16BE, ByteOrderMark.UTF_16LE,
        ByteOrderMark.UTF_32BE, ByteOrderMark.UTF_32LE ) ) ) )
{
    // use br here

} catch( Exception e )
{
    // handle the exception
}

Consider UnicodeReader from Google, which does all this work for you.

Charset utf8 = Charset.forName("UTF-8"); // default if no BOM present
try (Reader r = new UnicodeReader(new FileInputStream(file), utf8)) {
    ....
}

Maven dependency:

<dependency>
    <groupId>com.google.gdata</groupId>
    <artifactId>core</artifactId>
    <version>1.47.1</version>
</dependency>

Use Apache Commons IO.

For example, let's take a look at my code (used for reading a text file with both Latin and Cyrillic characters) below:

String defaultEncoding = "UTF-16";
InputStream inputStream = new FileInputStream(new File("/temp/1.txt"));

BOMInputStream bomInputStream = new BOMInputStream(inputStream);

ByteOrderMark bom = bomInputStream.getBOM();
String charsetName = bom == null ? defaultEncoding : bom.getCharsetName();
InputStreamReader reader = new InputStreamReader(new BufferedInputStream(bomInputStream), charsetName);
int data = reader.read();
while (data != -1) {

 char theChar = (char) data;
 data = reader.read();
 ari.add(Character.toString(theChar));
}
reader.close();

As a result we have an ArrayList named "ari" with all characters from the file "1.txt" except the BOM.

If somebody wants to do it with the standard library only, this would be a way:

public static String cutBOM(String value) {
    // UTF-8 BOM is EF BB BF, see https://en.wikipedia.org/wiki/Byte_order_mark
    String bom = String.format("%x", new BigInteger(1, value.substring(0, 3).getBytes()));
    if (bom.startsWith("efbbbf"))
        // UTF-8
        return value.substring(3);
    else if (bom.startsWith("feff") || bom.startsWith("fffe"))
        // UTF-16BE or UTF-16LE
        return value.substring(2);
    else
        return value;
}
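Note that once a file has been decoded with the correct charset, any BOM surfaces as the single character U+FEFF at the start of the string, so an alternative standard-library sketch (a simpler check than the byte-level approach above, not a drop-in replacement for it) could be:

```java
public class BomCut {
    // A correctly decoded BOM is exactly one char: U+FEFF.
    static String cutBom(String value) {
        if (!value.isEmpty() && value.charAt(0) == '\uFEFF') {
            return value.substring(1);
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(BomCut.cutBom("\uFEFFhello")); // prints "hello"
    }
}
```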

It's mentioned here that this is usually a problem with files created on Windows.

One possible solution would be to run the file through a tool like dos2unix first.

The easiest way I found to bypass the BOM:

BufferedReader br = new BufferedReader(new InputStreamReader(fis));
String currentLine;
while ((currentLine = br.readLine()) != null) {
    // remove the UTF-8 BOM if present
    currentLine = currentLine.replace("\uFEFF", "");
}
