How to Remove BOM from an XML file in Java

I need advice on ways to remove the BOM from a UTF-8 file and create a copy of the rest of the XML file.

In my experience, having a tool break because of a BOM in a UTF-8 file is very common. I don't know why there were so many downvotes (but then it gives me the chance to try to get enough votes to win a special SO badge ;)

More seriously: a UTF-8 BOM typically doesn't make much sense, but it is fully valid (although discouraged) according to the specs. The problem is that a lot of people aren't aware that a BOM is valid in UTF-8, and hence wrote broken tools / APIs that do not process these files correctly.

You may have two different issues: you may want to process the file from Java, or you may need to use Java to programmatically create/fix files that other (broken) tools need.
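For the first case (processing the file from Java), it helps to know that Java's own UTF-8 decoder does not drop the BOM: it shows up as the character U+FEFF at the start of the decoded text. The following is a minimal sketch of handling that at the character level. The class and method names (BomText, decodeUtf8, stripLeadingBom) are made up for illustration and are not part of any library.

```java
import java.nio.charset.StandardCharsets;

final class BomText {
    /** Decodes UTF-8 bytes; Java's decoder keeps a leading BOM as the character U+FEFF. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    /** Drops a single leading U+FEFF if present; otherwise returns the text unchanged. */
    static String stripLeadingBom(String text) {
        return text.startsWith("\uFEFF") ? text.substring(1) : text;
    }
}
```

This only helps when you control the code that reads the text. It does nothing for files that are handed to other tools as raw bytes.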

I've had a case in one consulting gig where the helpdesk kept getting messages from users who had problems with a text editor that would mess up perfectly valid UTF-8 files produced by Java. So I had to work around that issue by making sure to remove the BOM from every single UTF-8 file we were dealing with.

If you want to delete a BOM from a file, you can create a new file and skip the first three bytes. For example:

... $  file  /tmp/src.txt 
/tmp/src.txt: UTF-8 Unicode (with BOM) English text

... $  ls -l  /tmp/src.txt 
-rw-rw-r-- 1 tact tact 1733 2012-03-16 14:29 /tmp/src.txt

... $  hexdump  -C  /tmp/src.txt | head -n 1
00000000  ef bb bf 50 6f 6b 65 ...

As you can see, the file starts with "ef bb bf", which is the (fully valid) UTF-8 BOM.

Here's a method that takes a file and makes a copy of it, skipping the first three bytes:

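import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.channels.FileChannel;

public static void workAroundbrokenToolsAndAPIs(File sourceFile, File destFile) throws IOException {
    // try-with-resources closes both channels even if the copy fails;
    // FileOutputStream creates destFile if it does not exist yet
    try (FileChannel source = new FileInputStream(sourceFile).getChannel();
         FileChannel destination = new FileOutputStream(destFile).getChannel()) {
        // Start after the three BOM bytes (EF BB BF); the caller must know they are there
        long position = 3;
        long size = source.size();
        // transferTo may copy fewer bytes than requested, so loop until everything is copied
        while (position < size) {
            position += source.transferTo(position, size - position, destination);
        }
    }
}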

Note that it's "raw": you'd typically want to first make sure the file actually has a BOM before calling this, or "Bad Things May Happen" [TM].
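One way to do that check, as a minimal sketch: read the first three bytes of the file and compare them with EF BB BF. The class and method names here (BomCheck, hasUtf8Bom) are hypothetical helpers, not part of the original answer.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class BomCheck {
    /** Returns true if the file starts with the UTF-8 BOM bytes EF BB BF. */
    static boolean hasUtf8Bom(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] head = new byte[3];
            int total = 0;
            // read() may return fewer bytes than requested, so loop until 3 bytes or EOF
            while (total < 3) {
                int n = in.read(head, total, 3 - total);
                if (n < 0) {
                    break;
                }
                total += n;
            }
            return total == 3
                    && head[0] == (byte) 0xEF
                    && head[1] == (byte) 0xBB
                    && head[2] == (byte) 0xBF;
        }
    }
}
```

A caller would then run the copy method only when BomCheck.hasUtf8Bom(sourceFile.toPath()) returns true, and otherwise do a plain copy.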

You can look at your file afterwards:

... $  file  /tmp/dst.txt 
/tmp/dst.txt: UTF-8 Unicode English text

... $  ls -l  /tmp/dst.txt 
-rw-rw-r-- 1 tact tact 1730 2012-03-16 14:41 /tmp/dst.txt

... $  hexdump -C /tmp/dst.txt
00000000  50 6f 6b 65 ...

And the BOM is gone...

Now if you simply want to transparently remove the BOM for one of your broken Java APIs, you could use the PushbackInputStream described here: why org.apache.xerces.parsers.SAXParser does not skip BOM in utf8 encoded xml?

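// Needs java.io.BufferedInputStream, IOException, InputStream and PushbackInputStream
private static InputStream checkForUtf8BOMAndDiscardIfAny(InputStream inputStream) throws IOException {
    PushbackInputStream pushbackInputStream = new PushbackInputStream(new BufferedInputStream(inputStream), 3);
    byte[] bom = new byte[3];
    // read() may return fewer than 3 bytes on a short stream, so remember how many we got
    int read = pushbackInputStream.read(bom);
    if (read > 0) {
        boolean hasBom = read == 3
                && bom[0] == (byte) 0xEF && bom[1] == (byte) 0xBB && bom[2] == (byte) 0xBF;
        if (!hasBom) {
            // Not a BOM: push back exactly the bytes that were read
            pushbackInputStream.unread(bom, 0, read);
        }
    }
    return pushbackInputStream;
}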

Note that this works, but it will definitely NOT fix the more serious issue where other tools in the work chain do not work correctly with UTF-8 files that have a BOM.
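Since the question is about XML, here is a hedged usage sketch of the stream-wrapping approach: it drops a leading BOM and hands the stream to the JDK's standard DOM parser. The class and method names (BomXmlDemo, skipUtf8Bom, rootName) are invented for this example. The skipping logic is repeated inside the class only so the snippet compiles on its own.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class BomXmlDemo {
    /** Peeks at up to 3 bytes and pushes them back unless they are the UTF-8 BOM. */
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pushback = new PushbackInputStream(new BufferedInputStream(in), 3);
        byte[] head = new byte[3];
        int n = pushback.read(head);
        boolean hasBom = n == 3
                && head[0] == (byte) 0xEF && head[1] == (byte) 0xBB && head[2] == (byte) 0xBF;
        if (n > 0 && !hasBom) {
            // Not a BOM: return exactly the bytes that were consumed
            pushback.unread(head, 0, n);
        }
        return pushback;
    }

    /** Parses the XML bytes after dropping any leading BOM and returns the root element name. */
    static String rootName(byte[] xmlBytes) throws Exception {
        try (InputStream in = skipUtf8Bom(new ByteArrayInputStream(xmlBytes))) {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
            return doc.getDocumentElement().getNodeName();
        }
    }
}
```

In real code the ByteArrayInputStream would be a FileInputStream or whatever stream the broken API expects. The wrapper changes nothing for input that has no BOM.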

And here's a link to a question with a more complete answer, covering other encodings as well:

Byte order mark screws up file reading in Java
