
How can I identify the encoding of a file that has no BOM and begins with non-ASCII characters?

I ran into a problem when trying to identify the encoding of a file without a BOM, particularly when the file begins with non-ASCII characters.

I found the following two topics about how to identify encodings for files:

Currently, I have created a class that identifies different encodings for files (e.g. UTF-8, UTF-16, UTF-32, UTF-16 without a BOM, etc.), like the following:

public class UnicodeReader extends Reader {
private static final int BOM_SIZE = 4;
private final InputStreamReader reader;

/**
 * Construct UnicodeReader
 * @param in Input stream.
 * @param defaultEncoding Default encoding to be used if BOM is not found,
 * or <code>null</code> to use system default encoding.
 * @throws IOException If an I/O error occurs.
 */
public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
    byte bom[] = new byte[BOM_SIZE];
    String encoding;
    int unread;
    PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
    int n = pushbackStream.read(bom, 0, bom.length);

    // Read ahead four bytes and check for BOM marks.
    if ((bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
        encoding = "UTF-8";
        unread = n - 3;
    } else if ((bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
        encoding = "UTF-16BE";
        unread = n - 2;
    } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
        encoding = "UTF-16LE";
        unread = n - 2;
    } else if ((bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
        encoding = "UTF-32BE";
        unread = n - 4;
    } else if ((bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
        encoding = "UTF-32LE";
        unread = n - 4;
    } else {
        // No BOM detected but still could be UTF-16
        int found = 0;
        for (int i = 0; i < 4; i++) {
            if (bom[i] == (byte) 0x00)
                found++;
        }

        if(found >= 2) {
            if(bom[0] == (byte) 0x00){
                encoding = "UTF-16BE";
            }
            else {
                encoding = "UTF-16LE";
            }
            unread = n;
        }
        else {
            encoding = defaultEncoding;
            unread = n;
        }
    }

    // Push back any non-BOM bytes so the reader sees them; the BOM itself is skipped.
    if (unread > 0) {
        pushbackStream.unread(bom, (n - unread), unread);
    }

    // Use given encoding.
    if (encoding == null) {
        reader = new InputStreamReader(pushbackStream);
    } else {
        reader = new InputStreamReader(pushbackStream, encoding);
    }
}

public String getEncoding() {
    return reader.getEncoding();
}

public int read(char[] cbuf, int off, int len) throws IOException {
    return reader.read(cbuf, off, len);
}

public void close() throws IOException {
    reader.close();
}

}

The above code works properly in all cases except when a file has no BOM and begins with non-ASCII characters. Under that circumstance, the logic that checks whether the file might still be BOM-less UTF-16 does not work correctly, and the encoding falls back to the default (UTF-8).

Is there a way to detect the encoding of a file that has no BOM and begins with non-ASCII characters, especially a BOM-less UTF-16 file?

Thanks; any idea would be appreciated.

Generally speaking, there is no way to know the encoding for sure if it is not provided.

You may guess UTF-8 by the specific bit patterns in the text (multi-byte sequences have a characteristic shape: a lead byte whose high bits are 11, followed by continuation bytes whose high bits are 10), but it is still a guess.
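That pattern check does not need to be hand-rolled: the JDK's own decoder rejects malformed sequences when configured to report errors. A minimal sketch (the class and method names here are mine, not from the question):

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Returns true if the bytes form a well-formed UTF-8 sequence.
    // A clean decode is consistent with UTF-8, but still only a guess:
    // plain ASCII, for instance, is valid UTF-8 and valid Latin-1 alike.
    static boolean looksLikeUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] valid = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD}; // "中" in UTF-8
        byte[] invalid = {(byte) 0xC0, (byte) 0x20};            // illegal lead byte
        System.out.println(looksLikeUtf8(valid));   // true
        System.out.println(looksLikeUtf8(invalid)); // false
    }
}
```

Note that the default `new String(bytes, charset)` constructor silently replaces bad input with U+FFFD, which is why the explicit `REPORT` configuration is needed for validation.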

UTF-16 is a hard one; you can successfully parse the same stream as both BE and LE; either way it will produce some characters (though potentially meaningless text).
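A tiny demonstration of why UTF-16 cannot be validated this way: the same two bytes decode without error under both byte orders, just to different code points.

```java
import java.nio.charset.StandardCharsets;

public class Utf16Ambiguity {
    public static void main(String[] args) {
        // Two bytes, no BOM: both byte orders decode "successfully".
        byte[] data = {(byte) 0x41, (byte) 0x42};
        String be = new String(data, StandardCharsets.UTF_16BE); // U+4142
        String le = new String(data, StandardCharsets.UTF_16LE); // U+4241
        // Both are valid BMP characters; nothing marks either one as "right".
        System.out.printf("BE: U+%04X, LE: U+%04X%n", (int) be.charAt(0), (int) le.charAt(0));
    }
}
```

Almost any even-length byte sequence is decodable as UTF-16 in both orders (only unpaired surrogates fail), so a successful decode proves nothing.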

Some code out there uses statistical analysis to guess the encoding from the frequency of the symbols, but that requires some assumptions about the text (e.g. "this is Mongolian text") and frequency tables (which may not match the text). At the end of the day this remains just a guess, and cannot help in 100% of cases.
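As a toy illustration of the statistical idea, applied to the asker's exact problem: for text that is mostly Latin-script, UTF-16 code units have a 0x00 high byte, so counting which byte parity the zeros fall on suggests the byte order. This is my own sketch, not code from any library, and it fails on text (e.g. mostly CJK) where the high bytes are rarely zero:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Heuristic {
    // Toy heuristic: if zero bytes cluster on even offsets, the high byte
    // comes first (big-endian); on odd offsets, low byte first (little-endian).
    static String guessEncoding(byte[] bytes) {
        int zerosEven = 0, zerosOdd = 0;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == 0) {
                if (i % 2 == 0) zerosEven++; else zerosOdd++;
            }
        }
        int codeUnits = bytes.length / 2;
        if (zerosEven > codeUnits / 2 && zerosEven > zerosOdd) return "UTF-16BE";
        if (zerosOdd > codeUnits / 2 && zerosOdd > zerosEven) return "UTF-16LE";
        return "unknown"; // probably not UTF-16 Latin text
    }

    public static void main(String[] args) {
        byte[] be = "héllo".getBytes(StandardCharsets.UTF_16BE);
        byte[] le = "héllo".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(guessEncoding(be)); // UTF-16BE
        System.out.println(guessEncoding(le)); // UTF-16LE
    }
}
```

Unlike the question's version, this looks at more than the first four bytes and does not care whether the file starts with ASCII or not; it is still only a guess grounded in an assumption about the script.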

The best approach is not to try to implement this yourself. Instead, use an existing library to do it; see Java : How to determine the correct charset encoding of a stream . For instance:

It should be noted that the best that can be done is to guess at the most likely encoding for the file. In the general case, it is impossible to be 100% sure that you've figured out the correct encoding; i.e. the encoding that was used when creating the file.


I would say these third-party libraries also cannot identify encodings for the file I encountered [...] they could be improved to meet my requirement.

Alternatively, you could recognize that your requirement is exceedingly hard to meet ... and change it; e.g.

  • restrict yourself to a certain set of encodings,
  • insist that the person who provides / uploads the file correctly state what its encoding (or primary language) is, and/or
  • accept that your system is going to get it wrong a certain percentage of the time, and provide the means whereby someone can correct incorrectly stated / guessed encodings.

Face the facts: this is a THEORETICALLY UNSOLVABLE problem.

If you are certain that it is a valid Unicode stream, it must be UTF-8 if it has no BOM (since a BOM is neither required nor recommended), and if it does have one, then you know what it is.

If it is just some random encoding, there is no way to know for certain. The best you can hope for is to be wrong only sometimes, since it is impossible to guess correctly in all cases.

If you can limit the possibilities to a very small subset, it is possible to improve the odds of your guess being right.
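A sketch of that "small subset" idea using only the JDK: try each candidate charset in priority order with strict decoding and accept the first clean match. The class name and candidate list are mine, chosen for illustration:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

public class CandidateGuesser {
    // Return the first candidate charset under which the bytes decode
    // without error, or null if none fit.
    static String firstCleanDecode(byte[] bytes, String... candidates) {
        for (String name : candidates) {
            try {
                Charset.forName(name).newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                return name;
            } catch (CharacterCodingException e) {
                // Malformed under this charset; try the next candidate.
            }
        }
        return null;
    }

    public static void main(String[] args) {
        byte[] data = "中文".getBytes(java.nio.charset.StandardCharsets.UTF_8);
        System.out.println(firstCleanDecode(data, "UTF-8", "UTF-16BE", "UTF-16LE"));
        // prints "UTF-8"
    }
}
```

Order matters: list the strictest encodings (UTF-8) first, because permissive ones like UTF-16 or ISO-8859-1 will "cleanly" decode almost anything. Even then this only improves the odds; it does not make the guess certain.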

The only reliable way is to require the provider to tell you what they are providing. If you want complete reliability, that is your only choice. If you do not require reliability, then you guess, but sometimes you guess wrong.

I have the feeling that you must be a Windows person, since the rest of us seldom have cause for BOMs in the first place. I know that I regularly deal with terabytes of text (on Mac, Linux, Solaris, and BSD systems), more than 99% of it UTF-8, and only twice have I come across BOM-laden text. I have heard Windows people get stuck with it all the time, though. If true, this may or may not make your choices easier.
