How do I retrieve the encoding of an XML file so I can parse it correctly? (Best practice)

Question

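My application downloads XML files that happen to be encoded in either UTF-8 or ISO-8859-1 (the software that generates those files is crappy, so it does that). I'm from Germany, so we use umlauts (ä, ü, ö), and the encoding of these files really does make a difference. I know that XmlPullParser has a .getInputEncoding() method which correctly detects how my files are encoded. However, I already have to set the encoding on my FileInputStream, and that happens before I can call .getInputEncoding(). So far I'm just using a BufferedReader to read the XML file and search for the entry that specifies the encoding, and then I instantiate my PullParser.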

private void setFileEncoding() {
    try {
        bufferedReader.reset();
        String firstLine = bufferedReader.readLine();
        int start = firstLine.indexOf("encoding=") + 10; // +10 skips "encoding=" (9 chars) plus the opening quote

        String encoding = firstLine.substring(start, firstLine.indexOf("\"", start));

        // now set the encoding to the reader to be used for parsing afterwards
        bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream, encoding));
        bufferedReader.mark(0);
    } catch (IOException e) {
        e.printStackTrace();
    }
}

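Is there another way to do this? Can I make use of the .getInputEncoding() method? Right now that method seems useless to me: what good is the detected encoding if I have to set the encoding before I can ask for it?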

Answer 1

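If you trust the creator of the XML to have set the encoding correctly in the XML declaration, you can sniff it as you are doing. Be aware, though, that the declaration can be wrong: it may disagree with the actual encoding.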
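A minimal sketch of that sniffing step, done on the raw bytes so that no Reader has to be created (and its encoding guessed) before the declaration has been read. The class name XmlDeclarationSniffer and the 256-byte limit are illustrative assumptions, and the approach only works for ASCII-compatible encodings such as UTF-8 and ISO-8859-1, not for UTF-16:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of XmlPullParser or any library.
final class XmlDeclarationSniffer {
    // Matches encoding="..." or encoding='...' inside the XML declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    /**
     * Returns the encoding named in the XML declaration, or "UTF-8"
     * (the XML default) if there is none. The stream is rewound
     * afterwards, so it can be handed to the parser from the start.
     */
    static String sniff(BufferedInputStream in) throws IOException {
        final int limit = 256; // assumed to be enough for the declaration
        in.mark(limit);
        byte[] head = new byte[limit];
        int n = 0;
        while (n < limit) {
            int r = in.read(head, n, limit - n);
            if (r < 0) break; // end of stream
            n += r;
        }
        in.reset();
        // ISO-8859-1 maps every byte to one char, so this decode cannot fail.
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(prefix);
        return m.find() ? m.group(1) : "UTF-8";
    }
}
```

The result can then be used as `new InputStreamReader(in, XmlDeclarationSniffer.sniff(in))`, with `in` being a BufferedInputStream wrapped around the FileInputStream.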

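If you want to detect the encoding directly, independent of the (potentially wrong) encoding in the XML declaration, use a library such as ICU CharsetDetector or the older jChardet.

ICU CharsetDetector: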

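import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

CharsetDetector detector;
CharsetMatch match;
byte[] byteData = ...;

detector = new CharsetDetector();

detector.setText(byteData);
match = detector.detect();
// match.getName() is the detected charset name, match.getConfidence() a score from 0 to 100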

jChardet:

    // Initialize the nsDetector
    int lang = (argv.length == 2)? Integer.parseInt(argv[1])
                                     : nsPSMDetector.ALL ;
    nsDetector det = new nsDetector(lang) ;

    // Set an observer...
    // The Notify() will be called when a matching charset is found.

    det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                HtmlCharsetDetector.found = true ;
                System.out.println("CHARSET = " + charset);
            }
    });

    URL url = new URL(argv[0]);
    BufferedInputStream imp = new BufferedInputStream(url.openStream());

    byte[] buf = new byte[1024] ;
    int len;
    boolean done = false ;
    boolean isAscii = true ;

    while( (len=imp.read(buf,0,buf.length)) != -1) {

            // Check if the stream is only ascii.
            if (isAscii)
                isAscii = det.isAscii(buf,len);

            // DoIt if non-ascii and not done yet.
            if (!isAscii && !done)
                done = det.DoIt(buf,len, false);
    }
    det.DataEnd();

    if (isAscii) {
       System.out.println("CHARSET = ASCII");
       found = true ;
    }
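
For the specific case in the question, where only UTF-8 and ISO-8859-1 occur, a JDK-only check can stand in for a detector library. This is a different technique from the statistical detection that ICU and jChardet perform: it simply tries a strict UTF-8 decode and falls back to ISO-8859-1 when the bytes are not valid UTF-8. The class name Utf8OrLatin1 is made up for illustration, and the sketch assumes no third encoding ever shows up. ISO-8859-1 text containing umlauts is almost never valid UTF-8, but a false positive is not strictly impossible.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: assumes the input is either UTF-8 or ISO-8859-1.
final class Utf8OrLatin1 {
    static Charset guess(byte[] data) {
        try {
            // REPORT makes the decoder throw on invalid byte sequences
            // instead of silently substituting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so under the stated assumption it is ISO-8859-1.
            return StandardCharsets.ISO_8859_1;
        }
    }
}
```

Pure ASCII input is reported as UTF-8, which is harmless because ASCII bytes decode identically in both encodings.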

Answer 2

If your server sends it correctly, you may be able to get the correct charset from the Content-Type header.
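
A rough sketch of pulling the charset parameter out of a Content-Type value such as `text/xml; charset=ISO-8859-1`. The class name ContentTypeCharset is invented for this example, and the parsing is deliberately simple (it splits on `;` and does not handle every quoting form the HTTP specification allows):

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;

// Hypothetical helper for reading the charset parameter of a Content-Type value.
final class ContentTypeCharset {
    static Optional<Charset> parse(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String param : contentType.split(";")) {
            String p = param.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Optional.of(Charset.forName(name));
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name: treat as absent.
                    return Optional.empty();
                }
            }
        }
        return Optional.empty(); // header carries no charset parameter
    }
}
```

With java.net.URLConnection, the header value is available from `connection.getContentType()`. An empty result means the header gave no usable charset and another method is needed.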

Answer 1: 0, accepted, 2016-09-19 03:14:40
Answer 2: 0, 2016-09-19 04:40:47
