How to retrieve the encoding of an XML file to parse it correctly? (Best Practice)

Question

My application downloads XML files that are encoded in either UTF-8 or ISO-8859-1 (the software that generates those files is crappy, so it does that). I'm from Germany, so the files contain umlauts (ä, ü, ö), and it really makes a difference how they are encoded. I know that XmlPullParser has a method .getInputEncoding() which correctly detects how my files are encoded. However, I already have to set the encoding when I wrap my FileInputStream in a reader, which is before I get to call .getInputEncoding(). So far I'm just using a BufferedReader to read the XML file, search for the entry that specifies the encoding, and then instantiate my PullParser afterwards.

private void setFileEncoding() {
    try {
        bufferedReader.reset();
        String firstLine = bufferedReader.readLine();
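        // skip past encoding= and the opening quote (assumes double quotes)
        int start = firstLine.indexOf("encoding=") + "encoding=".length() + 1;

        String encoding = firstLine.substring(start, firstLine.indexOf("\"", start));

        // now set the encoding to the reader to be used for parsing afterwards
        // NOTE: the old reader has already consumed bytes from fileInputStream,
        // so the stream must be reopened or repositioned for this to start at the top
        bufferedReader = new BufferedReader(new InputStreamReader(fileInputStream, encoding));
        bufferedReader.mark(0);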
    } catch (IOException e) {
        e.printStackTrace();
    }
}

Is there a different way to do this? Can I take advantage of the .getInputEncoding() method? Right now the method seems kind of useless to me: why would the encoding matter if I already had to set it before being able to check for it?

Answer 1

If you trust the creator of the XML to have set the encoding correctly in the XML declaration, you can sniff it as you're doing. However, be aware that the declaration can be wrong: it can disagree with the actual encoding of the file.
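
If you do sniff the declaration, it is safer to do it on the raw bytes before any Reader is created, so nothing has been decoded with the wrong charset yet. Below is a minimal sketch using only the JDK. The class name XmlDeclarationSniffer, the method declaredEncoding and the 256-byte look-ahead are made up for illustration, and it assumes an ASCII-compatible encoding (true for UTF-8 and ISO-8859-1, not for UTF-16).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any library.
final class XmlDeclarationSniffer {
    // Matches encoding="..." or encoding='...' inside the <?xml ... ?> declaration.
    private static final Pattern ENCODING = Pattern.compile(
            "<\\?xml[^>]*?\\bencoding\\s*=\\s*([\"'])([A-Za-z][A-Za-z0-9._-]*)\\1");

    // Assumed look-ahead size; the declaration sits at the very start of the file.
    private static final int LIMIT = 256;

    /**
     * Returns the encoding named in the XML declaration, or fallback if there is none.
     * The stream is rewound afterwards, so it can be handed to the parser unchanged.
     */
    static String declaredEncoding(InputStream in, String fallback) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("wrap the stream in a BufferedInputStream first");
        }
        in.mark(LIMIT);
        byte[] head = new byte[LIMIT];
        int n = 0;
        int r;
        while (n < LIMIT && (r = in.read(head, n, LIMIT - n)) != -1) {
            n += r;
        }
        in.reset();

        // ISO-8859-1 maps every byte to one char, so this decode cannot fail;
        // the declaration itself only contains ASCII characters.
        String text = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = ENCODING.matcher(text);
        return m.find() ? m.group(2) : fallback;
    }
}
```

Typical use would be to wrap the FileInputStream in a BufferedInputStream, call declaredEncoding(stream, "UTF-8"), and then create the InputStreamReader on that same stream with the returned name.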

If you want to detect the encoding directly, independently of the (potentially wrong) XML declaration encoding setting, use a library such as ICU CharsetDetector or the older jChardet.

ICU CharsetDetector:

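import com.ibm.icu.text.CharsetDetector;
import com.ibm.icu.text.CharsetMatch;

byte[] byteData = ...;   // raw bytes of the file

CharsetDetector detector = new CharsetDetector();
detector.setText(byteData);

// best guess; detect() returns null if no charset matches
CharsetMatch match = detector.detect();
String encoding = (match != null) ? match.getName() : null;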

jChardet:

    // Initialize the nsDetector() ;
    int lang = (argv.length == 2)? Integer.parseInt(argv[1])
                                     : nsPSMDetector.ALL ;
    nsDetector det = new nsDetector(lang) ;

    // Set an observer...
    // The Notify() will be called when a matching charset is found.

    det.Init(new nsICharsetDetectionObserver() {
            public void Notify(String charset) {
                HtmlCharsetDetector.found = true ;
                System.out.println("CHARSET = " + charset);
            }
    });

    URL url = new URL(argv[0]);
    BufferedInputStream imp = new BufferedInputStream(url.openStream());

    byte[] buf = new byte[1024] ;
    int len;
    boolean done = false ;
    boolean isAscii = true ;

    while( (len=imp.read(buf,0,buf.length)) != -1) {

            // Check if the stream is only ascii.
            if (isAscii)
                isAscii = det.isAscii(buf,len);

            // DoIt if non-ascii and not done yet.
            if (!isAscii && !done)
                done = det.DoIt(buf,len, false);
    }
    det.DataEnd();

    if (isAscii) {
       System.out.println("CHARSET = ASCII");
       found = true ;
    }
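
For the specific case in the question, where a file is only ever UTF-8 or ISO-8859-1, a detection library may not be necessary. A different technique from the two above is to try a strict UTF-8 decode and fall back to ISO-8859-1 if it fails, since ISO-8859-1 text with umlauts is almost never valid UTF-8. This is only a sketch: Utf8OrLatin1 and guess are hypothetical names, and it assumes no other encodings can occur.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of any library.
final class Utf8OrLatin1 {
    /** Returns UTF-8 if the bytes decode cleanly as UTF-8, otherwise ISO-8859-1. */
    static Charset guess(byte[] data) {
        try {
            // REPORT makes the decoder throw instead of silently inserting U+FFFD.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            // Not valid UTF-8; under the stated assumption it must be ISO-8859-1.
            return StandardCharsets.ISO_8859_1;
        }
    }
}
```

Pure ASCII content is reported as UTF-8, which is harmless because both encodings agree on ASCII. The guess can be wrong in the rare case where ISO-8859-1 bytes happen to form a valid UTF-8 sequence.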

Answer 2

If your server sends it correctly, you may be able to get the correct charset from the Content-Type header.
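
As a rough sketch of that idea, the charset parameter can be pulled out of the header value with plain string handling. ContentTypeCharset and charsetOf are hypothetical names, and the parsing is deliberately simplified (it does not cover every form the HTTP grammar allows).

```java
import java.util.Locale;

// Hypothetical helper, not part of any library.
final class ContentTypeCharset {
    /**
     * Extracts the charset parameter from a Content-Type value such as
     * "text/xml; charset=ISO-8859-1". Returns fallback if it is missing.
     */
    static String charsetOf(String contentType, String fallback) {
        if (contentType == null) {
            return fallback;
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? fallback : value;
            }
        }
        return fallback;
    }
}
```

With an HttpURLConnection, the input would be the result of getContentType(). Many servers omit the charset or send a wrong one, so the result is best treated as a hint.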

Answer 1: 0, accepted, 2016-09-19 03:14:40
Answer 2: 0, 2016-09-19 04:40:47
