
Dynamic SAX Parser for UTF-8 or ISO-8859-1 encoded XML

I am developing an app for Android where I have to parse different XML files. Most of them are encoded in UTF-8, but a few may be encoded in ISO-8859-1.

  HttpURLConnection con = (HttpURLConnection) url.openConnection();
  ...
  in = con.getInputStream();
  InputSource is = new InputSource(in);
  ...
  parser.parse(is, handler);

My code for handling the input looks like the above. The Java documentation says this about InputSource:

If there is no character stream, but there is a byte stream, the parser will use that byte stream, using the encoding specified in the InputSource or else (if no encoding is specified) autodetecting the character encoding using an algorithm such as the one in the XML specification.

I am passing in a byte stream and I don't specify an encoding, so according to the documentation the encoding should be autodetected. But it isn't. All files encoded in UTF-8 parse fine, but the ISO-8859-1 ones do not (I am getting a Parser Expat... Exception for some invalid characters). If I manually set the encoding of the InputSource to "ISO-8859-1", it behaves the other way round.

How can I solve this? I searched Google and Stack Overflow for hours without finding a solution. I also tried passing a character stream to the InputSource, but some characters (äöüÄÖÜß) in the ISO-8859-1 files are still displayed as "?" in my app.

Thanks in advance!

I would suggest checking whether there are characters outside the old ASCII set, and re-encoding the string if it seems to contain UTF-8 characters:

String output=new String(input.getBytes("8859_1"), "utf-8");

That line takes a string that was decoded as ISO-8859-1 and re-decodes its bytes as UTF-8.
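To make the effect of that line concrete, here is a minimal sketch. The class and method names (`ReencodeDemo`, `latin1ToUtf8`) are made up for illustration. It shows that the conversion only repairs text whose UTF-8 bytes were wrongly decoded as ISO-8859-1; applied to text that really was ISO-8859-1, it damages the non-ASCII characters.

```java
import java.nio.charset.StandardCharsets;

final class ReencodeDemo {
    // Hypothetical helper: same operation as the one-liner above, using Charset constants.
    static String latin1ToUtf8(String input) {
        return new String(input.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "ä" encoded as UTF-8 is the two bytes 0xC3 0xA4.
        byte[] utf8Bytes = "\u00e4".getBytes(StandardCharsets.UTF_8);

        // Decoding those bytes as ISO-8859-1 yields two wrong characters (U+00C3, U+00A4).
        String misdecoded = new String(utf8Bytes, StandardCharsets.ISO_8859_1);

        // The re-encoding step restores the original character.
        System.out.println(latin1ToUtf8(misdecoded).equals("\u00e4"));

        // Text that was correctly decoded does not survive the same treatment.
        System.out.println(latin1ToUtf8("\u00e4").equals("\u00e4"));
    }
}
```

So the step should only be applied after checking that the string really looks like misdecoded UTF-8.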

The best solution depends on the exact cause of your problem. If you retrieve an XML document over HTTP, the encoding may also be specified in the Content-Type response header and not necessarily in the XML document itself. If that is the case and the XML libraries in Android are correctly implemented (I have no way to check here whether the Content-Type header is evaluated), you should be able to create the InputSource directly from the URL instead: new InputSource("http://...");
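If the library turns out not to evaluate the header, the charset can be read from it by hand and passed to the InputSource. The following is only a sketch: `ContentTypeCharset`, `charsetOf` and `toInputSource` are hypothetical names, and the parsing handles just the common `type/subtype; charset=...` form, not every legal header.

```java
import java.io.InputStream;
import java.util.Locale;
import org.xml.sax.InputSource;

final class ContentTypeCharset {
    // Returns the charset parameter of a Content-Type value, or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = p.substring("charset=".length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }

    // Builds a byte-stream InputSource and sets its encoding only when the header names one.
    static InputSource toInputSource(InputStream in, String contentType) {
        InputSource is = new InputSource(in);
        String charset = charsetOf(contentType);
        if (charset != null) {
            is.setEncoding(charset);
        }
        return is;
    }
}
```

With the connection from the question, this would be called as `toInputSource(con.getInputStream(), con.getContentType())`. When the header carries no charset, the InputSource is left without an explicit encoding, so the parser falls back to its own detection.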

If the encoding is not set in the HTTP header and not specified in the XML prologue, the parser operates correctly if it assumes UTF-8 encoding (as mandated by the XML specification). The autodetection mentioned in the documentation does not mean that the parser actually looks into the document content to guess the encoding; it means that it checks the encoding attribute of the XML declaration. If the encoding attribute is missing, it defaults to UTF-8.
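The rule described here (use the declared encoding, otherwise UTF-8) can be sketched as a small check you can run against the first bytes of a file to see what it actually declares. This is an illustration, not how any particular parser is implemented. `DeclaredEncoding` and `declaredEncoding` are hypothetical names, and the sketch assumes an ASCII-compatible encoding and no byte order mark before the declaration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class DeclaredEncoding {
    // Matches the encoding pseudo-attribute of an XML declaration at the very start of the input.
    private static final Pattern DECLARATION = Pattern.compile(
            "^<\\?xml[^>]*?\\sencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // Returns the encoding named in the XML declaration, or "UTF-8" if none is declared.
    static String declaredEncoding(byte[] head) {
        // Only the first bytes matter; ISO-8859-1 maps every byte to a char, so decoding never fails.
        int n = Math.min(head.length, 200);
        String prefix = new String(head, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECLARATION.matcher(prefix);
        return m.find() ? m.group(1) : "UTF-8";
    }
}
```

If this reports UTF-8 for a file whose bytes are really ISO-8859-1, the file is mislabeled, and a parser that follows the rule above will reject it.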

The most straightforward way would be to try UTF-8 first and, if the parser throws an exception for an invalid byte, reparse the document as Windows-1252. I suggest 1252 because I doubt you will see anyone using the ISO-8859-1 C1 characters, whereas people use Windows-1252 characters and claim the text is ISO-8859-1 all the time.
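A minimal sketch of that fallback follows, assuming the whole document has already been read into a byte array so that it can be parsed twice. `FallbackParse`, `TextHandler` and `parseText` are hypothetical names, and the handler merely collects character data as a stand-in for a real one. Depending on the parser, an invalid byte may surface as either a SAXException or an IOException, so the sketch retries on both.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

final class FallbackParse {
    // Stand-in handler that concatenates all character data.
    static final class TextHandler extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    static String parseText(byte[] xml)
            throws ParserConfigurationException, SAXException, IOException {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        try {
            // First attempt: byte stream, so the parser picks the encoding itself.
            TextHandler handler = new TextHandler();
            factory.newSAXParser().parse(new InputSource(new ByteArrayInputStream(xml)), handler);
            return handler.text.toString();
        } catch (SAXException | IOException firstFailure) {
            // Second attempt: decode the bytes ourselves as Windows-1252.
            // A fresh handler is used because the first one may hold partial output.
            TextHandler handler = new TextHandler();
            InputSource source = new InputSource(new InputStreamReader(
                    new ByteArrayInputStream(xml), Charset.forName("windows-1252")));
            factory.newSAXParser().parse(source, handler);
            return handler.text.toString();
        }
    }
}
```

Note that the retry also fires for documents that are simply not well-formed; in that case the second parse fails too and its exception is the one the caller sees.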

I suggest letting SAX decide on the encoding; it will know it from the encoding attribute of the XML declaration:

<?xml version="1.0" encoding="utf-8"?>

Note: if there is no XML declaration, which is legal, then the encoding is assumed to be UTF-8.

If you use a byte stream InputSource, as in your example, and do not set the InputSource encoding explicitly, then SAX will take the encoding from the XML declaration.

UPDATE

Try this test. It writes an XML string to the file 1.xml in ISO-8859-1. Then SAX parses it and prints the root element text (only one character, 'ä'). SAX is supposed to understand that 1.xml uses ISO-8859-1; otherwise the output will be distorted.

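import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Write the document to 1.xml as ISO-8859-1 bytes; the declaration names the same encoding.
String xml = "<?xml version='1.0' encoding='iso-8859-1'?><root>ä</root>";
OutputStreamWriter wrt = new OutputStreamWriter(new FileOutputStream(
        "1.xml"), "iso-8859-1");
wrt.write(xml);
wrt.close();

// Parse from a plain byte stream, with no encoding set by hand.
SAXParserFactory sf = SAXParserFactory.newInstance();
SAXParser p = sf.newSAXParser();
p.parse(new FileInputStream("1.xml"), new DefaultHandler() {
    @Override
    public void characters(char[] ch, int start, int length)
            throws SAXException {
        // Print the code point of the first character, then the text itself.
        System.out.println((int)ch[start]);
        System.out.println(String.valueOf(ch, start, length));
    }
});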

Output:

228
ä

It is correct. SAX understands that the XML encoding is 'iso-8859-1'.
