
Parsing non-ASCII characters in an XML document

I'm trying to parse this XML document with a SAX parser:

<?xml version="1.0" encoding="ISO-8859-1"?>
<!DOCTYPE WIN_TPBOUND_MESSAGES SYSTEM "tpbound_messages_v1.dtd">
<WIN_TPBOUND_MESSAGES>
    <SMSTOTP>
        <SOURCE_ADDR>+447522579247</SOURCE_ADDR>
        <TEXT>TEST: @£$¥èéùìò?ØøÅå&amp; ^{}\\[~]¡&#8364;ÆæßÉ!\"#¤%'()*+,-./0123456789:;&lt;=&gt;? ÄÖÑܧ¿äöñüà end</TEXT>
        <WINTRANSACTIONID>652193268</WINTRANSACTIONID>
    </SMSTOTP>
</WIN_TPBOUND_MESSAGES>

After parsing the <TEXT> element, the content is converted to:

TEST: @£$¥èéùìò?Ã�øÃ�Ã¥& ^{}\\[~]¡€Ã�æÃ�Ã�!\"#¤%'()*+,-./0123456789:;<=>? Ã�Ã�Ã�Ã�§¿äöñüà end

So clearly something bad is happening to the non-ASCII characters. The code that parses the XML is shown below:

public void parse(InputStream xmlStream) throws WinGatewayException {
    XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
    parser.setContentHandler(this);
    parser.setErrorHandler(error);
    parser.setEntityResolver(new DTDResolver());
    parser.setDTDHandler(this);
    parser.setFeature("http://xml.org/sax/features/validation", true);
    parser.setFeature("http://apache.org/xml/features/validation/schema", true);
    parser.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", true);
    parser.setFeature("http://xml.org/sax/features/namespace-prefixes", true);
    parser.setFeature("http://apache.org/xml/features/continue-after-fatal-error", false);
    parser.parse(new InputSource(xmlStream));
}

The object referred to by "this" (the content handler) has methods such as:

public void endElement(String uri, String localName, String qName)
        throws SAXException {
    if (localName.equals("TEXT")) {
        logger.debug("Parsed message text: " + cData.toString());
        message.setText(cData.toString());
    }
}

Why aren't these non-ASCII characters being preserved by the XML parser?

I believe your XML file is actually in UTF-8 rather than ISO-8859-1.

An ISO-8859-1-encoded file would have a single byte per character, so the UK pound sign would be a single byte 0xA3. However, it looks like your file has 0xC2 0xA3, which is the byte sequence you'd get for U+00A3 in UTF-8.
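To make the byte-level difference concrete, here is a small sketch that encodes the pound sign both ways and then decodes the UTF-8 bytes as if they were ISO-8859-1. The class name EncodingDemo and its helper methods are made up for illustration; only the standard java.nio.charset API is used.

```java
import java.nio.charset.StandardCharsets;

public class EncodingDemo {
    // Render a byte array as space-separated upper-case hex, e.g. "C2 A3".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode UTF-8 bytes with the wrong charset, as a parser would if the
    // XML declaration said ISO-8859-1 but the file was really UTF-8.
    public static String misreadUtf8AsLatin1(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String pound = "\u00A3"; // U+00A3 POUND SIGN

        // One byte in ISO-8859-1, two bytes in UTF-8.
        System.out.println(hex(pound.getBytes(StandardCharsets.ISO_8859_1))); // A3
        System.out.println(hex(pound.getBytes(StandardCharsets.UTF_8)));      // C2 A3

        // Each UTF-8 byte turns into its own character: U+00C2 then U+00A3.
        String misread = misreadUtf8AsLatin1(pound);
        System.out.println(misread.length()); // 2
    }
}
```

Every non-ASCII character takes two or more bytes in UTF-8, so a parser reading the file as ISO-8859-1 produces two or more characters for each one.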

Change the XML declaration to reflect this:

<?xml version="1.0" encoding="UTF-8"?>

and see if that fixes things. Assuming it does, you then need to work out what produced this mislabelled data in the first place.
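If you cannot change the declaration at the source, one possible workaround is to decode the bytes yourself and hand the parser a character stream. Per the SAX InputSource documentation, a parser given a character stream reads it directly and disregards the encoding declaration. The sketch below assumes the data really is UTF-8. The class ForcedEncodingParse, its methods and the sample document are invented for illustration, and it uses the JDK's default SAX parser rather than the Xerces class named in the question.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class ForcedEncodingParse {
    // Minimal handler that collects all character data in the document.
    static class TextCollector extends DefaultHandler {
        final StringBuilder text = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }
    }

    // Byte stream: the parser trusts the encoding in the XML declaration.
    public static String parseAsDeclared(InputStream xmlStream) throws Exception {
        TextCollector handler = new TextCollector();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(xmlStream), handler);
        return handler.text.toString();
    }

    // Character stream: we decode as UTF-8 ourselves, so the declared
    // encoding is ignored. Assumes the bytes really are UTF-8.
    public static String parseAsUtf8(InputStream xmlStream) throws Exception {
        TextCollector handler = new TextCollector();
        InputSource source = new InputSource(
                new InputStreamReader(xmlStream, StandardCharsets.UTF_8));
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return handler.text.toString();
    }

    // Hypothetical sample: UTF-8 bytes wrongly labelled as ISO-8859-1.
    public static byte[] mislabelledSample() {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<TEXT>\u00A3\u00E9</TEXT>";
        return xml.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        String garbled = parseAsDeclared(new ByteArrayInputStream(mislabelledSample()));
        String fixed = parseAsUtf8(new ByteArrayInputStream(mislabelledSample()));
        System.out.println(garbled.length()); // 4: each UTF-8 byte became a char
        System.out.println(fixed.length());   // 2: pound sign and e-acute
    }
}
```

This only hides the symptom. Correcting the declaration, or the system that writes the file, is still the real fix.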
