
Parse XML file containing umlaute using SAX parser

I have looked through a lot of posts about the same problem, but I can't figure it out. I'm trying to parse an XML file with umlauts in it. This is what I have now:

File file = new File(this.xmlConfig);
InputStream inputStream = new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");

InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");

saxParser.parse(is, handlerConfig);

But it doesn't read the umlauts properly: Ä, Ü and Ö come out as weird characters. The file is definitely in UTF-8, and it is declared as such in the first line: <?xml version="1.0" encoding="utf-8"?>

What am I doing wrong?

First rule: Don't second-guess the encoding used in the XML document. Always use byte streams to parse XML documents, so the parser can determine the encoding from the document itself:

// Pass raw bytes; the parser reads the encoding from the XML declaration.
InputStream inputStream = new FileInputStream(this.xmlConfig);
InputSource is = new InputSource(inputStream);
saxParser.parse(is, handlerConfig);

If that doesn't work, the encoding declared in the <?xml version=".." encoding="UTF-8" ?> line (or whatever it says) does not match the actual bytes in the file, and you have to take it from there.
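To make the first rule concrete, here is a minimal, self-contained sketch. The class name ByteStreamSaxDemo, the helper parseText and the <name> element are made up for illustration; an in-memory byte array stands in for the file so the example can run anywhere.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class ByteStreamSaxDemo {

    // Parses XML from raw bytes and returns all character data it contains.
    static String parseText(InputStream in) {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // SAX may deliver one text node in several chunks, so append.
                text.append(ch, start, length);
            }
        };
        try {
            // No Reader and no setEncoding(): the parser decides the encoding.
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(in), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("XML parsing failed", e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // \u00C4 \u00DC \u00D6 are the umlauts A, U and O, written as escapes
        // so the example does not depend on the source file encoding.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<name>\u00C4\u00DC\u00D6</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        String text = parseText(new ByteArrayInputStream(bytes));
        // Compare in code instead of printing, since the console encoding
        // could itself garble the output.
        System.out.println(text.equals("\u00C4\u00DC\u00D6"));
    }
}
```

The comparison is done in code rather than by eye because a misconfigured console can garble the output even when the parse result is correct.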

Second rule: Make sure you inspect the result with a tool that supports the encoding used in the target (result) document. Have you?

Third rule: Check the byte values in the source document. Bring up your favourite hex editor/viewer and inspect the content. For example, the letter Ä should be the byte sequence 0xC3 0x84 if the encoding is UTF-8.
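If no hex editor is at hand, a few lines of Java can do the same check. This is only a sketch: the class name HexBytes and the helper toHex are made up here, and the file path is whatever you pass on the command line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

class HexBytes {

    // Formats each byte as two uppercase hex digits, separated by spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // \u00C4 is the letter A with diaeresis.
        System.out.println(toHex("\u00C4".getBytes(StandardCharsets.UTF_8)));      // C3 84
        System.out.println(toHex("\u00C4".getBytes(StandardCharsets.ISO_8859_1))); // C4

        // Optionally dump a whole file given as the first argument.
        if (args.length > 0) {
            System.out.println(toHex(Files.readAllBytes(Paths.get(args[0]))));
        }
    }
}
```

If the umlaut in your file shows up as the single byte C4, the file is not UTF-8, whatever its XML declaration says.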

Fourth rule: If it doesn't look correct, always suspect that the UTF-8 source is being viewed, or interpreted, as an ISO-8859-1 source. Verify this by comparing the first and second byte from the UTF-8 source with the ISO 8859-1 code charts.
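The suspicion can also be tested in code rather than against the code charts. The sketch below is a diagnostic under stated assumptions, not a general repair tool: the names MojibakeCheck and undoLatin1Misread are hypothetical, and it only covers a strict ISO-8859-1 misread (it will not recognise text that went through windows-1252 instead).

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class MojibakeCheck {

    // Returns the repaired text if s looks like UTF-8 bytes that were decoded
    // as ISO-8859-1, or null if that explanation does not fit.
    static String undoLatin1Misread(String s) {
        // A char above U+00FF cannot come out of an ISO-8859-1 decode.
        for (int i = 0; i < s.length(); i++) {
            if (s.charAt(i) > 0xFF) {
                return null;
            }
        }
        // Recover the original bytes, then decode them strictly as UTF-8.
        byte[] raw = s.getBytes(StandardCharsets.ISO_8859_1);
        try {
            String fixed = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
            // Pure ASCII survives unchanged, which proves nothing.
            return fixed.equals(s) ? null : fixed;
        } catch (CharacterCodingException e) {
            return null; // the bytes are not valid UTF-8
        }
    }

    public static void main(String[] args) {
        // "\u00C3\u00A4" is what a misread small a with diaeresis looks like.
        String fixed = undoLatin1Misread("\u00C3\u00A4");
        System.out.println("\u00E4".equals(fixed)); // true
    }
}
```

A non-null result is strong evidence that some step decoded UTF-8 bytes as ISO-8859-1. The real fix is still to find and correct that step, not to patch the strings afterwards.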

UPDATE:

The byte sequence for the Unicode letter ä (Latin small letter a with diaeresis, U+00E4) is 0xC3 0xA4 in the UTF-8 encoding. If you use a viewing tool that only understands (or is configured to interpret the source as) ISO-8859-1 encoding, the first byte, 0xC3, is the letter Ã, and the second byte, 0xA4, is the character ¤, the currency sign (Unicode U+00A4), which may look like a circle.
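You can reproduce this effect with the same two bytes. The class MisdecodeDemo and its helper describe are illustrative names only. The output is printed as code points so that the console encoding cannot interfere.

```java
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {

    // Lists the code points of s in U+XXXX notation, separated by spaces.
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", cp));
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of the small letter a with diaeresis.
        byte[] utf8 = {(byte) 0xC3, (byte) 0xA4};

        String right = new String(utf8, StandardCharsets.UTF_8);
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);

        System.out.println(describe(right)); // U+00E4
        System.out.println(describe(wrong)); // U+00C3 U+00A4
    }
}
```

One character turning into two is the typical signature of this mix-up.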

Hence, the "TextView" thingy in Android seems to be interpreting your input as an ISO-8859-1 stream. I have no idea whether it is possible to change that. But if you have your parsing result as a String or a byte array, you could convert it to an ISO-8859-1 stream (or byte array) and then feed that to "TextView".

