Parsing an XML file containing umlauts with a SAX parser

Question

I have looked through a lot of posts about the same problem, but I can't figure it out. I am trying to parse an XML file with umlauts in it. This is what I have now:

File file = new File(this.xmlConfig);
InputStream inputStream= new FileInputStream(file);
Reader reader = new InputStreamReader(inputStream,"UTF-8");

InputSource is = new InputSource(reader);
is.setEncoding("UTF-8");

saxParser.parse(is, handlerConfig);

But it doesn't read the umlauts properly: Ä, Ü and Ö come out as weird characters. The file is definitely in UTF-8, and it is declared as such in the first line: <?xml version="1.0" encoding="utf-8"?>

What am I doing wrong?

Answer 1

First rule: Don't second-guess the encoding used in the XML document. Always use byte streams to parse XML documents, so that the parser itself determines the encoding from the XML declaration:

InputStream inputStream= new FileInputStream(this.xmlConfig);
InputSource is = new InputSource(inputStream);
saxParser.parse(is, handlerConfig);

If that doesn't work, the <?xml version=".." encoding="UTF-8" ?> (or whatever) declaration in the XML is wrong, and you have to take it from there.
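To make the first rule concrete, here is a minimal, self-contained sketch. It is only an illustration under assumptions: the class name ByteStreamSaxDemo, the helper parseText and the in-memory sample document are made up for this example and are not from the question. The umlauts are written as Unicode escapes so that the encoding of the Java source file cannot interfere.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original question.
class ByteStreamSaxDemo {

    // Hands the raw bytes to the parser (no Reader in between) and
    // returns all character data found in the document.
    static String parseText(InputStream in) throws Exception {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        };
        // The parser reads the encoding from the XML declaration itself.
        saxParser.parse(new InputSource(in), handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        // Sample document containing A-umlaut, O-umlaut and U-umlaut.
        String xml = "<?xml version=\"1.0\" encoding=\"utf-8\"?>"
                + "<name>\u00C4\u00D6\u00DC</name>";
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);

        String text = parseText(new ByteArrayInputStream(bytes));
        // Compare in memory instead of printing, so the console encoding
        // cannot distort the check.
        System.out.println(text.equals("\u00C4\u00D6\u00DC"));
    }
}
```

If the comparison succeeds but the text still looks wrong on screen, the parsing is fine and the problem lies in whatever displays the result.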

Second rule: Make sure you inspect the result with a tool that supports the encoding used in the target, or result, document. Have you?

Third rule: Check the byte values in the source document. Bring up your favourite hex editor/viewer and inspect the content. For example, the letter Ä should be the byte sequence 0xC3 0x84 if the encoding is UTF-8.
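If no hex editor is at hand, the same check can be done in a few lines of Java. This is a rough sketch; HexBytesDemo and toHex are hypothetical names chosen for the example. It prints the bytes of Ä (written as the escape \u00C4) in UTF-8 and, for comparison, in ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper class for inspecting byte values.
class HexBytesDemo {

    // Formats each byte as 0xNN, separated by single spaces.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so the byte is treated as an unsigned value.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00C4"; // capital A with diaeresis

        // UTF-8 needs two bytes for this letter.
        System.out.println(toHex(aUmlaut.getBytes(StandardCharsets.UTF_8)));      // 0xC3 0x84
        // ISO-8859-1 needs only one.
        System.out.println(toHex(aUmlaut.getBytes(StandardCharsets.ISO_8859_1))); // 0xC4
    }
}
```

To inspect the real file, the bytes returned by java.nio.file.Files.readAllBytes could be passed to toHex instead.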

Fourth rule: If it doesn't look correct, always suspect that the UTF-8 source is being viewed, or interpreted, as an ISO-8859-1 source. Verify this by comparing the first and second byte from the UTF-8 source with the ISO 8859-1 code charts.

UPDATE:

The byte sequence for the Unicode letter ä (Latin small letter a with diaeresis, U+00E4) is 0xC3 0xA4 in the UTF-8 encoding. If you use a viewing tool that only understands (or is configured to interpret the source as) ISO-8859-1, the first byte, 0xC3, is the letter Ã, and the second byte, 0xA4, is the character ¤, the currency sign (Unicode U+00A4), which may look like a circle.
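The misreading described here is easy to reproduce. The sketch below uses hypothetical names (MojibakeDemo, misreadAsLatin1, repair): it encodes ä as UTF-8, decodes those bytes as ISO-8859-1, and then reverses the mistake. The reversal is included only to show that the damage is a pure decoding error; it is not something the answer itself proposes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo of UTF-8 bytes being decoded as ISO-8859-1.
class MojibakeDemo {

    // Encodes the text as UTF-8, then decodes the bytes with the wrong charset.
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Undoes the mistake: recovers the original bytes and decodes them as UTF-8.
    static String repair(String garbled) {
        byte[] original = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(original, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String aUmlaut = "\u00E4"; // small a with diaeresis

        String garbled = misreadAsLatin1(aUmlaut);
        // The two characters are U+00C3 (A with tilde) and U+00A4 (currency sign).
        System.out.println(garbled.equals("\u00C3\u00A4")); // true
        System.out.println(repair(garbled).equals(aUmlaut)); // true
    }
}
```

Seeing exactly this pair of characters in place of ä is a strong hint that the parsed data is correct and only the display step is decoding it wrongly.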

Hence, the "TextView" thingy in Android is interpreting your input as an ISO-8859-1 stream. I have no idea whether it is possible to change that. But if you have your parsing result as a String or a byte array, you could convert that to an ISO-8859-1 stream (or byte array), and then feed it to "TextView".

Answer 1 (3, accepted, 2013-08-10 19:52:01)
