
How to convert an originally Latin-1 char[] from SAX parser to a proper UTF-8 String?

I've been trying to use the Java SAX parser to parse an XML file in the ISO-8859-1 character encoding. This otherwise goes well, but special characters such as ä and ö are giving me a headache. In short, the ContentHandler.characters(...) method gives me weird characters, and you cannot even use a char array to construct a String with a specified encoding.

Here's a complete minimum working example in two files:

latin1.xml:

<?xml version='1.0' encoding='ISO-8859-1' standalone='no' ?>
<x>Motörhead</x>

That file is saved in the said Latin-1 format, so hexdump gives this:

$ hexdump -C latin1.xml 
00000000  3c 3f 78 6d 6c 20 76 65  72 73 69 6f 6e 3d 27 31  |<?xml version='1|
00000010  2e 30 27 20 65 6e 63 6f  64 69 6e 67 3d 27 49 53  |.0' encoding='IS|
00000020  4f 2d 38 38 35 39 2d 31  27 20 73 74 61 6e 64 61  |O-8859-1' standa|
00000030  6c 6f 6e 65 3d 27 6e 6f  27 20 3f 3e 0a 3c 78 3e  |lone='no' ?>.<x>|
00000040  4d 6f 74 f6 72 68 65 61  64 3c 2f 78 3e           |Mot.rhead</x>|

So the "ö" is encoded with a single byte, f6, as you'd expect.

Then, here's the Java file, saved in the UTF-8 format:

MySAXHandler.java:

import java.io.File;
import java.io.FileReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class MySAXHandler extends DefaultHandler {
private static final String FILE = "latin1.xml"; // Edit this to point to the correct file

@Override
public void characters(char[] ch, int start, int length) {
    char[] dstCharArray = new char[length];
    System.arraycopy(ch, start, dstCharArray, 0, length);
    String strValue = new String(dstCharArray);
    System.out.println("Read: '"+strValue+"'");
    assert("Motörhead".equals(strValue));
}

private XMLReader getXMLReader() {
    try {
        SAXParser saxParser = SAXParserFactory.newInstance().newSAXParser();
        XMLReader xmlReader = saxParser.getXMLReader();
        xmlReader.setContentHandler(new MySAXHandler());
        return xmlReader;
    } catch (Exception ex) {
        throw new RuntimeException("Epic fail.", ex);
    }
}

public void go() {
    try {
        XMLReader reader = getXMLReader();
        reader.parse(new InputSource(new FileReader(new File(FILE))));
    } catch (Exception ex) {
        throw new RuntimeException("The most epic fail.", ex);
    }
}

public static void main(String[] args) {
    MySAXHandler tester = new MySAXHandler();
    tester.go();
}
}

The result of running this program is that it outputs Read: 'Mot rhead' (ö replaced with a "? in a box") and then crashes due to an assertion error. If you look into the char array, you'll see that the char that encodes the letter ö consists of three bytes. They don't make any sense to me, as in UTF-8 an ö should be encoded with two bytes.

What I have tried

I have tried converting the character array to a string, then getting the bytes of that string to pass to another string constructor with a charset encoding parameter. I have also played with CharBuffers and tried to find something in the Locale class that might solve this problem, but nothing I try seems to work.

The problem is that you're using a FileReader to read the file, instead of a FileInputStream as a commenter previously suggested. In the go method, take out the FileReader and replace it with a FileInputStream (and import java.io.FileInputStream instead of java.io.FileReader).

public void go() {
    try {
        XMLReader reader = getXMLReader();
        reader.parse(new InputSource(new FileInputStream(new File(FILE))));
    } catch (Exception ex) {
        throw new RuntimeException("The most epic fail.", ex);
    }
}

The way you have it now, FileReader uses the default platform encoding to decode the bytes into characters before passing them to the SAX parser, which is not what you want. If you replace it with FileInputStream, the XML parser should read the encoding named in the XML declaration (the <?xml ... encoding='ISO-8859-1' ?> line) and handle the character set decoding for you.

Because FileReader is doing the decoding, you're seeing the invalid characters. If you let the SAX parser handle it, it should go through fine.
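To see where the "? in a box" and the three bytes in the question could come from, here is a minimal sketch. It assumes the platform default encoding was UTF-8 (the question does not say so); the class name DecodeDemo and the helper decode are made up for illustration. Decoding the Latin-1 byte f6 as UTF-8 is malformed input, so the decoder substitutes the replacement character U+FFFD, which itself takes three bytes when written out as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // The bytes of "Motörhead" as stored in latin1.xml (ö is the single byte 0xf6).
    static final byte[] LATIN1_BYTES =
            {0x4d, 0x6f, 0x74, (byte) 0xf6, 0x72, 0x68, 0x65, 0x61, 0x64};

    // Hypothetical helper: decode raw bytes with the given charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        // Assumption: this is what a FileReader does when the platform default is UTF-8.
        String wrong = decode(LATIN1_BYTES, StandardCharsets.UTF_8);
        // This is what the XML parser does once it has read the declared encoding.
        String right = decode(LATIN1_BYTES, StandardCharsets.ISO_8859_1);

        System.out.printf("decoded as UTF-8:      char 3 is U+%04X%n", (int) wrong.charAt(3));
        System.out.printf("decoded as ISO-8859-1: char 3 is U+%04X%n", (int) right.charAt(3));

        // The replacement character occupies three bytes in UTF-8.
        int replacementLength = "\uFFFD".getBytes(StandardCharsets.UTF_8).length;
        System.out.println("U+FFFD as UTF-8 takes " + replacementLength + " bytes");
    }
}
```

Once the ö has been replaced by U+FFFD, the original byte is gone, so the damage has to be avoided at read time rather than repaired afterwards.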

In the characters() method:

When you construct the new String object, first convert your char[] into a byte[], then invoke the constructor 'new String(byte[], String charsetName)' instead of the default 'new String(char[])'.

If you need more help, try: http://www.exampledepot.com/egs/java.nio.charset/ConvertChar.html

You are fishing in murky waters; many things are misleading. As @JBNizet indicated: a Reader reads text in some encoding, having already converted the bytes of an underlying InputStream. If you do not specify the encoding, the platform encoding is used.

    reader.parse(new InputSource(new FileInputStream(new File(FILE))));

This works regardless of which encoding the XML declaration names, because the parser decodes the bytes itself.
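As a rough, self-contained sketch of that point: the example below feeds the parser raw bytes from memory instead of a file, so it runs without latin1.xml. The class name Latin1SaxDemo and the method parseText are invented for this illustration, and the standalone attribute from the question's file is left out for brevity.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class Latin1SaxDemo {
    // Hypothetical helper: parse a document from raw bytes and return its character data.
    static String parseText(byte[] xmlBytes) throws Exception {
        StringBuilder text = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                // characters() may be called more than once per text node, so accumulate.
                text.append(ch, start, length);
            }
        };
        // A byte stream carries no charset of its own, so the parser
        // takes the encoding from the XML declaration.
        InputSource source = new InputSource(new ByteArrayInputStream(xmlBytes));
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return text.toString();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version='1.0' encoding='ISO-8859-1'?>\n<x>Mot\u00f6rhead</x>";
        // Encode as Latin-1 so that ö becomes the single byte 0xf6, as in the hexdump.
        byte[] latin1Bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        String text = parseText(latin1Bytes);
        System.out.println("matches expected text: " + "Mot\u00f6rhead".equals(text));
    }
}
```

The handler receives ordinary Java chars, so no further conversion is needed inside characters(); new String(ch, start, length) or a StringBuilder is enough.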

The encoding the Java compiler assumes for the source file must match the encoding the editor saved it in; otherwise the string literal would go wrong.
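One way to take the source encoding out of the picture, sketched here with a made-up class name LiteralDemo, is to write the literal with a Unicode escape; the other is to tell the compiler the encoding explicitly, for example with javac -encoding UTF-8.

```java
public class LiteralDemo {
    // Written with a Unicode escape, so the value does not depend on
    // how the source file was saved or which encoding javac assumes.
    static final String EXPECTED = "Mot\u00f6rhead";

    public static void main(String[] args) {
        System.out.println("length: " + EXPECTED.length());
        System.out.printf("char 3: U+%04X%n", (int) EXPECTED.charAt(3));
    }
}
```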

System.out.println can misrepresent the string too, depending on the encoding the console uses.
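To check what a String really contains without trusting the console, one option is to print its code points as ASCII text. The sketch below does that, and also shows wrapping System.out in a PrintStream with an explicit encoding; whether UTF-8 is the right choice for a given terminal is an assumption. ConsoleDemo and codePoints are names invented for this example.

```java
import java.io.PrintStream;
import java.io.UnsupportedEncodingException;

public class ConsoleDemo {
    // Hypothetical helper: render each code point as U+XXXX, separated by spaces.
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        String value = "Mot\u00f6rhead";

        // Pure ASCII output, so it survives any console encoding.
        System.out.println(codePoints(value));

        // Assumption: the terminal expects UTF-8.
        PrintStream utf8Out = new PrintStream(System.out, true, "UTF-8");
        utf8Out.println(value);
    }
}
```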

Furthermore, as far as printable characters go, "ISO-8859-1" is a subset of Windows Latin-1, "Windows-1252". If you ever encounter problems with special characters, try "Windows-1252" (in Java one can use "Cp1252").
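The difference between the two charsets can be seen with a small sketch. The two encodings agree on ö (byte f6) and differ only in the range 0x80 to 0x9f, where ISO-8859-1 has control characters and Windows-1252 has printable ones such as the euro sign. It assumes the running JVM provides the windows-1252 charset, which is not one of the charsets every Java platform is required to support; Cp1252Demo and decodeSingleByte are illustrative names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Cp1252Demo {
    // Hypothetical helper: decode one byte and return the resulting code point.
    static int decodeSingleByte(int b, Charset charset) {
        return new String(new byte[] {(byte) b}, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        // Assumption: this JVM ships the windows-1252 charset.
        Charset cp1252 = Charset.forName("windows-1252");
        Charset latin1 = StandardCharsets.ISO_8859_1;

        // Byte 0xf6 is ö in both encodings.
        System.out.printf("0xf6: latin1 U+%04X, cp1252 U+%04X%n",
                decodeSingleByte(0xf6, latin1), decodeSingleByte(0xf6, cp1252));

        // Byte 0x80 is a control character in ISO-8859-1 but the euro sign in Windows-1252.
        System.out.printf("0x80: latin1 U+%04X, cp1252 U+%04X%n",
                decodeSingleByte(0x80, latin1), decodeSingleByte(0x80, cp1252));
    }
}
```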
