Decoding Unicode characters in an ISO-8859-1 encoded XML document

Question

Using javax.xml.transform I created this ISO-8859-1 document, which contains two &#-encoded characters, 쎼 and 쎶:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xml>&#50108; and &#50102;</xml>

Question: how will a standards-compliant XML reader interpret the 쎼 and 쎶?

just as the plain &#... strings (not converted back to 쎼 and 쎶), or
as 쎼 and 쎶?

Code to generate the XML:

public void testInvalidCharacter() {
    try {
        String str = "\uC3BC and \uC3B6"; // 쎼 and 쎶
        System.out.println(str);

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("xml");
        root.setTextContent(str);
        doc.appendChild(root);

        DOMSource domSource = new DOMSource(doc);

        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.ISO_8859_1.name());

        StringWriter out = new StringWriter();
        transformer.transform(domSource, new StreamResult(out));

        System.out.println(out.toString());

    } catch (ParserConfigurationException | DOMException | IllegalArgumentException | TransformerException e) {
        e.printStackTrace(System.err);
    }
}

Answer 1

An XML parser will recognize the '&#...;' character reference syntax and return 쎼 and 쎶 through its API as the text of the element. E.g. in Java, the org.w3c.dom.Element.getTextContent() method for the Element with the tag name 'xml' will return a String containing those Unicode characters, even though your XML document itself is encoded as ISO-8859-1.
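To check this yourself, here is a minimal sketch that parses the document from the question with the JDK's built-in DOM parser and inspects the element text. The class name CharRefDemo and the helper parseText() are made up for illustration; the XML string is copied from the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of the original question.
class CharRefDemo {

    // Parses the ISO-8859-1 document from the question and returns
    // the text content of its root element.
    static String parseText() throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>"
                + "<xml>&#50108; and &#50102;</xml>";
        // The document is pure ASCII, so encoding it as ISO-8859-1 is lossless.
        byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);

        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = parseText();
        // The character references are resolved by the parser:
        // 50108 == 0xC3BC and 50102 == 0xC3B6.
        System.out.println(text.equals("\uC3BC and \uC3B6"));
        System.out.println(text.length());
    }
}
```

The encoding declaration only describes how the bytes of the file map to characters. Numeric character references always refer to Unicode code points, so they are resolved the same way whatever encoding the document declares.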

Answer 1 accepted 2016-06-01 08:41:19
