Decoding of Unicode characters in an ISO-8859-1 encoded XML document

Using javax.xml.transform, I created this ISO-8859-1 document, which contains two &#-encoded characters, 쎼 and 쎶:

<?xml version="1.0" encoding="ISO-8859-1"?>
<xml>&#50108; and &#50102;</xml>

Question: how will a standards-compliant XML reader interpret the &#50108; and &#50102; references?

  • as the plain &#... strings (not converted back to 쎼 and 쎶), or
  • as the characters 쎼 and 쎶?

Code to generate the XML:

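import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public void testInvalidCharacter() {
    try {
        String str = "\uC3BC and \uC3B6"; // 쎼 and 쎶
        System.out.println(str);

        // Build a one-element DOM whose text holds the two characters.
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.newDocument();
        Element root = doc.createElement("xml");
        root.setTextContent(str);
        doc.appendChild(root);

        DOMSource domSource = new DOMSource(doc);

        // Request ISO-8859-1 output; neither character exists in that charset.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.ISO_8859_1.name());

        StringWriter out = new StringWriter();
        transformer.transform(domSource, new StreamResult(out));

        // Prints the serialized document shown above.
        System.out.println(out.toString());

    } catch (ParserConfigurationException | DOMException | IllegalArgumentException | TransformerException e) {
        e.printStackTrace(System.err);
    }
}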
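
The numbers in the generated document are the decimal code points of the two characters (0xC3BC = 50108, 0xC3B6 = 50102). ISO-8859-1 only covers U+0000 to U+00FF, so the serializer has to fall back to character references. The sketch below imitates that observable behavior. It is not the transformer's actual implementation, the class and the helper escapeForLatin1 are hypothetical names, and it deliberately ignores the escaping of & and <.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Latin1EscapeSketch {

    // Hypothetical helper: keeps every code point that ISO-8859-1 can encode
    // and replaces the rest with a decimal numeric character reference.
    static String escapeForLatin1(String text) {
        CharsetEncoder encoder = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                sb.append(ch);
            } else {
                sb.append("&#").append(cp).append(';');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: &#50108; and &#50102;
        System.out.println(escapeForLatin1("\uC3BC and \uC3B6"));
    }
}
```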

An XML parser will recognize the &#... escape syntax and return 쎼 and 쎶 through its API as the text of the element. A numeric character reference always names a Unicode code point, regardless of the document's declared encoding; the encoding declaration only governs how the bytes of the file are decoded. For example, in Java, calling org.w3c.dom.Element.getTextContent() on the element with the tag name 'xml' returns a String containing those Unicode characters, even though the XML document itself is encoded in ISO-8859-1.
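
A minimal way to check this is to parse the document from the question as ISO-8859-1 bytes and look at the root element's text. The class name CharRefParseSketch and the helper readRootText are made up for this illustration, and the sketch assumes the JDK's default DocumentBuilderFactory.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;

class CharRefParseSketch {

    // Hypothetical helper: parses the XML from ISO-8859-1 bytes and
    // returns the text content of the root element.
    static String readRootText(String xml) {
        try {
            // The sample document is pure ASCII, so this conversion is lossless.
            byte[] bytes = xml.getBytes(StandardCharsets.ISO_8859_1);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n"
                + "<xml>&#50108; and &#50102;</xml>";
        String text = readRootText(xml);
        System.out.println(text.equals("\uC3BC and \uC3B6"));     // true
        System.out.println(Integer.toHexString(text.charAt(0)));  // c3bc
    }
}
```

The comparison uses \u escapes rather than printing the characters directly, so the result does not depend on the console's encoding.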

