[英]Decoding of Unicode characters in a ISO-8859-1 encoded XML document
我使用javax.xml.transform創建了這個ISO-8859-1文檔,其中包含兩個&#編碼的字符쎼
和쎶
:
<?xml version="1.0" encoding="ISO-8859-1"?>
<xml>쎼 and 쎶</xml>
問題:符合標准的XML閱讀器將如何解釋쎼和쎶,
쎼
和쎶
) 쎼
和쎶
生成XML的代碼:
public void testInvalidCharacter() {
try {
String str = "\uC3BC and \uC3B6"; // 쎼 and 쎶
System.out.println(str);
DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
Document doc = builder.newDocument();
Element root = doc.createElement("xml");
root.setTextContent(str);
doc.appendChild(root);
DOMSource domSource = new DOMSource(doc);
Transformer transformer = TransformerFactory.newInstance().newTransformer();
transformer.setOutputProperty(OutputKeys.ENCODING, StandardCharsets.ISO_8859_1.name());
StringWriter out = new StringWriter();
transformer.transform(domSource, new StreamResult(out));
System.out.println(out.toString());
} catch (ParserConfigurationException | DOMException | IllegalArgumentException | TransformerException e) {
e.printStackTrace(System.err);
}
}
XML解析器將識別“&#...”轉義語法,並正確返回쎼和쎶及其API來表示元素的文本。 例如,在Java中,標簽為“ xml”的Element的org.w3c.dom.Element.getTextContent()方法將返回帶有該Unicode字符的String,盡管您的XML文檔本身是ISO-8859-1
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.