Java XML serialization error: Invalid UTF-16 Surrogate detected
I have an org.w3c.dom.Document and want to serialize it with this function, but I get a SAXException. How can I fix this?
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public static String serializeXmlDocument(Document document) throws Exception
{
    // set up a transformer
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer trans = transformerFactory.newTransformer();
    trans.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    trans.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    trans.setOutputProperty(OutputKeys.INDENT, "yes");
    DOMSource source = new DOMSource(document);
    // create string from xml tree
    StringWriter stringWriter = new StringWriter();
    StreamResult stringResult = new StreamResult(stringWriter);
    trans.transform(source, stringResult);
    return stringWriter.toString();
}
This results in the following error:
2014-07-20 03:03:36,451 ERROR [XXX] XXX main job error:
javax.xml.transform.TransformerException: org.xml.sax.SAXException: E/A-Fehler
java.io.IOException: Ungültige UTF-16-Ersetzung festgestellt: d835 20 ?
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:758)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:359)
at mypackage.handler.XmlHandler.serializeXmlDocument(XmlHandler.java:226)
at mypackage.subpackage.buildSolrXml(MyJob.java:213)
at mypackage.subpackage.doJob(MyJob.java:113)
at mypackage.MyWorkstation.main(MyWorkstation.java:27)
Caused by: org.xml.sax.SAXException: E/A-Fehler
java.io.IOException: Ungültige UTF-16-Ersetzung festgestellt: d835 20 ?
at com.sun.org.apache.xml.internal.serializer.ToStream.cdata(ToStream.java:1290)
at com.sun.org.apache.xml.internal.serializer.ToStream.characters(ToStream.java:1395)
at com.sun.org.apache.xml.internal.serializer.ToUnknownStream.characters(ToUnknownStream.java:814)
at com.sun.org.apache.xml.internal.serializer.ToUnknownStream.characters(ToUnknownStream.java:348)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:122)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:230)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:230)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:230)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:136)
at com.sun.org.apache.xalan.internal.xsltc.trax.DOM2TO.parse(DOM2TO.java:98)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transformIdentity(TransformerImpl.java:702)
at com.sun.org.apache.xalan.internal.xsltc.trax.TransformerImpl.transform(TransformerImpl.java:746)
... 5 more
Caused by: java.io.IOException: Ungültige UTF-16-Ersetzung festgestellt: d835 20 ?
at com.sun.org.apache.xml.internal.serializer.ToStream.writeUTF16Surrogate(ToStream.java:973)
at com.sun.org.apache.xml.internal.serializer.ToStream.writeNormalizedChars(ToStream.java:1110)
at com.sun.org.apache.xml.internal.serializer.ToStream.cdata(ToStream.java:1267)
... 16 more
This is not always caused by invalid UTF-16 characters. If a multi-byte UTF-8/16/32 character crosses a 1024-byte boundary anywhere in the Stream, the Xalan XSLTC processor splits the character into two pieces, which produces two incorrect characters and (in most cases) the above error.
This is due to a Xalan bug (1024-byte buffers), which will be fixed in OpenJDK 12.
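To make the splitting concrete, here is a minimal sketch of what "splitting a character into two pieces" means at the Java `char` level. It is not the Xalan code; the class name, the helper `splitsSurrogatePair`, and the 1023-character filler are hypothetical, and the boundary is counted in chars rather than bytes for simplicity. The character used is U+1D703 (mathematical italic small theta), which appears to be the one in the sample file; its high surrogate is the `d835` seen in the error message.

```java
final class SurrogateSplitDemo {

    /** True if cutting the text at char index {@code cut} would separate a surrogate pair. */
    static boolean splitsSurrogatePair(CharSequence text, int cut) {
        return cut > 0
                && cut < text.length()
                && Character.isHighSurrogate(text.charAt(cut - 1))
                && Character.isLowSurrogate(text.charAt(cut));
    }

    public static void main(String[] args) {
        // U+1D703 lies outside the Basic Multilingual Plane, so Java stores it as two chars.
        String theta = new String(Character.toChars(0x1D703));
        System.out.printf("high=%04x low=%04x%n", (int) theta.charAt(0), (int) theta.charAt(1));
        // prints: high=d835 low=df03

        // Hypothetical layout: 1023 filler chars, then the pair straddling index 1024.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1023; i++) {
            sb.append('x');
        }
        sb.append(theta);

        int bufferSize = 1024; // assumed chunk size, taken from the bug description above
        System.out.println(splitsSurrogatePair(sb, bufferSize)); // prints: true
    }
}
```

A serializer that processes each chunk on its own would see a high surrogate with no partner at the end of the first chunk, which is consistent with the "invalid surrogate" message even though the document itself is valid.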
The simplest file that triggers this bug is:
<?xml version="1.0" ?><x>xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx𝜃</x>
Update (April 9, 2021): It looks like this was "fixed" in Java 8u251 or 8u222 and 11.0.7. However, while the error is avoided, it looks like the character in question is ignored by the parser.
The Document contained invalid Unicode characters such as U+D835, an unpaired surrogate:
http://www.fileformat.info/info/unicode/char/d835/index.htm
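To tell whether a string really contains an unpaired surrogate like this (as opposed to a valid pair that was split during serialization), a small scan can help. The following is a sketch only; the class and method names are hypothetical and not part of any library.

```java
final class LoneSurrogateFinder {

    /** Returns the index of the first unpaired surrogate in the text, or -1 if there is none. */
    static int indexOfLoneSurrogate(CharSequence text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                if (i + 1 < text.length() && Character.isLowSurrogate(text.charAt(i + 1))) {
                    i++; // well-formed pair: skip its low half
                } else {
                    return i; // high surrogate without a following low surrogate
                }
            } else if (Character.isLowSurrogate(c)) {
                return i; // low surrogate without a preceding high surrogate
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A high surrogate followed by a space, similar to the "d835 20" in the error message.
        System.out.println(indexOfLoneSurrogate("ab\uD835 c")); // prints: 2
    }
}
```

If this returns -1 for every text value in the document, the data is well-formed UTF-16 and invalid input is probably not the cause of the error.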
I fixed it with the solution from "removing invalid XML characters from a string in java":
// remove illegal unicode characters
String xml10pattern = "[^"
+ "\u0009\r\n"
+ "\u0020-\uD7FF"
+ "\uE000-\uFFFD"
+ "\ud800\udc00-\udbff\udfff"
+ "]";
stringValue = stringValue.replaceAll(xml10pattern, " ");
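If the invalid characters are already inside the DOM, the same replacement can be applied to the tree before serializing. The sketch below is one possible way to do that, not the original poster's code: the class `XmlTextSanitizer` and its methods are hypothetical, and the pattern is an equivalent spelling of the character class above using `\x{...}` code point escapes. Like the snippet above, it replaces each disallowed character with a space.

```java
import java.util.regex.Pattern;
import org.w3c.dom.CharacterData;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;

final class XmlTextSanitizer {

    // Anything outside the XML 1.0 character ranges, including unpaired surrogates.
    private static final Pattern INVALID_XML10 = Pattern.compile(
            "[^\\x{9}\\x{A}\\x{D}\\x{20}-\\x{D7FF}\\x{E000}-\\x{FFFD}\\x{10000}-\\x{10FFFF}]");

    /** Replaces every character that XML 1.0 does not allow with a single space. */
    static String sanitizeText(String value) {
        return INVALID_XML10.matcher(value).replaceAll(" ");
    }

    /** Sanitizes text, CDATA, comment, and attribute values in the subtree rooted at node. */
    static void sanitizeTree(Node node) {
        if (node instanceof CharacterData) {
            CharacterData data = (CharacterData) node;
            data.setData(sanitizeText(data.getData()));
        }
        NamedNodeMap attributes = node.getAttributes(); // null for non-element nodes
        if (attributes != null) {
            for (int i = 0; i < attributes.getLength(); i++) {
                Node attribute = attributes.item(i);
                attribute.setNodeValue(sanitizeText(attribute.getNodeValue()));
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            sanitizeTree(child);
        }
    }
}
```

Calling `XmlTextSanitizer.sanitizeTree(document)` before `serializeXmlDocument(document)` would clean the whole document in one pass. This only helps when the data really contains invalid characters; it does not address a valid character being split at a buffer boundary.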