How to encode UTF-8 for XML document in Java

I have a Java program that reads some data out of an Excel sheet and creates XML.

Long story short, I need the strings written into the XML to contain only valid XML characters, and any characters that need to be encoded must be encoded properly.

Question: How can I encode these characters in Java before writing to the file?

Thanks!

Note: These are characters such as “ (U+201C) and ” (U+201D), and other similar characters.

As I understand your question, you want to write XML in UTF-8. The standard way to write a UTF-8 file in Java is with an OutputStreamWriter:

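// Requires java.io.* and java.nio.charset.StandardCharsets.
File f = new File("test.xml");
// try-with-resources closes the writer, which also flushes the buffered text to disk.
try (BufferedWriter wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(f), StandardCharsets.UTF_8))) {
    wr.write("xml text here");
}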

UTF-8 is a variable-width encoding that can represent every character in the Unicode character set; see http://en.wikipedia.org/wiki/UTF-8#Description and http://en.wikipedia.org/wiki/Quotation_mark#Smart_quotes .

Further, in your case it seems you want to convert “ to " and are hoping the UTF-8 conversion will handle this (I might be wrong, but this is what I gathered from your response). Are you saying that the XLS has the " character but the XML has “? If so, that is a different problem from the one being discussed.

Edit: Just to clarify, as far as UTF-8 encoding is concerned I don't see any problem if the XLS has “ and the written XML has the same character.

The following is valid XML containing Unicode characters:

<?xml version="1.0" encoding="UTF-8"?>
<root>
<summary>This is a summary, text may contain &#x201C;Unicode&#x201D; characters</summary>
</root>
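
To check that the character references in the sample above really stand for the curly quotes, the document can be run through a parser. The following is only a sketch: the class name CharRefDemo and the method summaryText are made up for illustration, and it uses the JDK's built-in DOM parser rather than anything from the question.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper, not part of the original answer.
final class CharRefDemo {

    /** Parses the XML and returns the text of the first <summary> element. */
    static String summaryText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getElementsByTagName("summary").item(0).getTextContent();
        } catch (Exception e) {
            // Wrap parser and I/O failures so callers need no checked exceptions.
            throw new IllegalStateException("could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<root>\n"
                + "<summary>text may contain &#x201C;Unicode&#x201D; characters</summary>\n"
                + "</root>";
        // The parser resolves each &#x...; reference to the character it names.
        System.out.println(summaryText(xml));
    }
}
```

A document that uses the literal “ and ” characters in UTF-8 parses to the same text, so either form should be acceptable to an XML consumer.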

Open it in any browser. If all the characters are supported, the XML is rendered correctly; otherwise, for characters that are not XML-compliant, an error like the following is raised (at least in Chrome; it may depend on the browser):

CharRef: invalid decimal value

For the ranges of characters that are valid in XML, see: http://www.w3.org/TR/REC-xml/#charsets

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
/* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
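
The Char production above translates almost directly into Java. This is a minimal sketch, assuming you want to silently drop anything outside the allowed ranges; the class and method names (XmlCharFilter, isValidXmlChar, stripInvalidXmlChars) are hypothetical, and you may prefer to replace or report such characters instead of dropping them.

```java
// Hypothetical helper illustrating the XML 1.0 Char production.
final class XmlCharFilter {

    /** True if the code point falls in one of the ranges of the Char production. */
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns a copy of the input without the code points that Char excludes. */
    static String stripInvalidXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Iterate by code point so surrogate pairs are treated as one character.
            int cp = input.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The control character U+0001 is dropped; the curly quotes are kept.
        String cell = "\u201CSmart quotes\u201D\u0001 survive";
        System.out.println(stripInvalidXmlChars(cell));
    }
}
```

Note that the curly quotes from the question pass this check: they are valid XML characters, so the remaining concern for them is only the file encoding.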

For characters that are not suitable for use in XML, see: http://www.w3.org/TR/unicode-xml/#Charlist

Just as <, > and " are written as &lt;, &gt; and &quot; in XML, a Unicode character can be written as a numeric character reference of the form &#xNNNN;, where NNNN is the character's hexadecimal Unicode code point. See the sample XML above.

So when writing XML programmatically, you need to handle such characters explicitly: when you encounter one, convert it to its &#x form.
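
One way to do that conversion by hand is sketched below. The names XmlEscaper and escapeXml are hypothetical, and the choice to turn every non-ASCII code point into a reference is an assumption made for illustration; in a UTF-8 file only the markup characters strictly need escaping. The sketch also assumes the input has already been checked against the Char production, since it does not filter out control characters or lone surrogates.

```java
// Hypothetical helper, not a library API.
final class XmlEscaper {

    /** Escapes markup characters and writes non-ASCII code points as &#xNNNN; references. */
    static String escapeXml(String text) {
        StringBuilder out = new StringBuilder(text.length() + 16);
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            switch (cp) {
                case '&':  out.append("&amp;");  break;
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '"':  out.append("&quot;"); break;
                case '\'': out.append("&apos;"); break;
                default:
                    if (cp < 0x80) {
                        // Plain ASCII is copied through unchanged.
                        out.append((char) cp);
                    } else {
                        // Everything else becomes a hexadecimal character reference.
                        out.append("&#x")
                           .append(Integer.toHexString(cp).toUpperCase())
                           .append(';');
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeXml("\u201CA & B\u201D <ok>"));
    }
}
```

Because the result is pure ASCII, it reads back the same way whichever ASCII-compatible encoding the file ends up in.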

Whenever you read or write a file, be sure to specify the encoding, and use UTF-8. Be careful: all of these methods also exist in a variant without the encoding argument, and in that case the platform's default encoding is used.

For example, use

InputStreamReader myReader = new InputStreamReader(inputStream, "UTF-8");

instead of the constructor without a charset name.
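
To see the advice in action, the sketch below writes a string to a temporary file and reads it back, naming UTF-8 on both sides. The class and method names (Utf8RoundTrip, writeThenRead) and the temp-file prefix are made up for this example.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical demo class.
final class Utf8RoundTrip {

    /** Writes the text to a temp file as UTF-8, then reads it back as UTF-8. */
    static String writeThenRead(String text) {
        try {
            Path tmp = Files.createTempFile("utf8-demo", ".xml");
            try {
                // Explicit charset on the writing side.
                try (Writer w = new OutputStreamWriter(Files.newOutputStream(tmp), StandardCharsets.UTF_8)) {
                    w.write(text);
                }
                // The same explicit charset on the reading side.
                StringBuilder sb = new StringBuilder();
                try (Reader r = new InputStreamReader(Files.newInputStream(tmp), StandardCharsets.UTF_8)) {
                    char[] buf = new char[1024];
                    int n;
                    while ((n = r.read(buf)) != -1) {
                        sb.append(buf, 0, n);
                    }
                }
                return sb.toString();
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String original = "<summary>\u201CUnicode\u201D</summary>";
        System.out.println(original.equals(writeThenRead(original)));
    }
}
```

If either side fell back to a default encoding that is not UTF-8, the curly quotes would typically come back as different characters.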
