Convert UTF-8 to ISO-8859-1 with Numeric Character Reference

I receive XML from one third party encoded as UTF-8, and I need to send it to another third party encoded as ISO-8859-1. The XML contains text in many languages, e.g. Russian in Cyrillic. I know it is technically impossible to convert UTF-8 directly into ISO-8859-1. I found StringEscapeUtils.escapeXml(), but that method escapes the whole XML, including < and >, whereas I only want to convert the Cyrillic to numeric character references. Does such a method exist in Java, or does it always escape the whole XML? Is there another way to convert only the characters that can't be encoded in ISO-8859-1 into numeric character references?

I've seen similar questions on SO, such as "How do I convert between ISO-8859-1 and UTF-8 in Java?", but they don't mention numeric character references.

UPDATE: Removed unnecessary DOM loading.

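Use the XML transformer (javax.xml.transform.Transformer). When it serializes a document, it writes any character that the output encoding cannot represent as a numeric character reference and leaves the markup itself untouched.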

Example

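import java.io.File;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// A Transformer created without a stylesheet performs an identity transform
Transformer transformer = TransformerFactory.newInstance().newTransformer();

// Convert XML file to UTF-8 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
transformer.transform(new StreamSource(new File("test.xml")),
                      new StreamResult(new File("test-utf8.xml")));

// Convert XML file to ISO-8859-1 encoding
transformer.setOutputProperty(OutputKeys.ENCODING, "ISO-8859-1");
transformer.transform(new StreamSource(new File("test.xml")),
                      new StreamResult(new File("test-8859-1.xml")));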

test.xml (input, UTF-8)

<?xml version="1.0" encoding="UTF-8"?>
<test>
  <english>Hello World</english>
  <portuguese>Olá Mundo</portuguese>
  <czech>Ahoj světe</czech>
  <russian>Привет мир</russian>
  <chinese>你好，世界</chinese>
  <emoji>👋 🌎</emoji>
</test>

Translated by https://translate.google.com (except emoji)

test-utf8.xml (output, UTF-8)

<?xml version="1.0" encoding="UTF-8"?><test>
  <english>Hello World</english>
  <portuguese>Olá Mundo</portuguese>
  <czech>Ahoj světe</czech>
  <russian>Привет мир</russian>
  <chinese>你好，世界</chinese>
  <emoji>&#128075; &#127758;</emoji>
</test>

test-8859-1.xml (output, ISO-8859-1)

<?xml version="1.0" encoding="ISO-8859-1"?><test>
  <english>Hello World</english>
  <portuguese>Olá Mundo</portuguese>
  <czech>Ahoj sv&#283;te</czech>
  <russian>&#1055;&#1088;&#1080;&#1074;&#1077;&#1090; &#1084;&#1080;&#1088;</russian>
  <chinese>&#20320;&#22909;&#65292;&#19990;&#30028;</chinese>
  <emoji>&#128075; &#127758;</emoji>
</test>
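
To see which characters end up as references in the listing above, the rule can be sketched by hand: anything ISO-8859-1 cannot encode becomes a decimal numeric character reference, and everything else is copied through. This is only an illustration, not what the Transformer does internally. The class and method names are hypothetical, and the sketch assumes the input has no CDATA sections, comments, or non-Latin-1 element and attribute names, where a reference would not be interpreted.

```java
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

class Latin1NcrSketch {

    // Hypothetical helper (not part of the original answer): rewrites every
    // code point that ISO-8859-1 cannot encode as a decimal numeric character
    // reference and copies everything else, markup included, unchanged.
    public static String escapeUnencodable(String xml) {
        CharsetEncoder latin1 = StandardCharsets.ISO_8859_1.newEncoder();
        StringBuilder out = new StringBuilder(xml.length());
        for (int i = 0; i < xml.length(); ) {
            // Work on code points so that a surrogate pair (e.g. an emoji)
            // becomes one reference rather than two.
            int codePoint = xml.codePointAt(i);
            int width = Character.charCount(codePoint);
            if (width == 1 && latin1.canEncode((char) codePoint)) {
                out.append((char) codePoint);
            } else {
                out.append("&#").append(codePoint).append(';');
            }
            i += width;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Czech sample from test.xml; the one letter outside Latin-1
        // (e with caron, U+011B) is written as a Unicode escape.
        String czech = "<czech>Ahoj sv\u011Bte</czech>";
        System.out.println(escapeUnencodable(czech));
        // prints <czech>Ahoj sv&#283;te</czech>
    }
}
```

The Transformer is still the safer choice for real documents, because it also rewrites the encoding declaration and understands XML structure. With a manual approach like this one, the caller would have to update the declaration and write the bytes as ISO-8859-1 separately.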

If you replace test.xml with the test-8859-1.xml file (copy/paste/rename), you still get the same outputs, because the parser auto-detects the encoding and resolves the numeric character references back into the original characters.
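
The round trip can be checked in memory as well. The following is a minimal sketch that assumes the JDK's built-in Transformer and DOM parser. The class and method names are hypothetical, and byte arrays stand in for the files used in the example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.w3c.dom.Document;

class RoundTripSketch {

    // Identity transform: copies the XML through and re-serializes it in the
    // requested output encoding.
    public static byte[] reencode(byte[] xml, String encoding) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            transformer.transform(new StreamSource(new ByteArrayInputStream(xml)),
                                  new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the XML and returns the text content of its root element; the
    // parser resolves any numeric character references while parsing.
    public static String rootText(byte[] xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The Russian sample from test.xml, written with Unicode escapes.
        String source = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<russian>\u041F\u0440\u0438\u0432\u0435\u0442 \u043C\u0438\u0440</russian>";
        byte[] utf8 = source.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = reencode(utf8, "ISO-8859-1");

        // Show the re-encoded document, then confirm the text survives parsing.
        System.out.println(new String(latin1, StandardCharsets.ISO_8859_1));
        System.out.println(rootText(latin1).equals(rootText(utf8)));
    }
}
```

The comparison in main checks the parsed text rather than the raw bytes, since the exact serialized form (for instance, whether a line break follows the XML declaration) can vary between Transformer implementations.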
