
Get raw binary data from XML attribute's value parsed with SAX in Java

I am parsing an XML document whose attribute values hold text strings taken from various input text files, with no information about their encoding. The XML document itself is generated with a specific encoding, but the text strings are written into it as binary data, with nothing to indicate their original encoding. Bytes with a value above 127 are escaped as character references:

<?xml version="1.0" encoding="ISO-8859-2" ?>
<Root>
  <Value val="&quot;&#xb5;&#xe0;&quot;"/>
</Root>

The whole XML document is encoded in ISO-8859-2, and the value of the attribute val of the element Value is:

"µà"

It was originally encoded in ISO-8859-1, and its byte representation according to the PSPad hex viewer is:

22 B5 E0 22

which can also be interpreted as ISO-8859-2:

"ľŕ"

The problem is that I want to read the value as ISO-8859-2, but the SAX parser offers no way to obtain the non-normalized value. The attribute value is only available as a String instance, which already represents the text as:

"µà"

I tried to persuade the parser to parse the XML as ISO-8859-2, but nothing changed:

XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
MyHandler handler = new MyHandler(); // implementation of DefaultHandler
parser.setContentHandler(handler);
parser.setEntityResolver(handler);
InputStream instream = new FileInputStream("myFile.xml");
InputSource is = new InputSource(instream);
// Force the byte stream to be decoded as ISO-8859-2
is.setEncoding("ISO-8859-2");
parser.parse(is);
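
For context, a numeric character reference such as &#xb5; always names a Unicode code point, whatever encoding the document declares, which would explain why changing the encoding has no effect on the attribute value. The following is a minimal sketch of that behaviour, assuming the JDK's built-in SAX parser rather than the Xerces class named above; the class CharRefDemo and its methods parseVal and hex are hypothetical names used only for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class CharRefDemo {

    // Parses the sample document with the given declared encoding
    // and returns the value of the "val" attribute as SAX reports it.
    static String parseVal(String declaredEncoding) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"" + declaredEncoding + "\" ?>"
                + "<Root><Value val=\"&quot;&#xb5;&#xe0;&quot;\"/></Root>";
        final String[] result = new String[1];
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                if ("Value".equals(qName)) {
                    result[0] = attributes.getValue("val");
                }
            }
        };
        // The sample contains only ASCII bytes, so it is valid in both encodings.
        InputSource source = new InputSource(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.US_ASCII)));
        source.setEncoding(declaredEncoding);
        SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
        return result[0];
    }

    // Formats each UTF-16 code unit of the string as four hex digits.
    static String hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            sb.append(String.format("%04x ", (int) c));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(hex(parseVal("ISO-8859-1"))); // 0022 00b5 00e0 0022
        System.out.println(hex(parseVal("ISO-8859-2"))); // 0022 00b5 00e0 0022
    }
}
```

Both calls report the same code points, because the references are resolved after the byte stream has been decoded.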

I then tried treating the String as UTF-16, obtaining its bytes, and decoding those bytes to create the desired value:

String val = attributes.getValue("val");
// No try/catch is needed: getBytes(Charset) declares no checked exception
byte[] bytes = val.getBytes(StandardCharsets.UTF_16);
ByteBuffer inputBuffer = ByteBuffer.wrap(bytes);
CharBuffer chData = Charset.forName("ISO-8859-2").decode(inputBuffer);

but what I get is:

 ţ˙ " ľ ŕ "

or, as bytes:

 [-2, -1, 0, 34, 0, -75, 0, -32, 0, 34]
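
The extra bytes in that array appear to come from the UTF-16 encoding itself: Java's UTF-16 charset writes a byte-order mark (FE FF) and then two bytes per character, so the zero high bytes end up in the decoded text. A small sketch comparing it with ISO-8859-1, which writes one byte per character for code points up to U+00FF; the class Utf16BytesDemo and its method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16BytesDemo {

    // The attribute value as delivered by SAX: " U+00B5 U+00E0 "
    static final String VAL = "\"\u00b5\u00e0\"";

    // UTF-16: byte-order mark followed by two bytes per character
    static byte[] utf16() {
        return VAL.getBytes(StandardCharsets.UTF_16);
    }

    // ISO-8859-1: one byte per character, equal to the code point
    static byte[] latin1() {
        return VAL.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(utf16()));  // [-2, -1, 0, 34, 0, -75, 0, -32, 0, 34]
        System.out.println(Arrays.toString(latin1())); // [34, -75, -32, 34]
    }
}
```

The second array matches the bytes 22 B5 E0 22 shown by the hex viewer.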

I am not sure whether this is the right way to obtain the original binary representation of the text value.

Thank you for your advice.

The problem is not related to SAX; it is just a question of how to convert a byte array to a string using ISO-8859-2. You can follow "How to convert Strings to and from UTF8 byte arrays in Java": convert the attribute string to a byte array using one encoding (ISO-8859-1), then convert those bytes back to a string using another (ISO-8859-2). This works because ISO-8859-1 maps the code points U+0000 to U+00FF one-to-one onto the byte values 00 to FF, so encoding with it recovers the original bytes.

String s = "\"µà\"";
System.out.println(s);
byte[] iso8859_1_bytes = s.getBytes(Charset.forName("ISO-8859-1"));
System.out.println(Arrays.toString(iso8859_1_bytes));
String conv = new String(iso8859_1_bytes, Charset.forName("ISO-8859-2"));
System.out.println(conv);

This will generate the following output:

"µà"
[34, -75, -32, 34]
"ľŕ"

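Applied inside a SAX handler, the same conversion might look like the sketch below. It assumes the JDK's built-in SAX parser and that every character in the attribute lies in the range U+0000 to U+00FF; anything outside that range cannot be encoded as ISO-8859-1 and would be replaced with '?'. The class ReinterpretingHandler and the helper reinterpret are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ReinterpretingHandler extends DefaultHandler {

    private static final Charset TARGET = Charset.forName("ISO-8859-2");

    // Collected "val" attribute values after reinterpretation
    final List<String> values = new ArrayList<>();

    // Recovers the original bytes via ISO-8859-1, then decodes them with the target charset.
    static String reinterpret(String value, Charset target) {
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, target);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes attributes) {
        String val = attributes.getValue("val");
        if (val != null) {
            values.add(reinterpret(val, TARGET));
        }
    }

    public static void main(String[] args) throws Exception {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\" ?>"
                + "<Root><Value val=\"&quot;&#xb5;&#xe0;&quot;\"/></Root>";
        ReinterpretingHandler handler = new ReinterpretingHandler();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.US_ASCII)), handler);
        // Print code points in hex to avoid depending on the console encoding
        for (char c : handler.values.get(0).toCharArray()) {
            System.out.printf("%04x ", (int) c); // 0022 013e 0155 0022
        }
        System.out.println();
    }
}
```

U+013E and U+0155 are the characters ľ and ŕ, matching the output above.
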