
Get raw binary data from XML attribute's value parsed with SAX in Java

I am parsing an XML document whose attribute values hold text strings taken from various input text files of unknown encoding. The XML document itself is generated with a specific encoding, but the strings are written into it as binary data, with no record of their original encoding. Characters with a byte value above 127 are escaped:

<?xml version="1.0" encoding="ISO-8859-2" ?>
<Root>
  <Value val="&quot;&#xb5;&#xe0;&quot;"/>
</Root>

The whole XML document is encoded in ISO-8859-2, and the value of the attribute val of the element Value is:

"µà"

which was originally encoded in ISO-8859-1. According to the PSPad hex viewer, its byte representation is:

22 B5 E0 22

The same bytes, read as ISO-8859-2, give:

"ľŕ"

The problem is that I want to read the value as ISO-8859-2, but the SAX parser offers no way to obtain the non-normalized value. The attribute value is only available as a String instance, which already represents the text as:

"µà"

I tried to persuade the parser to parse the XML as ISO-8859-2, but nothing changed:

XMLReader parser = XMLReaderFactory.createXMLReader("org.apache.xerces.parsers.SAXParser");
MyHandler handler= new MyHandler(); // implementation of DefaultHandler
parser.setContentHandler(handler);
parser.setEntityResolver(handler);
InputStream instream = new FileInputStream("myFile.xml");
InputSource is = new InputSource(instream);
is.setEncoding("ISO-8859-2");
parser.parse(is);

I also tried treating the String as UTF-16, getting its bytes, and decoding those bytes to build the desired value:

String val = attributes.getValue("val");
// Encode the String as UTF-16: this emits a byte-order mark plus two bytes per char.
// getBytes(Charset) declares no checked exception, so no try/catch is needed.
byte[] bytes = val.getBytes(StandardCharsets.UTF_16);
ByteBuffer inputBuffer = ByteBuffer.wrap(bytes);
// Decode those bytes as ISO-8859-2.
CharBuffer chData = Charset.forName("ISO-8859-2").decode(inputBuffer);

but what I get is:

 ţ˙ " ľ ŕ "

or, as bytes:

 [-2, -1, 0, 34, 0, -75, 0, -32, 0, 34]
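For readers following along, the shape of this array can be explained: the leading -2, -1 pair is the UTF-16 byte-order mark (0xFE 0xFF), and each 0 is the high byte of a big-endian 16-bit code unit. A minimal sketch comparing the two encodings of the same string follows; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf16VersusLatin1 {

    // UTF-16 encoding: byte-order mark first, then two bytes per char.
    static byte[] utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16);
    }

    // ISO-8859-1 encoding: one byte per char for code points 0..255.
    static byte[] latin1Bytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The attribute value as the parser delivers it: " U+00B5 U+00E0 "
        String val = "\"\u00b5\u00e0\"";
        System.out.println(Arrays.toString(utf16Bytes(val)));  // [-2, -1, 0, 34, 0, -75, 0, -32, 0, 34]
        System.out.println(Arrays.toString(latin1Bytes(val))); // [34, -75, -32, 34]
    }
}
```

Only the second array matches the 22 B5 E0 22 bytes shown by the hex viewer.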

I am not sure whether this is the right way to obtain the original binary representation of the text value.

Thank you for your advice.

The problem is not related to SAX; it is just a matter of converting a byte array to a string using ISO-8859-2. Following "How to convert Strings to and from UTF8 byte arrays in Java", you can convert the attribute's string to a byte array using one encoding (ISO-8859-1) and then convert it back to a string using another (ISO-8859-2).

String s = "\"µà\"";
System.out.println(s);
byte[] iso8859_1_bytes = s.getBytes(Charset.forName("ISO-8859-1"));
System.out.println(Arrays.toString(iso8859_1_bytes));
String conv = new String(iso8859_1_bytes, Charset.forName("ISO-8859-2"));
System.out.println(conv);

This generates the following output:

"µà"
[34, -75, -32, 34]
"ľŕ"
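The same conversion can be applied where the attribute is read. The following is a minimal sketch, not a definitive implementation: it assumes the JDK's built-in SAX parser obtained through SAXParserFactory rather than the Xerces class named in the question, and the class name `ReinterpretAttribute` and the helpers `reinterpret` and `parseValues` are hypothetical. It also assumes every escaped character lies in the range 0..255; anything above U+00FF cannot be encoded as ISO-8859-1 and would come out as '?'.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class ReinterpretAttribute {

    // Turn code points 0..255 back into single bytes, then decode with the target charset.
    static String reinterpret(String value, Charset target) {
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, target);
    }

    // Parse the XML and collect every "val" attribute, re-decoded as ISO-8859-2.
    static List<String> parseValues(String xml) {
        List<String> result = new ArrayList<>();
        Charset latin2 = Charset.forName("ISO-8859-2");
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                String val = attributes.getValue("val");
                if (val != null) {
                    result.add(reinterpret(val, latin2));
                }
            }
        };
        try {
            // The sample document is pure ASCII because all high characters are escaped.
            byte[] bytes = xml.getBytes(StandardCharsets.US_ASCII);
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new ByteArrayInputStream(bytes), handler);
        } catch (Exception e) {
            throw new IllegalStateException("Parsing failed", e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"ISO-8859-2\" ?>\n"
                + "<Root>\n"
                + "  <Value val=\"&quot;&#xb5;&#xe0;&quot;\"/>\n"
                + "</Root>\n";
        System.out.println(parseValues(xml));
    }
}
```

The round trip works because a numeric character reference such as `&#xb5;` always names a Unicode code point, whatever encoding the document declares, and ISO-8859-1 maps code points 0..255 one-to-one onto byte values.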
