Java XML parsing: incorrect string value for data read with VTD-XML

Question

I am parsing a UTF-8 encoded XML document in Java using VTD-XML.

A small excerpt looks like:

<literal>𠀋</literal>
<literal>𠂉</literal>
<literal>𠂢</literal>

I want to iterate through each literal and print it to the console. However, what I get is:

¢

I am correctly navigating to each element. I get the text value by calling:

private static String toNormalizedString(String name, int val, final VTDNav vn) throws NavException {
    String strValue = null;
    if (val != -1) {
        strValue = vn.toNormalizedString(val);
    }
    return strValue;
}

I've also tried vn.getXPathStringVal(), but it yields the same results.

I know that each of the literals above isn't just a string of length one. Rather, they seem to be Unicode "characters" composed of two Java characters. I am able to correctly parse and output the kanji characters if their length is just one.
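To make the "two characters" observation concrete, here is a minimal sketch using only the JDK. It builds the first literal from its code point (U+2000B, which lies outside the Basic Multilingual Plane) and compares the UTF-16 length with the code point count. The class and method names are illustrative, not part of the original code.

```java
public class SurrogateDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, i.e. what a reader sees as characters.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Built from the code point so the demo does not depend on source-file encoding.
        String literal = new String(Character.toChars(0x2000B));

        System.out.println(codeUnits(literal));   // 2: a high and a low surrogate
        System.out.println(codePoints(literal));  // 1: a single supplementary character
        System.out.println(Character.isHighSurrogate(literal.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(literal.charAt(1)));  // true
    }
}
```

Any code that handles such a string one `char` at a time, or that stores a code point in a single `char`, will lose information for these literals.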

My question is: how can I correctly parse and output these characters using VTD-XML? Is there a way to get the underlying bytes of the text between the literal tags so that I can parse the bytes myself?

EDIT

Code to process each line of the XML, converting it to a byte array and then back to a String:

try (BufferedReader br = new BufferedReader(new FileReader("res/sample.xml"))) {
    String line;
    while ((line = br.readLine()) != null) {
        byte[] myBytes = null;

        try {
            myBytes = line.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            e.printStackTrace();
            System.exit(-1);
        }

        System.out.println(new String(myBytes));
    }
} catch (FileNotFoundException e) {
    e.printStackTrace();
} catch (IOException e) {
    e.printStackTrace();
}
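For comparison, a variant of this round trip that names UTF-8 explicitly at every step may help separate a parser problem from a platform-charset problem, since `FileReader` and `new String(byte[])` both fall back to the platform default charset. This is only a sketch: the class name, the helper, and the temporary file are made up for illustration, and a console that cannot render the glyphs will still show placeholders even when the strings are intact.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class Utf8RoundTrip {
    // Reads each line as UTF-8, converts it to bytes and back, and collects the results.
    static List<String> roundTripLines(Path file) throws IOException {
        List<String> out = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = br.readLine()) != null) {
                byte[] bytes = line.getBytes(StandardCharsets.UTF_8);
                out.add(new String(bytes, StandardCharsets.UTF_8));
            }
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for res/sample.xml, containing U+2000B.
        Path file = Files.createTempFile("sample", ".xml");
        String literal = new String(Character.toChars(0x2000B));
        Files.write(file, ("<literal>" + literal + "</literal>\n").getBytes(StandardCharsets.UTF_8));

        for (String line : roundTripLines(file)) {
            // Print code points rather than glyphs so the check does not depend on the console.
            line.codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
            System.out.println();
        }
        Files.delete(file);
    }
}
```

Printing code points instead of the characters themselves avoids relying on the console encoding when checking whether the supplementary character survived.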

Answer 1

You are probably trying to get a string containing characters whose code points are at or above 0x10000 (supplementary characters, outside the BMP). That bug is known and is being addressed. I will notify you once the fix is out. This question may be identical to this one: "Map supplementary Unicode characters to BMP (if possible)".
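Until a fix is available, one possible workaround is to decode the raw UTF-8 bytes of the text token yourself, as the question suggests. The sketch below is self-contained and uses only the JDK. It assumes the document bytes, the token's byte offset, and its byte length are obtained from VTD-XML; as I understand the API, that would be `vn.getXML().getBytes()`, `vn.getTokenOffset(val)`, and `vn.getTokenLength(val)`, but treat those calls as assumptions to verify against the VTD-XML javadoc. Unlike `toNormalizedString`, this approach does not resolve entity references or normalize whitespace.

```java
import java.nio.charset.StandardCharsets;

public class RawTokenDecode {
    // Decodes one token's bytes as UTF-8; the JDK decoder handles supplementary characters.
    static String decodeUtf8(byte[] doc, int offset, int length) {
        return new String(doc, offset, length, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Stand-in for the parsed document: one element containing U+200A2.
        String literal = new String(Character.toChars(0x200A2));
        byte[] doc = ("<literal>" + literal + "</literal>").getBytes(StandardCharsets.UTF_8);

        // With VTD-XML these two values would come from the text token (assumption, see above).
        int offset = "<literal>".length(); // the prefix is ASCII, so chars and bytes coincide
        int length = 4;                    // U+200A2 occupies four bytes in UTF-8

        String text = decodeUtf8(doc, offset, length);
        System.out.printf("U+%X%n", text.codePointAt(0)); // U+200A2

        // Narrowing the code point to a 16-bit char drops the high bits.
        char narrowed = (char) 0x200A2;
        System.out.printf("U+%04X%n", (int) narrowed);    // U+00A2, the cent sign
    }
}
```

The last two lines illustrate one hedged reading of the output in the question: U+200A2 narrowed to 16 bits is U+00A2 (¢), while U+2000B and U+20089 narrow to non-printing control characters. That matches the single ¢ that was printed, though whether this is the library's actual failure mode is an inference, not something the answer confirms.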

Answer 1: score 2, accepted, 2017-07-05 21:57:29
