Parsing a pretty-printed XML string produces strange results

Question

In my application, I use LSSerializer to convert an XML document into a pretty-printed string:

public static String convertDocumentToString(Document doc) {
    DOMImplementationLS domImplementation = (DOMImplementationLS) doc.getImplementation();
    LSSerializer        lsSerializer      = domImplementation.createLSSerializer();
    lsSerializer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE); // Set this to true if the output needs to be beautified.
    return lsSerializer.writeToString(doc);   
}

On page 1, I have the following pretty-printed XML string:

<result>
    <category catKey="school_level">
        <category catKey="primary">
            <category catKey="primary_1">
                <category catKey="math_primary_1"/>
                <category catKey="chinese_primary_1"/>
            </category>
            <category catKey="primary_2"/>
            <category catKey="primary_3"/>
        </category>
        <category catKey="jc"/>
    </category>
</result>

I use the following method to parse the string above:

public static Document parseXml(String xml)
        throws ParserConfigurationException, IOException, SAXException {
    DocumentBuilderFactory docFactory = DocumentBuilderFactory.newInstance();
    docFactory.setNamespaceAware(false);
    docFactory.setValidating(false);
    docFactory.setFeature("http://xml.org/sax/features/namespaces", false);
    docFactory.setFeature("http://xml.org/sax/features/validation", false);
    docFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-dtd-grammar", false);
    docFactory.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false);

    DocumentBuilder docBuilder = docFactory.newDocumentBuilder();
    Document               doc = docBuilder.parse(new InputSource(new StringReader(xml)));
    return doc;
}

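Here is my test function:

public void test() throws ParserConfigurationException, IOException, SAXException {
    // "pretty-print-XML-string" stands in for the pretty-printed XML shown above.
    Document doc = Test.parseXml("pretty-print-XML-string");

    // Iterate over the direct children of the root <result> element.
    NodeList childList = doc.getDocumentElement().getChildNodes();
    for (int j = 0 ; j < childList.getLength() ; j++) {
        System.out.println("TEST: " + childList.item(j));
    }
}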

I expected to see only one category child node. However, the console shows the following lines:

INFO:   TEST 2: [#text: 
    ]
INFO:   TEST 2: [category: null]
INFO:   TEST 2: [#text: 
    ]
INFO:   TEST 2: [#text: 
]

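If I remove the line lsSerializer.getDomConfig().setParameter("format-pretty-print", Boolean.TRUE); from the convertDocumentToString function, all of those [#text: ] nodes no longer appear.

I would appreciate it if someone could explain why the parsed document contains these [#text: ] nodes. Also, please give me some advice on how to parse a pretty-printed XML string.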

Answer 1

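To pretty-print the document, the serializer adds new lines and spaces to the content you gave it.

When you parse the pretty-printed XML, those new lines and spaces come back as additional text nodes.

If I remember correctly, you can tell DocumentBuilderFactory to ignore whitespace nodes.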
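
A rough sketch of that idea, under stated assumptions: DocumentBuilderFactory offers setIgnoringElementContentWhitespace(true), but its Javadoc notes that it relies on the content model, so the parser has to be in validating mode and the document needs a DTD. The inline DTD below is an assumption added purely for illustration (the XML in the question has none), and the class and method names are hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class IgnoreWhitespaceSketch {

    // Hypothetical sample: a shortened version of the question's XML,
    // prefixed with an inline DTD that declares element-only content.
    static final String XML =
          "<!DOCTYPE result [\n"
        + "  <!ELEMENT result (category*)>\n"
        + "  <!ELEMENT category (category*)>\n"
        + "  <!ATTLIST category catKey CDATA #REQUIRED>\n"
        + "]>\n"
        + "<result>\n"
        + "    <category catKey=\"school_level\">\n"
        + "        <category catKey=\"jc\"/>\n"
        + "    </category>\n"
        + "</result>\n";

    static Document parseIgnoringWhitespace(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        // Validation is needed so the parser knows which whitespace is ignorable.
        factory.setValidating(true);
        factory.setIgnoringElementContentWhitespace(true);
        DocumentBuilder builder = factory.newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parseIgnoringWhitespace(XML);
        // Number of direct children of the root <result> element.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Without a DTD the parser cannot tell ignorable whitespace from real text, so this setting is not expected to remove anything from the XML exactly as posted in the question.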

Answer 2

Whitespace (\n, \t) is a #text node.

Just skip the text nodes whose string value matches \s+, and/or do something like this:

public String unPretty(String pretty) { 
  return pretty.replaceAll(">\\s+<","><");
}
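
The "skip text nodes" option could be sketched roughly as follows. This is only an illustration, not the answerer's code: the class and method names are hypothetical, and it removes whitespace-only text nodes from the parsed DOM instead of rewriting the string.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class WhitespaceStripSketch {

    /** Recursively removes text nodes that consist only of whitespace. */
    static void stripWhitespace(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Remember the next sibling before possibly detaching this child.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().matches("\\s*")) {
                node.removeChild(child);
            } else {
                stripWhitespace(child);
            }
            child = next;
        }
    }

    /** Parses the XML string, then drops the indentation-only text nodes. */
    static Document parseAndStrip(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        stripWhitespace(doc.getDocumentElement());
        return doc;
    }

    public static void main(String[] args) throws Exception {
        String pretty = "<result>\n"
                + "    <category catKey=\"school_level\">\n"
                + "        <category catKey=\"jc\"/>\n"
                + "    </category>\n"
                + "</result>\n";
        Document doc = parseAndStrip(pretty);
        // Number of direct children of <result> after stripping.
        System.out.println(doc.getDocumentElement().getChildNodes().getLength());
    }
}
```

Compared with the regular expression on the raw string, working on the DOM leaves attribute values and CDATA sections untouched. Both approaches assume that whitespace between tags carries no meaning in your document.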
