Why does org.w3c.dom parse my XML incorrectly?

Question

After parsing the following XML,

<html>
    <body>
        <a>
            <div>
                <span>foo</span>
            </div>
        </a>
    </body>
</html>

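the org.w3c.dom Document, queried with javax.xml.xpath, reports the following:

div is the parent node of a
a is the parent node of span

Why does this happen, and how can I parse this XML correctly?

Here is the code I am using, followed by the method used to create the Document object, and then the output of the code.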

String myxml = ""
    + "<html>"
    + "<body>"
    + "<a>"
    + "<div>"
    + "<span>foo</span>"
    + "</div>"
    + "</a>"
    + "</body>"
    + "</html>";

Document doc = HttpDownloadUtilities.getWebpageDocument_fromSource(myxml);

XPath xPath = XPathFactory.newInstance().newXPath();

Node node = ((Node)xPath.compile("//*[text() = 'foo']").evaluate(doc, XPathConstants.NODE));

System.out.println("       node tag: " + node.getNodeName());
System.out.println("     parent tag: " + node.getParentNode().getNodeName());
System.out.println("grandparent tag: " + node.getParentNode().getParentNode().getNodeName());

Set<Node> nodes = H.getSet((NodeList)xPath.compile("//*").evaluate(doc, XPathConstants.NODESET));

for (Node n : nodes) {
    System.out.println();
    try {
        System.out.println("node: " + n.getNodeName());
    } catch (Exception e) {
    }
    try {
        System.out.println("child: " + n.getChildNodes().item(0).getNodeName());
    } catch (Exception e) {
    }
}

This is the method used to create the Document object:

public static Document getWebpageDocument_fromSource(String source) throws InterruptedException, IOException {
    try {
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        DocumentBuilderFactory builderFactory = DocumentBuilderFactory.newInstance();
        DocumentBuilder builder = null;
        try {
            builder = builderFactory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            e.printStackTrace();
        }

        TagNode tagNode = new HtmlCleaner().clean(source);

        Document doc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);

        return doc;
    } catch (ParserConfigurationException ex) {
        ex.printStackTrace();
        return null;
    }
}

Output:

       node tag: span
     parent tag: a
grandparent tag: div

node: html
child: head

node: head

node: body
child: html

node: html
child: body

node: body
child: a

node: a

node: div
child: a

node: a
child: span

node: span
child: #text

Answer 1

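The HTML parser is most likely repairing what it considers invalid HTML. Under HTML 4 content rules, a block-level div tag is not allowed inside an a tag, so HtmlCleaner rearranges the elements. By the time you have the Document object, the HTML has already been parsed and repaired; org.w3c.dom and XPath only report the tree they were given.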
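
If the input really is well-formed XML rather than HTML, one option is to skip HtmlCleaner and use the JDK's own XML parser, which does not apply HTML nesting rules. The following is a minimal sketch under that assumption; the class name StrictXmlParse and the helper parseXml are made up for illustration and are not part of the original code.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

class StrictXmlParse {

    // Hypothetical helper: parses the string as XML with the JDK parser.
    // No HTML repair happens here, so the input must be well-formed XML.
    static Document parseXml(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        String myxml = "<html><body><a><div><span>foo</span></div></a></body></html>";

        Document doc = parseXml(myxml);

        // Same XPath query as in the question.
        XPath xPath = XPathFactory.newInstance().newXPath();
        Node node = (Node) xPath.compile("//*[text() = 'foo']").evaluate(doc, XPathConstants.NODE);

        System.out.println("       node tag: " + node.getNodeName());                               // span
        System.out.println("     parent tag: " + node.getParentNode().getNodeName());               // div
        System.out.println("grandparent tag: " + node.getParentNode().getParentNode().getNodeName()); // a
    }
}
```

With this parser the nesting from the source is preserved (span inside div inside a). The trade-off is that real-world HTML, which is often not well-formed XML, will make builder.parse throw a SAXException instead of being repaired.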

Answer 1: 2 votes, accepted, 2015-07-31 21:30:32