Why is org.w3c.dom parsing my XML wrong?

After parsing the following XML,

<html>
    <body>
        <a>
            <div>
                <span>foo</span>
            </div>
        </a>
    </body>
</html>

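the org.w3c.dom Document, queried with javax.xml.xpath, reports the following:

  • div is the parent node of a
  • a is the parent node of span

Why is this happening, and how can I parse this XML correctly?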

Here is the code I am using, followed by the method used to create the Document object, and then the output of the code.

String myxml = ""
    + "<html>"
    + "<body>"
    + "<a>"
    + "<div>"
    + "<span>foo</span>"
    + "</div>"
    + "</a>"
    + "</body>"
    + "</html>";

Document doc = HttpDownloadUtilities.getWebpageDocument_fromSource(myxml);

XPath xPath = XPathFactory.newInstance().newXPath();

Node node = ((Node)xPath.compile("//*[text() = 'foo']").evaluate(doc, XPathConstants.NODE));

System.out.println("       node tag: " + node.getNodeName());
System.out.println("     parent tag: " + node.getParentNode().getNodeName());
System.out.println("grandparent tag: " + node.getParentNode().getParentNode().getNodeName());

Set<Node> nodes = H.getSet((NodeList)xPath.compile("//*").evaluate(doc, XPathConstants.NODESET));

for (Node n : nodes) {
    System.out.println();
    try {
        System.out.println("node: " + n.getNodeName());
    } catch (Exception e) {
    }
    try {
        System.out.println("child: " + n.getChildNodes().item(0).getNodeName());
    } catch (Exception e) {
    }
}

Here is the method used to create the Document object:

public static Document getWebpageDocument_fromSource(String source) throws InterruptedException, IOException {
    try {
        // NOTE: these properties are configured but never used; the clean() and
        // createDOM() calls below each create fresh default instances.
        HtmlCleaner cleaner = new HtmlCleaner();
        CleanerProperties props = cleaner.getProperties();
        props.setAllowHtmlInsideAttributes(true);
        props.setAllowMultiWordAttributes(true);
        props.setRecognizeUnicodeChars(true);
        props.setOmitComments(true);

        // HtmlCleaner parses the source as HTML and repairs markup it considers invalid.
        TagNode tagNode = new HtmlCleaner().clean(source);

        // Convert the repaired tag tree into an org.w3c.dom Document.
        Document doc = new DomSerializer(new CleanerProperties()).createDOM(tagNode);

        return doc;
    } catch (ParserConfigurationException ex) {
        ex.printStackTrace();
        return null;
    }
}

Output:

       node tag: span
     parent tag: a
grandparent tag: div

node: html
child: head

node: head

node: body
child: html

node: html
child: body

node: body
child: a

node: a

node: div
child: a

node: a
child: span

node: span
child: #text

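The HTML parser is most likely repairing what it considers invalid HTML. A div tag is not allowed inside an a tag (in HTML 4, a is an inline element and cannot contain block-level elements such as div). By the time you have the Document object, the HTML has already been parsed and repaired.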
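
If the input is known to be well-formed XML, one way to avoid the repair step is to skip HtmlCleaner and use the JDK's own XML parser, which applies no HTML nesting rules. The following is a minimal sketch under that assumption, not the asker's actual setup: the class name StrictXmlParse and the helpers parseXml, pathOf, and locate are hypothetical names introduced for illustration. Real-world HTML that is not well-formed XML would make this parser throw instead of repairing the markup.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper class; the names here are illustrative assumptions.
class StrictXmlParse {

    // Parse the string with the JDK XML parser, which does not apply HTML repair rules.
    static Document parseXml(String xml) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        return builder.parse(new InputSource(new StringReader(xml)));
    }

    // Build a path such as "html/body/a" by walking from the node up to the root element.
    static String pathOf(Node node) {
        StringBuilder path = new StringBuilder(node.getNodeName());
        for (Node p = node.getParentNode();
             p != null && p.getNodeType() == Node.ELEMENT_NODE;
             p = p.getParentNode()) {
            path.insert(0, p.getNodeName() + "/");
        }
        return path.toString();
    }

    // Parse the XML, evaluate the XPath expression, and return the path of the first match.
    static String locate(String xml, String xpathExpr) throws Exception {
        Document doc = parseXml(xml);
        XPath xPath = XPathFactory.newInstance().newXPath();
        Node node = (Node) xPath.compile(xpathExpr).evaluate(doc, XPathConstants.NODE);
        return pathOf(node);
    }

    public static void main(String[] args) throws Exception {
        String myxml = "<html><body><a><div><span>foo</span></div></a></body></html>";
        // Prints the ancestor chain of the element whose text is "foo".
        System.out.println(locate(myxml, "//*[text() = 'foo']"));
    }
}
```

With this parser the printed path is html/body/a/div/span, so span keeps div as its parent and a as its grandparent, matching the source text.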
