How to convert a Jsoup Document to a W3C Document?

Question

I have built a Jsoup Document by parsing an in-house HTML page:

public Document newDocument(String path) throws IOException {
    Document doc = null;
    doc = Jsoup.connect(path).timeout(0).get();
    return new HtmlDocument<Document>(doc);
}

I want to convert the Jsoup document to an org.w3c.dom.Document. I used an available library, DOMBuilder, for this, but when parsing I get the org.w3c.dom.Document as null. I am unable to understand the problem; I tried searching but couldn't find any answer.

Code to generate the W3C DOM Document:

Document jsoupDoc = factory.newDocument("http://localhost/testcases/test_2.html");
org.w3c.dom.Document docu = DOMBuilder.jsoup2DOM(jsoupDoc);

Can anyone please help me with this?

Answer 1

Alternatively, Jsoup provides the W3CDom class with the method fromJsoup. This method transforms a Jsoup Document into a W3C Document.

Document jsoupDoc = ...
W3CDom w3cDom = new W3CDom();
org.w3c.dom.Document w3cDoc = w3cDom.fromJsoup(jsoupDoc);

UPDATE:

Since 1.10.3, W3CDom is no longer experimental.
Up to Jsoup 1.10.2, the W3CDom class was still experimental.
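
As a fuller illustration of the fromJsoup call above, here is a minimal self-contained sketch. It assumes jsoup (1.10.3 or later) is on the classpath; the class name JsoupToW3cExample, the helper toW3c, and the sample HTML string are made up for the example and parse an in-memory string instead of fetching a page.

```java
import org.jsoup.Jsoup;
import org.jsoup.helper.W3CDom;

public class JsoupToW3cExample {

    // Hypothetical helper: parses an HTML string with jsoup and converts
    // the result to a W3C DOM document via W3CDom.fromJsoup.
    static org.w3c.dom.Document toW3c(String html) {
        org.jsoup.nodes.Document jsoupDoc = Jsoup.parse(html);
        return new W3CDom().fromJsoup(jsoupDoc);
    }

    public static void main(String[] args) {
        // Sample markup, assumed for illustration only.
        String html = "<html><head><title>Test</title></head>"
                + "<body><p>Hello</p></body></html>";
        org.w3c.dom.Document w3cDoc = toW3c(html);

        // Inspect the converted document through the standard DOM API.
        System.out.println(w3cDoc.getDocumentElement().getNodeName());
        System.out.println(w3cDoc.getElementsByTagName("p").item(0).getTextContent());
    }
}
```

The fully qualified names are used because both jsoup and the W3C DOM API define a class called Document, so only one of them can be imported by simple name.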

Answer 2

To retrieve a jsoup document via HTTP, make a call to Jsoup.connect(...).get(). To load a jsoup document locally, make a call to Jsoup.parse(new File("..."), "UTF-8").

The call to DOMBuilder is correct.

When you say,

"I used an available library DOMBuilder for this but when parsing I get org.w3c.dom.Document as null."

I think you mean, "I used an available library, DOMBuilder, for this, but when printing the result, I get [#document: null]." At least, that was the result I saw when I tried printing the w3cDoc object, but that doesn't mean the object is null. I was able to traverse the document by making calls to getDocumentElement and getChildNodes.

public static void main(String[] args) {
    Document jsoupDoc = null;

    try {
        jsoupDoc = Jsoup.connect("http://stackoverflow.com/questions/17802445").get();
    } catch (IOException e) {
        e.printStackTrace();
    }

    org.w3c.dom.Document w3cDoc= DOMBuilder.jsoup2DOM(jsoupDoc);
    Element e = w3cDoc.getDocumentElement();
    NodeList childNodes = e.getChildNodes();
    Node n = childNodes.item(2);
    System.out.println(n.getNodeName());
}
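
The printing behaviour described above can be reproduced without jsoup or DOMBuilder. The following is a rough sketch using only the JDK's built-in XML APIs; the class name DomPrintDemo, its helper methods, and the sample markup are assumptions made for illustration. What doc.toString() returns depends on the DOM implementation; with the JDK's bundled one it is typically [#document: null], because a document node has no node value.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class DomPrintDemo {

    // Builds a small W3C document from a hard-coded, well-formed sample string.
    static Document buildSample() throws Exception {
        String xml = "<html><body><p>Hello</p></body></html>";
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    // Serializes the document to markup so its real content can be inspected.
    static String serialize(Document doc) throws Exception {
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.setOutputProperty(OutputKeys.METHOD, "xml");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        t.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        Document doc = buildSample();
        // Implementation-dependent text; says nothing about whether doc is empty.
        System.out.println(doc);
        // The document is non-null and can be traversed as usual.
        System.out.println(doc.getDocumentElement().getNodeName());
        // Serializing shows the actual markup held by the document.
        System.out.println(serialize(doc));
    }
}
```

When debugging a conversion, checking doc == null explicitly or serializing the document is more reliable than relying on what System.out.println(doc) shows.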

Answer 3

I think a lot of updates have happened up to now (2022).

org.w3c.dom.Document document = W3CDom.convert(jsoupDoc);

This worked for me.

Answer 1
19 2015-05-15 11:44:03

Answer 2
6 accepted 2013-09-25 20:08:10

Answer 3
0 2022-01-27 21:53:13
