Converting an HTML String to org.w3c.dom.Document in Java

Question

To convert an HTML String to an org.w3c.dom.Document, I'm using jtidy-r938.jar. Here is my code:

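import java.io.ByteArrayInputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public static Document getDoc(String html) {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE);
    // tidy.setPrintBodyOnly(true);
    tidy.setXmlOut(false);
    tidy.setShowErrors(0);
    tidy.setShowWarnings(false);
    // tidy.setForceOutput(true);
    tidy.setQuiet(true);
    // Send JTidy's diagnostics to a throwaway writer instead of stderr.
    PrintWriter dummyOut = new PrintWriter(new StringWriter());
    tidy.setErrout(dummyOut);
    tidy.setSmartIndent(true);
    // Encode with the same charset declared via setInputEncoding above,
    // rather than the platform default used by html.getBytes().
    ByteArrayInputStream inputStream =
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    // Passing null as the output stream builds the DOM without writing tidied markup.
    return tidy.parseDOM(inputStream, null);
}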

But sometimes the library works incorrectly and some tags are lost.

Can you recommend a good open-source library for this task?

Thanks very much!

Answer 1

You don't say why the library sometimes gives a bad result. That said, I regularly work with HTML files that I have to extract data from, and the main problem I run into is that some tags are invalid, for example because they are never closed. The best solution I have found is the HtmlCleaner API (HtmlCleaner website).

It lets you make your HTML well formed. After that, converting it to a W3C Document or another strict format is easier.
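To illustrate why well-formedness matters, here is a minimal sketch that uses only the JDK's own XML parser, not HtmlCleaner or JTidy. The class name StrictParseDemo and its helper methods are made up for this example. It shows that markup with every tag closed parses straight into an org.w3c.dom.Document, while an unclosed tag makes a strict parser reject the input.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; the names are illustrative only.
class StrictParseDemo {

    // Parses markup with the JDK's strict XML parser.
    // Throws an exception if the markup is not well formed.
    static Document parseStrict(String markup) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors but does not print them to stderr.
        builder.setErrorHandler(new DefaultHandler());
        return builder.parse(new InputSource(new StringReader(markup)));
    }

    // Returns true if the strict parser accepts the markup.
    static boolean isWellFormed(String markup) {
        try {
            parseStrict(markup);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String closed = "<html><body><p>Hello</p><br/></body></html>";
        String unclosed = "<html><body><p>Hello<br></body></html>";
        System.out.println(isWellFormed(closed));   // prints true
        System.out.println(isWellFormed(unclosed)); // prints false
    }
}
```

Real-world HTML usually looks like the second string, which is why a cleaning step comes first.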

With HtmlCleaner, you could do something like this:

HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(html);
DomSerializer ser = new DomSerializer(cleaner.getProperties());
Document myW3cDoc = ser.createDOM(node);

The DomSerializer used here is the one from HtmlCleaner.
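Since the question mentions that tags sometimes go missing, one way to check a converted Document is to count its elements by tag name and compare the counts with what you expect from the source HTML. The sketch below is only an illustration: TagCountDemo, countTags, and sampleDocument are hypothetical names, and the sample document is built by hand with the JDK rather than produced by JTidy or HtmlCleaner. It relies only on basic DOM node traversal, so it should work on a Document from either library.

```java
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper class; the names are illustrative only.
class TagCountDemo {

    // Returns a sorted map from lower-cased tag name to number of occurrences.
    static Map<String, Integer> countTags(Document doc) {
        Map<String, Integer> counts = new TreeMap<>();
        walk(doc.getDocumentElement(), counts);
        return counts;
    }

    // Depth-first walk over the tree, counting element nodes only.
    private static void walk(Node node, Map<String, Integer> counts) {
        if (node == null) {
            return;
        }
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            counts.merge(node.getNodeName().toLowerCase(), 1, Integer::sum);
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            walk(child, counts);
        }
    }

    // Builds a tiny stand-in document: <html><body><p/><p/></body></html>.
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            doc.appendChild(html);
            html.appendChild(body);
            body.appendChild(doc.createElement("p"));
            body.appendChild(doc.createElement("p"));
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countTags(sampleDocument())); // prints {body=1, html=1, p=2}
    }
}
```

Running countTags on the Document returned by your converter and comparing the result against the original page makes it easier to say exactly which tags were dropped.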

Answer 1: 3, accepted, 2015-06-07 11:02:49