
Convert HTML String to org.w3c.dom.Document in Java

To convert an HTML String to an org.w3c.dom.Document I'm using jtidy-r938.jar. Here is my code:

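import java.io.ByteArrayInputStream;
import java.io.PrintWriter;
import java.io.StringWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import org.w3c.dom.Document;
import org.w3c.tidy.Tidy;

public static Document getDoc(String html) {
    Tidy tidy = new Tidy();
    tidy.setInputEncoding("UTF-8");
    tidy.setOutputEncoding("UTF-8");
    tidy.setWraplen(Integer.MAX_VALUE);
    // tidy.setPrintBodyOnly(true);
    tidy.setXmlOut(false);
    tidy.setShowErrors(0);
    tidy.setShowWarnings(false);
    // tidy.setForceOutput(true);
    tidy.setQuiet(true);
    // Send JTidy's diagnostics to a throwaway writer instead of stderr.
    Writer out = new StringWriter();
    PrintWriter dummyOut = new PrintWriter(out);
    tidy.setErrout(dummyOut);
    tidy.setSmartIndent(true);
    // Encode with the same charset declared in setInputEncoding above,
    // rather than the platform default used by getBytes().
    ByteArrayInputStream inputStream =
            new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    // A null output stream means only the DOM is built, nothing is written.
    return tidy.parseDOM(inputStream, null);
}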

But sometimes the library works incorrectly and some tags are lost.

Can you recommend a good open-source library for this task?

Thanks very much!

You don't say why the library sometimes fails to give a good result. Nevertheless, I work regularly with HTML files that I have to extract data from, and the main problem I run into is that some tags are invalid, for example because they are never closed. The best solution I have found for this is the HtmlCleaner API (HtmlCleaner website).

It lets you turn your HTML into well-formed markup. After that, converting it to a W3C Document or another strict format is easier.

With HtmlCleaner, you could do something like this:

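// Parse the (possibly malformed) HTML into HtmlCleaner's own tag tree.
HtmlCleaner cleaner = new HtmlCleaner();
TagNode node = cleaner.clean(html);
// Convert that tree to an org.w3c.dom.Document;
// createDOM declares ParserConfigurationException.
DomSerializer ser = new DomSerializer(cleaner.getProperties());
Document myW3cDoc = ser.createDOM(node);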

The DomSerializer used here is the one that ships with HtmlCleaner.
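To illustrate why well-formed markup is the easy case: once every tag is closed, the parser bundled with the JDK can build the org.w3c.dom.Document directly, and serializing the result back to text is a quick way to check whether any tags were lost. The following is only a minimal sketch, not part of the original answer. The class name WellFormedHtmlToDom and its methods parse and serialize are hypothetical, and the sketch assumes the input has already been cleaned (for example by HtmlCleaner), since it deliberately rejects anything that is not well formed.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper class; uses only classes that ship with the JDK.
final class WellFormedHtmlToDom {

    // Parses markup that is ALREADY well formed (all tags closed, attributes quoted).
    // Throws IllegalArgumentException for anything the XML parser cannot accept.
    static Document parse(String wellFormedHtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(wellFormedHtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Input is not well-formed markup", e);
        }
    }

    // Serializes the DOM back to a string so the surviving tags can be inspected.
    static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            // Force XML output; otherwise a root element named "html" may
            // switch the serializer to its HTML output method.
            transformer.setOutputProperty(OutputKeys.METHOD, "xml");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize document", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parse("<html><body><p>Hello</p><br/></body></html>");
        System.out.println("p elements: " + doc.getElementsByTagName("p").getLength());
        System.out.println(serialize(doc));
    }
}
```

Raw HTML from the web will usually fail in parse, which is exactly the gap a cleaning step such as HtmlCleaner is meant to fill. Comparing the output of serialize with the original input is one way to spot which tags a given library dropped.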
