
Possible to parse an HTML document and build a DOM tree (Java)

Is it possible, and what tools could be used, to parse an HTML document from a string or a file and then construct a DOM tree that a developer can walk through some API?

For example:

DomRoot = parse("myhtml.html");

for (tags : DomRoot) {
}

Note: this is an HTML document, not XHTML.

You can use TagSoup. It is a SAX-compliant parser that cleans malformed content, such as HTML from generic web pages, into well-formed XML. For example:

This is <B>bold, <I>bold italic, </b>italic, </i>normal text

gets correctly rewritten as:

This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
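Because TagSoup is exposed as a SAX parser, one way to get from SAX events to a DOM tree is an identity transform from a SAXSource into a DOMResult. The sketch below is only an illustration of that idea: the class name SaxToDom and the method toDom are made up for this example, and it runs against the JDK's own SAX parser as a stand-in, which accepts only well-formed XML. The assumption is that, with TagSoup on the classpath, an instance of org.ccil.cowan.tagsoup.Parser could be passed as the reader instead so that malformed HTML is accepted.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXSource;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

// Hypothetical helper class, not part of TagSoup or the JDK.
class SaxToDom {

    // Feeds the markup through any SAX XMLReader and collects the
    // resulting events into a new DOM Document.
    static Document toDom(XMLReader reader, String markup) throws Exception {
        // A Transformer created without a stylesheet copies input to output.
        Transformer identity = TransformerFactory.newInstance().newTransformer();
        DOMResult result = new DOMResult();
        SAXSource source = new SAXSource(reader, new InputSource(new StringReader(markup)));
        identity.transform(source, result);
        return (Document) result.getNode();
    }

    public static void main(String[] args) throws Exception {
        // Stand-in reader: the JDK SAX parser (well-formed XML only).
        // Assumption: a TagSoup parser instance could be used here instead.
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);
        XMLReader reader = factory.newSAXParser().getXMLReader();

        // The cleaned-up fragment from above, wrapped in a root element.
        String cleaned = "<p>This is <b>bold, <i>bold italic, </i></b>"
                + "<i>italic, </i>normal text.</p>";
        Document doc = toDom(reader, cleaned);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

The reader is passed in as a parameter so that the DOM-building code does not depend on which SAX parser produced the events.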

JTidy should let you do what you want.

Usage is fairly straightforward, and parsing is configurable. For example:

InputStream in = ...;
Tidy tidy = new Tidy();
// configure Tidy instance as required
...
...
Document doc = tidy.parseDOM(in, null);
Element root = doc.getDocumentElement();

The JavaDoc is hosted here.

You can take a look at NekoHTML, a Java library that performs best-effort cleaning and tag balancing on your document. It is an easy way to parse a malformed HTML (or non-valid XML) file.

It is distributed under the Apache 2.0 license.

HTML Parser seems to support conversion from HTML to XML. Then you can build a DOM tree using the usual Java toolchain.
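As a rough sketch of that last step, once the HTML has been converted to well-formed XML, the JDK's own DocumentBuilder can build the DOM, and the tree can be walked with a short recursive method. The class name DomWalk, its methods, and the sample markup are made up for illustration; the input string stands in for whatever the HTML-to-XML conversion produces.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper class for illustration only.
class DomWalk {

    // Builds a DOM tree from well-formed XML using the JDK parser.
    static Document parse(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        return factory.newDocumentBuilder().parse(new ByteArrayInputStream(bytes));
    }

    // Depth-first walk that records element names in document order.
    static void collectTags(Node node, List<String> out) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            out.add(node.getNodeName());
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectTags(child, out);
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the XML produced by an HTML-to-XML conversion.
        String xhtml = "<html><head><title>Demo</title></head>"
                + "<body><p>Hello <b>world</b></p></body></html>";
        List<String> tags = new ArrayList<>();
        collectTags(parse(xhtml).getDocumentElement(), tags);
        System.out.println(tags); // [html, head, title, body, p, b]
    }
}
```

The same walk should work on a root Element obtained from any parser that returns org.w3c.dom nodes, since it relies only on the standard Node interface.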

There are several open-source tools for parsing HTML from Java.

See http://java-source.net/open-source/html-parsers

You can also check the answers to this question, which is almost the same: Reading HTML file to DOM tree using Java
