
Java library for cleaning up HTML just like a browser would

Original post: 2011-05-24 15:43:35 | Tags: java, html, html-parsing

So here's the challenge... I need to create clean HTML from random web pages out there in the wild. My goal is to read in a page and pass it off to a library, which will in turn give me back perfectly well-formed HTML.

Doesn't sound so tough, right? After all, every browser on the market effectively deals with the challenge of malformed HTML, turning it into something renderable on nearly every page load. Each has its own slightly particular algorithm for cleaning up the contents (ahem... for HTML < 5, that is), but they tend to do a very good job of capturing what I like to refer to as the author's intention. So then, why can't I find a good Java library for this very task?

One thing to mention is that I'm not at all interested in parsing the HTML as XML. I've found that libraries such as NekoHTML, TagSoup, HtmlCleaner, and JTidy (to name a few) are more focused on solving the problem of converting HTML to valid XML, and in the process they lose sight of how the poorly formatted document should be restructured. With nasty HTML, they frequently fail to capture the author's intention and spit out documents that render quite differently from the original source. And for this project, it's of the utmost importance that the two documents render similarly.

I am quite fond of Jericho HTML, but it doesn't seem to be the ideal candidate for this job... at least not without a lot of effort on my part. Also, native dependencies are a no-go, so the Mozilla parser is out.

Can anyone help me in my search for the perfect HTML parser? Thanks in advance!