
Is there a Standard Java SE HTML Parser? If so, why use non-standard ones?

Original post: 2012-01-31 07:14:48 | Tags: java, html, html-parsing, html-parser

I need to parse a simple HTML page with a simple form in it. The answers to similar questions on StackOverflow suggest using one of a large variety of non-standard Java libraries such as TagSoup, JSoup, HTMLParser and many others.

However, a web search revealed that Java SE already offers some standard functionality through this class: http://docs.oracle.com/javase/7/docs/api/javax/swing/text/html/parser/ParserDelegator.html

My sub-questions are:

Is it really true that the standard ParserDelegator class can handle a use case like mine?
What limitations of the standard library create the need for so many non-standard libraries?
Does the fact that ParserDelegator lives in the Swing package preclude using it on a regular EC2 cloud server for a web application? Would I have to jump through a lot of hoops to get around the headless aspect, or would it be just a small tweak to the configuration?
If the standard one is not recommended, which non-standard one should I use, given: (a) my desire not to stray far from the standard; (b) my simple use case; (c) my desire for a mature, reliable implementation; and (d) no size or weight limitations, since this is a server application as opposed to an embedded client. API is a far lower priority, so while I do appreciate JSoup's CSS-selector-like API, concerns (a) through (d) override it.

Thank you.

1 answer

The JDK has a built-in HTML parser, but it targets an old version of HTML (the Swing HTML package documents support for HTML 3.2). It should be able to parse basic text-formatting tags and forms.
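As a rough illustration of the simple-form use case, here is a minimal sketch that uses only the built-in parser to collect the `name` attribute of each `<input>` element. The class name `FormFieldLister`, the method `inputNames`, and the sample markup are made up for this example and do not come from the question. The code opens no windows, so I would expect it to run on a headless server, but that is worth verifying in your own environment.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named here only for illustration.
public class FormFieldLister {

    /** Returns the name attribute of every <input> element that has one. */
    public static List<String> inputNames(String html) throws IOException {
        final List<String> names = new ArrayList<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // <input> has no end tag, so the parser reports it as a "simple" tag.
                if (tag == HTML.Tag.INPUT) {
                    Object name = attrs.getAttribute(HTML.Attribute.NAME);
                    if (name != null) {
                        names.add(name.toString());
                    }
                }
            }
        };

        // The third argument (ignoreCharSet = true) tells the parser to ignore
        // any charset declared in the page; the input is already a decoded String.
        new ParserDelegator().parse(new StringReader(html), callback, true);
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Sample markup invented for this sketch.
        String html = "<html><body><form action=\"/login\" method=\"post\">"
                + "<input type=\"text\" name=\"user\">"
                + "<input type=\"password\" name=\"pass\">"
                + "<input type=\"submit\" value=\"Go\">"
                + "</form></body></html>";
        System.out.println(inputNames(html));
    }
}
```

The parser is event-based: it builds no document tree, so anything beyond collecting attributes (for example, relating a field to its enclosing form) means tracking state in the callback yourself, for instance by also overriding `handleStartTag` and `handleEndTag`.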

The reason to use third-party parsers is the need to handle "real" HTML pages: DHTML, JavaScript, and so on.

JSoup is one of the popular parsers that can do the job. For more information about other implementations, please take a look at the following discussion:

Pure Java HTML viewer/renderer for use in a Scrollable pane