Extracting text from an HTML segment using standard Java

Question

I'm receiving a segment of an HTML document as a Java String and I would like to extract its inner text. For example: hello world ----> hello world

Is there a way to extract the text using the Java standard library? Something maybe more efficient than replacing every open/close tag matched by a regex with an empty string? Thanks,

Answer 1

Don't use regex to parse HTML; use a dedicated parser like HtmlCleaner.

A regex will usually work on the first test, and then grow more and more complex until it becomes impossible to adapt.
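To illustrate the point, here is a minimal sketch of the tag-stripping regex the question alludes to. The class name RegexStripDemo and the helper stripTags are made up for this example, and the pattern is only one plausible form of such a regex. It handles the simple case but breaks as soon as an attribute value contains a '>' and it never decodes entities.

```java
class RegexStripDemo {
    // Hypothetical helper: the naive approach, replacing anything that looks like a tag with "".
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup works as hoped.
        System.out.println(stripTags("<b>hello</b> world"));          // hello world
        // A '>' inside an attribute value ends the match early and leaks markup.
        System.out.println(stripTags("<p title=\"a > b\">text</p>")); //  b">text
        // Character entities are left undecoded.
        System.out.println(stripTags("<p>fish &amp; chips</p>"));     // fish &amp; chips
    }
}
```

Each of these failures can be patched with a more elaborate pattern, which is exactly how the regex ends up unmaintainable.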

Answer 2

Don't use regular expressions to parse HTML; use, for instance, jsoup: Java HTML Parser. It has a convenient way to select elements from the DOM.

Example: fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

There is also an HTML parser in the JDK: javax.swing.text.html.parser.Parser, which could be applied like this:

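// harvester is an HTMLEditorKit.ParserCallback whose handle* methods are overridden
Reader in = new InputStreamReader(new URL(webpageURL).openConnection().getInputStream());
ParserDelegator parserDelegator = new ParserDelegator();
// third argument true: ignore any charset declaration inside the document
parserDelegator.parse(in, harvester, true);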

Then, depending on what you are looking for (start tags, end tags, attributes, etc.), you define the appropriate callback function:

@Override
public void handleStartTag(HTML.Tag tag,
        MutableAttributeSet mutableAttributeSet, int pos) {

    // called for every start tag; only <a> and <area> tags are of interest here
    if (tag == HTML.Tag.A || tag == HTML.Tag.AREA) {

        // reading the href attribute of the tag
        String address = (String) mutableAttributeSet
                .getAttribute(HTML.Attribute.HREF);

    /* ... */
    }
}
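
For the original question (inner text rather than links), the same callback mechanism can be pointed at text nodes instead of start tags. The following is a minimal sketch under stated assumptions, not a definitive implementation: the class name HtmlTextExtractor and the helper extractText are hypothetical, and the input is assumed to be a String already in memory, so a StringReader replaces the URL stream. Only handleText is overridden, so tags are skipped and their text content is collected.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlTextExtractor {
    // Hypothetical helper: returns the text content of an HTML fragment.
    static String extractText(String html) throws IOException {
        StringBuilder text = new StringBuilder();
        HTMLEditorKit.ParserCallback collector = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Invoked for each run of character data found between tags.
                text.append(data);
            }
        };
        // true: ignore any charset declaration, since the input is already a String.
        new ParserDelegator().parse(new StringReader(html), collector, true);
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(extractText("<p>Hello <b>world</b> &amp; friends</p>"));
    }
}
```

The Swing parser is built around an HTML 3.2 DTD and normalizes whitespace in its own way, so the exact spacing of the result, and its handling of modern markup, should be checked against real input before relying on it.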

Answer 3

I will also say it: don't use regex with HTML. ;-)

You can give JTidy a shot.

Answer 4

You can use HTMLParser, which is open source.

4 answers

Answer 1
2 2012-07-12 07:38:51

Answer 2
2 2012-07-12 07:39:41

Answer 3
2 2012-07-12 07:40:28

Answer 4
1 2012-07-12 07:48:41
