
Extract text from an HTML segment using standard Java

I'm receiving a segment of an HTML document as a Java String and I would like to extract its inner text. For example: hello world (with its surrounding tags) ----> hello world

Is there a way to extract the text using the Java standard library? Maybe something more efficient than a regex that replaces open/close tags with an empty string? Thanks,

Don't use regex to parse HTML; use a dedicated parser like HtmlCleaner.

A regex will usually work on the first test, and then it gets more and more complex until it becomes impossible to adapt.
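To make that failure mode concrete, here is a small sketch of the naive tag-stripping regex. The pattern `<[^>]*>` and the class name RegexStripDemo are assumptions for illustration, not something taken from the question. It handles the simple case, but a `>` inside an attribute value is enough to break it.

```java
class RegexStripDemo {

    // Naive approach: delete everything from '<' up to the next '>'.
    static String stripTags(String html) {
        return html.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple markup: the tags are removed cleanly.
        System.out.println(stripTags("<b>hello <i>world</i></b>"));

        // A '>' inside an attribute value ends the match too early,
        // so part of the attribute leaks into the result.
        System.out.println(stripTags("<a title=\"x > y\">link</a>"));
    }
}
```

The second call leaves the tail of the title attribute in front of the link text, which is the kind of case that forces the pattern to keep growing.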

Don't use regular expressions to parse HTML; use, for instance, jsoup: Java HTML Parser. It has a convenient way to select elements from the DOM.

Example: fetch the Wikipedia homepage, parse it to a DOM, and select the headlines from the "In the news" section into a list of Elements:

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");

There is also an HTML parser in the JDK: javax.swing.text.html.parser.Parser, which could be applied like this:

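// webpageURL is the address of the page to read
Reader in = new InputStreamReader(new URL(webpageURL).openConnection().getInputStream());
ParserDelegator parserDelegator = new ParserDelegator();
// harvester is an HTMLEditorKit.ParserCallback; true means the charset directive is ignored
parserDelegator.parse(in, harvester, true);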

Then, depending on what you are looking for (start tags, end tags, attributes, etc.), you define the appropriate callback function:

@Override
public void handleStartTag(HTML.Tag tag,
        MutableAttributeSet mutableAttributeSet, int pos) {

    // react only to <a> and <area> tags
    if (tag == HTML.Tag.A || tag == HTML.Tag.AREA) {

        // read the href attribute of the tag
        String address = (String) mutableAttributeSet
                .getAttribute(HTML.Attribute.HREF);

        /* ... */
    }
}
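For the goal in the question (inner text rather than links), the same JDK parser can be fed a String through a StringReader and given a handleText callback. This is only a minimal sketch: the class name HtmlTextExtractor and the method extractText are hypothetical, and the spacing between pieces of text is whatever the Swing parser reports, so it may differ from what a browser would render.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

class HtmlTextExtractor {

    // Hypothetical helper: collects every text run the parser reports.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                // Called for each run of character data between tags.
                text.append(data);
            }
        };

        try {
            // true: ignore any charset directive, since the input is already a String.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<b>hello <i>world</i></b>"));
    }
}
```

Text from adjacent block elements may come back without a separator, so overriding handleStartTag or handleEndTag to append a space or newline is a reasonable refinement if that matters.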

I will also say it: don't use regex with HTML. ;-)

You can give JTidy a shot.

You can use HTMLParser, which is open source.
