
Text extraction with Java HTML parsers

Original post: 2010-04-09 18:37:38 · Tags: java, html, text, parsing, extraction

I want to use an HTML parser that does the following in a nice, elegant way:

Extract text (this is the most important)
Extract links and meta keywords
Reconstruct the original document (optional, but a nice feature to have)

From my investigation so far, Jericho seems to fit. Are there any other open source libraries you would recommend?

3 answers

I recently experimented with HtmlCleaner and CyberNekoHtml. CyberNekoHtml is a DOM/SAX parser that produces predictable results. HtmlCleaner is a tad faster, but quite often fails to produce accurate results.

I would recommend CyberNekoHtml. It can do all of the things you mentioned. For example, it is very easy to extract a list of all elements and their attributes. If you wanted to reconstruct the page, you could traverse the DOM tree and build each element back into HTML.
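To make the DOM approach concrete, here is a rough sketch of the three tasks from the question (text, links and meta keywords, rebuilding markup) written against the standard `org.w3c.dom` interfaces. It is not CyberNekoHtml code: to keep it runnable without extra jars, the tree is built with the JDK's own XML parser from a small well-formed snippet, which is an assumption made only for illustration. The class name `DomExtractSketch`, the `SAMPLE` page and all helper methods are made up. A real HTML parser may normalise tag-name case differently, so the lookups might need adjusting.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomExtractSketch {
    // Hypothetical sample page; kept well-formed so the JDK XML parser accepts it.
    static final String SAMPLE =
        "<html><head><meta name=\"keywords\" content=\"java,html\"/></head>"
      + "<body><p>Hello <a href=\"http://example.com/a\">world</a></p></body></html>";

    // Stand-in for an HTML parser: builds a W3C DOM from well-formed markup.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    // All text content of the document, concatenated in document order.
    static String text(Document doc) {
        return doc.getDocumentElement().getTextContent();
    }

    // The href value of every <a> element that has one.
    static List<String> links(Document doc) {
        List<String> hrefs = new ArrayList<>();
        NodeList anchors = doc.getElementsByTagName("a");
        for (int i = 0; i < anchors.getLength(); i++) {
            Element a = (Element) anchors.item(i);
            if (a.hasAttribute("href")) {
                hrefs.add(a.getAttribute("href"));
            }
        }
        return hrefs;
    }

    // The content of the first <meta name="keywords">, or "" if there is none.
    static String metaKeywords(Document doc) {
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element m = (Element) metas.item(i);
            if ("keywords".equalsIgnoreCase(m.getAttribute("name"))) {
                return m.getAttribute("content");
            }
        }
        return "";
    }

    // Rebuilds markup by walking the tree recursively. Simplification: every
    // element gets an explicit closing tag (even void ones like <meta>), and
    // comments and other node types are dropped.
    static String toHtml(Node node) {
        switch (node.getNodeType()) {
            case Node.DOCUMENT_NODE:
                return toHtml(((Document) node).getDocumentElement());
            case Node.TEXT_NODE:
                return escape(node.getNodeValue());
            case Node.ELEMENT_NODE:
                break;
            default:
                return "";
        }
        StringBuilder sb = new StringBuilder("<").append(node.getNodeName());
        NamedNodeMap attrs = node.getAttributes();
        for (int i = 0; i < attrs.getLength(); i++) {
            Node a = attrs.item(i);
            sb.append(' ').append(a.getNodeName())
              .append("=\"").append(escape(a.getNodeValue())).append('"');
        }
        sb.append('>');
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            sb.append(toHtml(c));
        }
        return sb.append("</").append(node.getNodeName()).append('>').toString();
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Document doc = parse(SAMPLE);
        System.out.println("Text:     " + text(doc));
        System.out.println("Links:    " + links(doc));
        System.out.println("Keywords: " + metaKeywords(doc));
        System.out.println("Rebuilt:  " + toHtml(doc));
    }
}
```

With a real HTML parser only `parse` would change; the traversal code works on any `org.w3c.dom.Document`. Note that the rebuilt markup is a normalised version of the page, not a byte-for-byte copy of the original.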

There's a list of open source Java HTML parsers here: http://java-source.net/open-source/html-parsers

I would definitely go for JSoup.

It's a very elegant library and does exactly what you need.

See Example Here

I ended up using HtmlCleaner (http://htmlcleaner.sourceforge.net/) for something similar. It's really easy to use and was quick for what I needed.