
Lucene searching

Original post: 2012-10-04 04:36:01 | Tags: java, lucene, indexing, full-text-search, content-indexing

Dear Stack Overflow developers, I need your help. I am stuck using Apache Lucene in a Java Swing application. The problem is complex enough that I am not even sure how to ask it, so please try to understand my actual requirement. The case is simple: I have to provide HTML files that the client can access in a Swing application, and I decided to use Apache Lucene indexing for the search facility. That gives me search, but now I want to display the content of the HTML files that match the search criteria. I am using Swing, and JEditorPane is the control in which I have to display the contents of an HTML file. Please suggest how I should index the HTML files and how I can get their content back from the Lucene index. The HTML files contain not only text but also links, images, etc.

Thanks in advance, hoping for your help. Regards

1 solution

In one of our projects where we used Lucene for full-text indexing and search, we handled HTML files as follows:

Stored the HTML document as is on disk (you can store it in the DB as well).
Using Jericho HTMLParser's HTML->Text converter, we extracted the text, links, etc. out of the HTML documents.
The Lucene document had attributes (fields) that stored metadata about the HTML file, in addition to the text content of the HTML in tokenized form.
Used StandardAnalyzer so that certain tokens, such as email addresses and website links, were kept as is during tokenization before indexing.
Upon searching the index, the hits returned contained the metadata of the HTML files that matched the criteria, so we were able to identify the HTML content to display for a given search result.
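
The flow above can be sketched roughly as follows. To stay runnable with only the JDK, this sketch deliberately does not use Lucene or Jericho: the JDK's `ParserDelegator` stands in for the HTML->Text converter, and a small in-memory map stands in for the Lucene index. The class name `HtmlIndexSketch`, its methods, and the sample paths are hypothetical, and the tokenizer is far cruder than StandardAnalyzer. It only illustrates the idea that the index holds extracted text plus the file path as metadata, while the HTML itself stays on disk.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

public class HtmlIndexSketch {
    // token -> paths of the HTML files containing it (the path is the "metadata" of a hit)
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Extracts the text content of an HTML string using the JDK's built-in parser. */
    static String extractText(String html) {
        StringBuilder out = new StringBuilder();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleText(char[] data, int pos) {
                out.append(data).append(' ');
            }
        };
        try {
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    /** Indexes one file: tokenizes the extracted text and records the path for each token. */
    void add(String path, String html) {
        String text = extractText(html).toLowerCase(Locale.ROOT);
        for (String token : text.split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, k -> new LinkedHashSet<>()).add(path);
            }
        }
    }

    /** Returns the paths of files containing the term; the caller loads the HTML from there. */
    List<String> search(String term) {
        Set<String> hits = postings.getOrDefault(term.toLowerCase(Locale.ROOT),
                Collections.emptySet());
        return new ArrayList<>(hits);
    }

    public static void main(String[] args) {
        HtmlIndexSketch index = new HtmlIndexSketch();
        index.add("docs/a.html", "<html><body><h1>Lucene</h1><p>Full text search</p></body></html>");
        index.add("docs/b.html", "<html><body><p>Swing <b>JEditorPane</b> basics</p></body></html>");
        System.out.println(index.search("lucene"));
        System.out.println(index.search("jeditorpane"));
    }
}
```

For display in the Swing application, a path returned by the search can be handed to `JEditorPane.setPage(new File(path).toURI().toURL())`. Loading the stored file this way, instead of trying to rebuild HTML from the index, lets relative links and images in the page resolve against the file's location.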