
How do we create a simple search engine using Lucene, Solr or Nutch?

Original post: 2008-10-21 21:15:17 | Tags: lucene, solr, nutch

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform basic AND/OR queries, then show them links to all the matching PDFs.

10 Answers

I have had good luck with Lucene, but it is not click, install and search; it does require a bit of work.
If you need something that you can download, install and be searching with inside 10 minutes, look at the free OmniFind Yahoo! Edition http://omnifind.ibm.yahoo.net/ . It uses Lucene, but it is packaged so that it is configured and ready to run on install, which makes it a much easier way to try Lucene.

Nutch + Lucene, with the PDF plugin enabled in Nutch, is your solution. Nutch can parse PDFs once the PDF plugin is enabled.

Lucene will let you index the crawled and parsed data, and Nutch has a servlet that gives you a search interface.

We use the same setup on our internal LANs.

Google Search Appliance http://www.google.com/enterprise/gsa/

None of the projects in the Lucene family can natively process PDFs, but there are utilities you can drop in, and well-written examples of how to roll your own.

Lucene will do pretty much whatever you need it to do, but there is overhead in terms of your time, as Tony said above. Thousands of documents really isn't that many, so you might be able to get away with a lighter-weight alternative.
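To give a feel for what a lighter-weight approach to basic AND/OR queries could look like, here is a minimal sketch of an in-memory inverted index in plain Java. It is not Lucene and not anything from the original answers: the class name TinyIndex and its methods are made up for illustration, and it assumes the PDF text has already been extracted to strings by some other tool.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical illustration only: a tiny inverted index, not a Lucene API.
class TinyIndex {
    // term -> sorted set of document names that contain the term
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexes text that was already extracted from one document. */
    void add(String docName, String text) {
        // Split on anything that is not a letter or digit.
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docName);
            }
        }
    }

    /** AND query: documents that contain every term. */
    Set<String> searchAll(String... terms) {
        Set<String> result = null;
        for (String term : terms) {
            Set<String> docs = lookup(term);
            if (result == null) {
                result = new TreeSet<>(docs);
            } else {
                result.retainAll(docs);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    /** OR query: documents that contain at least one term. */
    Set<String> searchAny(String... terms) {
        Set<String> result = new TreeSet<>();
        for (String term : terms) {
            result.addAll(lookup(term));
        }
        return result;
    }

    private Set<String> lookup(String term) {
        Set<String> docs = postings.get(term.toLowerCase(Locale.ROOT));
        return docs == null ? Collections.<String>emptySet() : docs;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("invoice.pdf", "Quarterly invoice for search hardware");
        index.add("manual.pdf", "Search engine user manual");
        System.out.println(index.searchAll("search", "manual"));
        System.out.println(index.searchAny("invoice", "manual"));
    }
}
```

A sketch like this has no ranking, stemming, phrase queries or persistence, which is roughly the work Lucene saves you.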

That said, I would still recommend looking at Solr. It's much, much easier to set up than Lucene, and it has support for backups, replication, etc., as well as a nifty JSON interface which would fit your use case very well: http://wiki.apache.org/solr/SolJSON
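As a rough idea of how a Java/JSP page might call Solr's JSON interface, the sketch below only builds the request URL for the standard select handler, using the q, wt=json and rows parameters. The base URL, the class name SolrJsonQuery and the helper selectUrl are assumptions for illustration; the actual host, port and path depend on your installation.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a Solr select URL that asks for a JSON response.
class SolrJsonQuery {

    /**
     * baseUrl is assumed to point at the Solr instance (or core),
     * for example "http://localhost:8983/solr".
     */
    static String selectUrl(String baseUrl, String query, int rows) {
        try {
            // URL-encode the user query so spaces and operators survive transport.
            return baseUrl + "/select?q=" + URLEncoder.encode(query, "UTF-8")
                    + "&wt=json&rows=" + rows;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be available on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selectUrl("http://localhost:8983/solr", "annual AND report", 10));
    }
}
```

The page would then fetch that URL and render the document links from the JSON response.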

I think you want a system to manage your PDF files. Please try DSpace. DSpace is a digital library system, and its search is based on Lucene. www.dspace.org

Take a look at EPrints. It includes a workflow for adding new documents, automatically indexes and thumbnails PDFs, and has fairly comprehensive full-text search functionality. It can also be easily customised and branded.

Why re-invent the wheel? Again.

A great free search technology you might look at is the IBM Yahoo! free search. I'm not sure whether they followed through on plans to use Lucene under the covers, but it remains one of the really great, easy-to-use free search technologies. It handles up to 500K documents, I believe, and it supports PDF and other non-text formats as well. It has a graphical user interface, easily customized search results, basic search analytics, a basic thesaurus, and a powerful API, so you can do pretty much whatever you want if the out-of-the-box results are not to your liking. We've suggested this to a number of clients with fewer than half a million documents, and they love it.

Answering such a broad question in this forum will be tough. I'd recommend you check out the book Lucene in Action, which covers the basics of indexing and searching in a quite readable fashion.

Given your application, it sounds like Nutch and Solr probably will not be necessary. Since all of your documents are available locally, Nutch probably won't be helpful. Solr may help you manage a cluster of searchers if you have a high query load, but Lucene is highly performant and handles large document sets in a very scalable manner.

The one area that might consume a lot of your effort is the use of PDF. It's possible to index PDF documents, and there are Lucene contributions that facilitate the extraction of raw text from PDFs, but depending on the document, the quality of results can vary. Often, the context of a keyword in a PDF document is unclear because of formatting instructions, and that can make it hard to do proximity searches or show the context of a hit.
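One small mitigation for the context problem is to collapse the stray line breaks and runs of spaces that extraction tends to leave behind before cutting a snippet around the hit. The sketch below assumes the raw text has already been extracted; HitContext, normalize and snippet are hypothetical names, and it does nothing about harder cases such as hyphenation or multi-column layouts.

```java
import java.util.Locale;

// Hypothetical helper for showing the text around a keyword hit.
class HitContext {

    /** Collapses any run of whitespace (newlines, tabs, spaces) to one space. */
    static String normalize(String extracted) {
        return extracted.replaceAll("\\s+", " ").trim();
    }

    /**
     * Returns up to `radius` characters on each side of the first
     * case-insensitive occurrence of keyword, or "" if there is no hit.
     * Assumes lower-casing does not change the string length, which
     * holds for typical Latin-script text.
     */
    static String snippet(String extracted, String keyword, int radius) {
        String text = normalize(extracted);
        int hit = text.toLowerCase(Locale.ROOT).indexOf(keyword.toLowerCase(Locale.ROOT));
        if (hit < 0) {
            return "";
        }
        int start = Math.max(0, hit - radius);
        int end = Math.min(text.length(), hit + keyword.length() + radius);
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String raw = "full\n  text   search\tengine";
        System.out.println(snippet(raw, "search", 5));
    }
}
```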

If you have a Linux server, you could use Beagle to index them, and then just use the search functionality that comes with it. It has an (experimental) web search interface, and it can be hooked into the Firefox search box as well.

It automatically indexes files as they're added, and I suspect you'll find it much more efficient to enhance or fix Beagle than to write your own search interface to Lucene.

Having the (imho) distinct advantage of being on a Mac, I use SearchLight on a somewhat older G5. It is a nice web interface to Spotlight, Mac OS's built-in indexing service.