

Help with Search Engine Architecture .NET C#

Original post 2009-08-13 20:48:47 7 4 c#/ .net/ search/ full-text-search/ lucene

I'm trying to create a search engine for all literature (books, articles, etc.), music, and videos relating to a particular spiritual group. When a keyword is entered, I want to display links to all the PDF articles in which the keyword appears, as well as all the music and video files tagged with that keyword. The user should be able to filter the results by author/artist, place, date/time, etc. When the user clicks one of the result links (a book name, for instance), they are taken to another page that displays snippets from that book wherever the keyword is found.

I thought of using the Lucene library (or Searcharoo) to implement my PDF search, but I also need a database to tag all the other information so that results can be filtered by author/artist information, etc. So I was thinking of having tables for Text, Music, and Videos, each with a field containing the path to the file. When a keyword is entered, I need to search the DB for music and video files, and I also need to search the PDFs. When a filter is applied, the music and video search is easy, but limiting the text search based on the filters is getting confusing.

Is my approach correct? Are there better ways to do this? Since the search content is limited to this one spiritual group, the number of items to search is modest: I'd say about 100-500 books and 1000-5000 songs.

4 Answers

Lucene is a great way to get up and running quickly without too much effort, and it offers several extension points for tailoring indexing and search to your needs. Parsers for common file types such as HTML/XML, PDF, and MS Word documents are readily available to use with it, although extracting text from PDFs generally relies on a separate library rather than on Lucene itself.

Documents can carry a variety of Fields, and the Fields don't have to be uniform across all Documents. Music files, for example, might have attributes such as artist, title, and length that text-based content lacks. This makes Lucene well suited to storing different types of content.
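To illustrate what non-uniform fields look like in practice, here is a minimal sketch in plain Python. It does not use Lucene or Lucene.Net; the sample documents, field names, and the `build_index` and `search` helpers are all hypothetical, and the "index" is just a dictionary keyed by field and term.

```python
from collections import defaultdict

# Hypothetical sample documents: each is a dict of field -> value,
# and different content types carry different fields.
documents = [
    {"id": 1, "type": "book", "title": "Collected Talks", "author": "A. Author",
     "body": "a talk on meditation and daily practice"},
    {"id": 2, "type": "music", "title": "Morning Song", "artist": "B. Singer",
     "length_sec": 245, "tags": "meditation chant"},
    {"id": 3, "type": "video", "title": "Retreat 2008", "place": "Somewhere",
     "tags": "retreat lecture"},
]

def build_index(docs):
    """Map (field, term) -> set of document ids, for string-valued fields only."""
    index = defaultdict(set)
    for doc in docs:
        for field, value in doc.items():
            if isinstance(value, str):
                for term in value.lower().split():
                    index[(field, term)].add(doc["id"])
    return index

def search(index, term, fields):
    """Return sorted ids of documents containing `term` in any of the given fields."""
    hits = set()
    for field in fields:
        hits |= index.get((field, term.lower()), set())
    return sorted(hits)

index = build_index(documents)
# A keyword can match a book's body or a song's tags in the same query.
print(search(index, "meditation", ["body", "tags"]))
```

Because each entry only indexes the fields it actually has, a book and a song can sit in the same index and still be queried together or separately.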

Not knowing the exact details of what you're working on, I can't say whether this is feasible, but for tagging and other related features you might also consider running a database such as MySQL or SQL Server side by side with the Lucene index. Use the Lucene index for full-text search; then, once you have a result set, go to the database to fetch all the relational content. Our company has done this before, and it's not as big a headache as it sounds.

NOTE: If you decide to go this route, BE CAREFUL: the "unique id" provided by Lucene is highly volatile (it changes every time the index is optimized), so you will want to store the actual id (the primary key in the database) as a separate field on the Document.
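The index-plus-database flow can be sketched roughly as follows. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for MySQL or SQL Server, a plain list stands in for the Lucene index, and the table layout, the `db_id` field name, and both helper functions are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for MySQL or SQL Server (an assumption for this sketch).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE texts (id INTEGER PRIMARY KEY, title TEXT, author TEXT, path TEXT)")
db.executemany(
    "INSERT INTO texts (id, title, author, path) VALUES (?, ?, ?, ?)",
    [
        (101, "Collected Talks", "A. Author", "/docs/talks.pdf"),
        (102, "Letters", "C. Writer", "/docs/letters.pdf"),
        (103, "Essays", "A. Author", "/docs/essays.pdf"),
    ],
)

# Stand-in for the full-text index. Each entry stores the database primary key
# as its own field ("db_id"); the list position plays the role of the index's
# internal document number, which must not be relied on.
index_entries = [
    {"db_id": 103, "text": "essays on compassion and service"},
    {"db_id": 101, "text": "talks on meditation and compassion"},
    {"db_id": 102, "text": "letters about travel"},
]

def full_text_search(keyword):
    """Return the stored database ids of entries whose text contains the keyword."""
    keyword = keyword.lower()
    return [e["db_id"] for e in index_entries if keyword in e["text"].split()]

def search(keyword, author=None):
    """Full-text search first, then fetch and filter relational data by stored id."""
    ids = full_text_search(keyword)
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    sql = f"SELECT id, title, author, path FROM texts WHERE id IN ({placeholders})"
    params = list(ids)
    if author is not None:
        sql += " AND author = ?"
        params.append(author)
    sql += " ORDER BY id"
    return db.execute(sql, params).fetchall()

print(search("compassion"))
print(search("travel", author="A. Author"))
```

Reordering `index_entries` changes every entry's position but not its `db_id`, which is why the stored primary key is the safe join key. Applying the author filter in SQL after the keyword search is one way to limit text results by metadata.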

Another benefit, if you are set on using C#/.NET: there is a port called Lucene.Net, written entirely in C#. The downside is that it lags a few months behind the Java version on the latest features, but if you really need them, you can always check out the Java source and port the required updates manually.

Yes, there is a better approach. Try Solr, and in particular check out its faceting support. It will save you a lot of trouble.
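For anyone unfamiliar with the idea, faceting means counting the search hits per value of a metadata field, so the user can see how many results each filter would leave. The sketch below shows only the concept in plain Python; it is not Solr's API, and the sample hits and the `facet_counts` and `apply_filter` helpers are hypothetical.

```python
from collections import Counter

# Hypothetical search hits, each with the metadata fields used for filtering.
hits = [
    {"title": "Collected Talks", "type": "book", "author": "A. Author"},
    {"title": "Essays", "type": "book", "author": "A. Author"},
    {"title": "Letters", "type": "book", "author": "C. Writer"},
    {"title": "Morning Song", "type": "music", "author": "B. Singer"},
]

def facet_counts(results, field):
    """Count how many results carry each distinct value of `field`."""
    return Counter(r[field] for r in results if field in r)

def apply_filter(results, field, value):
    """Narrow the result set to one facet value."""
    return [r for r in results if r.get(field) == value]

# Counts per content type for the whole result set.
print(facet_counts(hits, "type"))

# After the user picks "book", recompute the author facet on the narrowed set.
books = apply_filter(hits, "type", "book")
print(facet_counts(books, "author"))
```

A search server with facet support computes these counts alongside the query results, which covers the author/artist, place, and date filters described in the question.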

You could try MS Search Server Express Edition; one of its major benefits is that it is free.

http://www.microsoft.com/enterprisesearch/en/us/search-server-express.aspx#none

If you definitely want to go the database route, then you should use SQL Server with Full Text Search enabled. You can use this with the Express editions, too. You can then store and search the contents of PDFs very easily (so long as you install the free Adobe PDF iFilter).