
Lucene and SQL Server - best practice

I am pretty new to Lucene, so I would like to get some help from you guys :)

BACKGROUND: Currently I have documents stored in SQL Server and want to use Lucene for full-text/tag searches on those documents.

Q1) In this case, in order to do keyword searches on the documents, should I insert all of those documents into the Lucene index? Does this mean there will be data duplication (one copy in SQL Server and another in the Lucene index)? That could be a problem, since we have a massive amount of documents (about 100GB). Is it inevitable?

Q2) Also, each document has a set of tags (up to 3). Is Lucene also a good choice for tag search? If so, how do I do it?

Thanks,

Yes, providing full-text search through Lucene and data storage through a traditional database is a well-supported architecture. Take a look here for a brief introduction. A typical implementation would be to index anything you want to be able to search on, store only a unique identifier in the Lucene index, and pull any records found by a search from the database, based on that ID. If you want to reduce DB load, you can store some information in Lucene to display a list of search results, and only query the database in order to fetch the full document.
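To make the "index everything, store only the ID" idea concrete, here is a minimal sketch in plain Java. It is a toy model, not the Lucene API: the class name `IdOnlyIndexSketch` and the `database` map (standing in for the SQL Server table) are made up for illustration, and a real Lucene index would also handle analysis, scoring and persistence.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy model of "index the text, keep only the ID, fetch the record from the database".
// Plain Java, not the Lucene API; all names here are hypothetical.
final class IdOnlyIndexSketch {
    // term -> IDs of the documents that contain it (the inverted index)
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // Index the text under the given ID; the text itself is not kept.
    void index(int id, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(id);
            }
        }
    }

    // IDs of the documents containing the term, in ascending order.
    List<Integer> search(String term) {
        Set<Integer> ids = postings.get(term.toLowerCase(Locale.ROOT));
        return ids == null ? new ArrayList<>() : new ArrayList<>(ids);
    }

    public static void main(String[] args) {
        // Stand-in for the SQL Server table: ID -> full document.
        Map<Integer, String> database = new HashMap<>();
        database.put(1, "Lucene builds an inverted index");
        database.put(2, "SQL Server stores the documents");

        IdOnlyIndexSketch index = new IdOnlyIndexSketch();
        database.forEach((id, text) -> index.index(id, text));

        // Search returns IDs only; the full record comes from the database.
        for (int id : index.search("documents")) {
            System.out.println(id + ": " + database.get(id));
        }
        // prints "2: SQL Server stores the documents"
    }
}
```

The point of the sketch is that the index holds terms and IDs but never the document text, so the only full copy of each document stays in the database.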

As for saving on space, there will be some measure of duplication. This is true even if you only use Lucene, though: Lucene stores the inverted index used for searching entirely separately from stored data. To save space, I'd recommend being very deliberate about what data you choose to index, and what you need to store and be able to retrieve later. What you store is particularly important for saving space in Lucene, since indexed-only values tend to be very space-efficient in most cases.

Lucene can certainly implement a tag search. The simple way to implement it would be to add each tag to a field of your choosing (I'll call it "tags", which seems to make sense) while building the document, such as:

document.add(new Field("tags", "widget", Field.Store.NO, Field.Index.ANALYZED));
document.add(new Field("tags", "forkids", Field.Store.NO, Field.Index.ANALYZED));

and I could then add a required term to any query to search only within a particular tag. For instance, if I were to search for "some stuff", but only with the tag "forkids", I could write a query like:

some stuff +tags:forkids
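The semantics of that query can be sketched in plain Java. This is a toy model with made-up names (`RequiredTagSketch`, `Doc`), not the Lucene API. It assumes the classic query parser with its default OR operator, where, as far as I know, the `+` clause must match and the plain terms only influence ranking once a required clause is present. If the text terms should be required as well, something like `+(some stuff) +tags:forkids` is probably closer to the intent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of the query "some stuff +tags:forkids": the tag is required,
// the text terms only affect the ordering. Not the Lucene API.
final class RequiredTagSketch {
    static final class Doc {
        final int id;
        final Set<String> words;
        final Set<String> tags;

        Doc(int id, String text, String... tags) {
            this.id = id;
            this.words = new HashSet<>(Arrays.asList(text.toLowerCase(Locale.ROOT).split("\\W+")));
            this.tags = new HashSet<>(Arrays.asList(tags));
        }

        // Number of optional terms that occur in this document's text.
        int matches(List<String> terms) {
            int n = 0;
            for (String term : terms) {
                if (words.contains(term)) n++;
            }
            return n;
        }
    }

    // IDs of documents carrying requiredTag, best text match first (ties by ID).
    static List<Integer> search(List<Doc> docs, List<String> optionalTerms, String requiredTag) {
        List<Doc> hits = new ArrayList<>();
        for (Doc doc : docs) {
            if (doc.tags.contains(requiredTag)) hits.add(doc); // the "+" clause must match
        }
        hits.sort(Comparator.comparingInt((Doc d) -> -d.matches(optionalTerms))
                .thenComparingInt(d -> d.id));
        List<Integer> ids = new ArrayList<>();
        for (Doc doc : hits) ids.add(doc.id);
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> docs = Arrays.asList(
                new Doc(1, "some stuff for grown-ups", "widget"),
                new Doc(2, "other things", "forkids"),
                new Doc(3, "some stuff to play with", "widget", "forkids"));
        System.out.println(search(docs, Arrays.asList("some", "stuff"), "forkids")); // prints [3, 2]
    }
}
```

Document 1 is dropped because it lacks the tag, and document 2 still shows up (ranked last) even though it contains neither "some" nor "stuff".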

Documents can also be stored in Lucene; you can retrieve and reference them using the document ID.

I would suggest using Solr (http://lucene.apache.org/solr/) on top of Lucene; it is more user-friendly and has multiValued fields (for the tags) available by default.

http://wiki.apache.org/solr/SchemaXml
