Using Solr/Lucene as persistence technology

Original post 2012-01-11 02:12:35 5 2 java/ solr/ lucene/ rdbms

Solr/Lucene's inverted index and query engine support a subset of RDBMS functionality, i.e. filtering, sorting, grouping, and paging. In this sense it is very close to a NoSQL database, as it also does not support transactions or joins.
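To make the "subset of RDBMS functionality" concrete, here is a toy inverted index in plain Java that does filtering, sorting, and paging. It is only a sketch to illustrate the idea and is not how Lucene is implemented internally; the class name TinyInvertedIndex and the numeric "price" field are made up for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy inverted index (hypothetical, for illustration only).
class TinyInvertedIndex {
    // term -> ids of the documents that contain the term
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // doc id -> stored numeric field used for sorting (made-up "price" field)
    private final Map<Integer, Integer> price = new HashMap<>();

    // Index one document: split on whitespace, lower-case, record postings.
    void add(int docId, String text, int docPrice) {
        for (String term : text.toLowerCase().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
        }
        price.put(docId, docPrice);
    }

    // Filtering: ids of documents containing every given term (AND query).
    Set<Integer> filter(String... terms) {
        TreeSet<Integer> result = null;
        for (String term : terms) {
            Set<Integer> ids = postings.getOrDefault(term.toLowerCase(), new TreeSet<>());
            if (result == null) {
                result = new TreeSet<>(ids);
            } else {
                result.retainAll(ids);
            }
        }
        return result == null ? new TreeSet<>() : result;
    }

    // Sorting and paging: order hits by price (ties by id), return one page.
    List<Integer> search(int offset, int limit, String... terms) {
        List<Integer> hits = new ArrayList<>(filter(terms));
        hits.sort(Comparator.comparingInt((Integer id) -> price.get(id))
                .thenComparingInt(id -> id));
        int from = Math.min(offset, hits.size());
        int to = Math.min(from + limit, hits.size());
        return new ArrayList<>(hits.subList(from, to));
    }
}
```

Note what is missing: there is no way to update two documents atomically and no way to join one document set against another, which is the same gap the question points out for Solr/Lucene.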

With a framework like Hibernate Search, it is possible to map even complex objects to the index and perform basic CRUD operations, while supporting full-text search.

Considerations:

1) Write throughput: From my past experience, a Lucene index's write throughput is much lower than an RDBMS's.

2) Query speed: Query speed for a Lucene index should be comparable, if not faster, due to the inverted index.

3) Scalability: Could be resolved using replication or SolrCloud.

4) Ability to handle large data sets: I have used a Lucene index with 15M+ documents on a single JVM without any performance issues.

Background:

I am currently using MongoDB with Solr and it is working well enough. However, it is not as "simple" as I would like it to be, due to:

Keeping Mongo and the Solr index in sync (not a trivial task)
Transformation between Java object <-> mongo <-> solr (Spring Data and SolrJ help, but it is still not great).
Why use two "persistence" technologies if one will do?
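To show why keeping two stores in sync is not trivial, here is a minimal dual-write sketch in plain Java. Everything in it is hypothetical: Store, InMemoryStore, and DualWriter are stand-ins invented for this example, and a real setup would wrap a MongoDB client and SolrJ behind the Store interface.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical abstraction over "something that can persist a document".
interface Store {
    void save(String id, String doc);
}

// In-memory stand-in so the sketch runs without MongoDB or Solr.
class InMemoryStore implements Store {
    final Map<String, String> data = new HashMap<>();
    boolean failing = false; // flip to simulate an outage

    @Override
    public void save(String id, String doc) {
        if (failing) {
            throw new IllegalStateException("store unavailable");
        }
        data.put(id, doc);
    }
}

// Writes to the primary store first, then to the search index.
// If the index write fails, the write is remembered and replayed later.
class DualWriter {
    private final Store primary;
    private final Store index;
    private final Map<String, String> pending = new LinkedHashMap<>();

    DualWriter(Store primary, Store index) {
        this.primary = primary;
        this.index = index;
    }

    void save(String id, String doc) {
        primary.save(id, doc); // a failure here propagates; nothing is indexed
        try {
            index.save(id, doc);
            pending.remove(id); // a newer successful write supersedes a queued one
        } catch (RuntimeException e) {
            pending.put(id, doc); // the index is now stale for this id
        }
    }

    // Replays queued index writes; returns how many are still pending.
    int retryPending() {
        Iterator<Map.Entry<String, String>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, String> e = it.next();
            try {
                index.save(e.getKey(), e.getValue());
                it.remove();
            } catch (RuntimeException ex) {
                break; // index still down; keep the rest queued
            }
        }
        return pending.size();
    }
}
```

Even this small sketch leaves real problems open: the pending queue lives in memory and is lost on a crash, deletes are not handled, and concurrent writers are not considered. That bookkeeping is the cost of running two "persistence" technologies side by side.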

From the small-scale tests I have done so far, I haven't found any technical roadblock that would prevent me from using Solr/Lucene as persistence. However, I also don't want to commit to such a drastic refactoring without more information. I am also aware of projects like Solandra that attempt to bring NoSQL and Solr together, but they don't seem to be mature enough.

Question

So, for applications where full-text search is a major (but not the only) requirement, is it feasible to forgo traditional (RDBMS) and contemporary (NoSQL) data stores?

Great reference, thanks to raticulin:

Atlassian (Jira) - Lucene Generic Data Indexing

2 answers

Lucene - Full-text search / information retrieval library. Solr - Enterprise search server built on top of Lucene.

Lucene/Solr should not be used in place of a persistence layer. They will not be able to replace an RDBMS, nor is it useful to compare them to one; you are comparing apples and oranges.

1) Comparing Lucene's indexing throughput directly with an RDBMS will not help, and it is not a fair comparison: a number of factors affect Lucene's throughput, depending on your search schema configuration.
2) Lucene has one of the best-known and best data structures for information retrieval. The query speed you get depends on a number of factors, such as configuration and hardware.
3) Obviously, that's the way to go.
4) Handling 15M+ documents on a single JVM is great, but that figure does not say much without knowing the document size, feature set used, JVM memory, CPU cores, etc.

Now, if your problem is that the RDBMS is a real scalability bottleneck, you could pick a NoSQL data store based on your persistence needs and then integrate Solr/Lucene with it to provide full-text search capability. Since NoSQL is rapidly evolving and fairly new, you might not find stable adapters to integrate Solr/Lucene with NoSQL.

Edit:

Now that the question is updated: this is already well debated in the question "NoSQL (MongoDB) vs Lucene (or Solr) as your database". It can be a pain to have too many moving parts, and Lucene/Solr could very well replace MongoDB, depending on the app. But you have to consider that NoSQL data stores are built from the ground up to be fully distributed, so you don't lose functionality or end up with limited functionality due to scaling. Solr was not built with distributed computing in mind, so there are Distributed Search limitations when it comes to horizontal scaling. SolrCloud may be the answer to that.

I think I remember watching a presentation from Atlassian where they explained that for Jira they were using just Lucene nowadays; they had dropped their previous DB (whatever it was) and were using Lucene as storage too. They were happy.

If someone can confirm it was them, that would be cool.

Edit:

http://blogs.atlassian.com/rebelutionary/downloads/tssjs2007-lucene-generic-data-indexing.pdf