
Using Solr search index as a database - is this “wrong”?

My team is working with a third-party CMS that uses Solr as a search index. I've noticed that the authors seem to be using Solr as a database of sorts, in that each document returned contains two fields:

  1. The Solr document ID (basically a class name and database ID)
  2. An XML representation of the entire object

So basically it runs a search against Solr, downloads the XML representation of the object, and then instantiates the object from the XML rather than looking it up in the database using the ID.

My gut feeling tells me this is bad practice. Solr is a search index, not a database, so it makes more sense to me to execute our complex searches against Solr, get the document IDs, and then pull the corresponding rows out of the database.

Is the current implementation perfectly sound, or is there data to support the idea that this is ripe for refactoring?

EDIT: When I say "XML representation", I mean one stored field that contains an XML string of all of the object's properties, not multiple stored fields.
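To make the two approaches in the question concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: `search_solr` is a stand-in that returns canned hits instead of calling Solr, and the `article` table, the field names, and the `ClassName:db_id` ID format are hypothetical rather than taken from the CMS.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical stand-in for a Solr response: each hit carries an id of the
# form "ClassName:db_id" and one stored field holding the whole object as XML.
FAKE_SOLR_HITS = [
    {"id": "Article:2", "xml": "<article><id>2</id><title>World</title></article>"},
    {"id": "Article:1", "xml": "<article><id>1</id><title>Hello</title></article>"},
]


def search_solr(query):
    """Pretend search; a real version would call Solr over HTTP."""
    return FAKE_SOLR_HITS


def load_from_xml(query):
    """Current approach: build objects straight from the stored XML field."""
    articles = []
    for hit in search_solr(query):
        root = ET.fromstring(hit["xml"])
        articles.append(
            {"id": int(root.findtext("id")), "title": root.findtext("title")}
        )
    return articles


def load_from_database(query, conn):
    """Alternative: take only the ids from Solr, then read rows from the database."""
    ids = [int(hit["id"].split(":", 1)[1]) for hit in search_solr(query)]
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title FROM article WHERE id IN ({placeholders})", ids
    ).fetchall()
    titles = {row[0]: row[1] for row in rows}
    # Keep the order Solr returned, since that order carries the relevance ranking.
    return [{"id": i, "title": titles[i]} for i in ids if i in titles]


def make_demo_db():
    """Create an in-memory database holding the same two demo articles."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT)")
    conn.executemany("INSERT INTO article VALUES (?, ?)", [(1, "Hello"), (2, "World")])
    return conn
```

The second function costs an extra round trip to the database, but it always returns the current row rather than whatever was serialized at index time.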

Yes, you can use SOLR as a database, but there are some really serious caveats:

  1. SOLR's most common access pattern, over HTTP, doesn't respond particularly well to batch querying. Furthermore, SOLR does NOT stream data, so you can't lazily iterate through millions of records at a time. This means you have to be very thoughtful when you design large-scale data access patterns with SOLR.

  2. Although SOLR performance scales horizontally (more machines, more cores, etc.) as well as vertically (more RAM, better machines, etc.), its querying capabilities are severely limited compared to those of a mature RDBMS. That said, there are some excellent functions, like the field stats queries, which are quite convenient.

  3. Developers who are used to relational databases will often run into problems when they apply the same DAO design patterns in a SOLR paradigm, because of the way SOLR uses filters in queries. There will be a learning curve in developing the right approach to building an application that uses SOLR for part of its large queries or stateful modifications.

  4. The "enterprisey" tools for advanced session management and stateful entities that many advanced web frameworks (Ruby, Hibernate, ...) offer will have to be thrown completely out the window.

  5. Relational databases are meant to deal with complex data and relationships, and they are thus accompanied by state-of-the-art metrics and automated analysis tools. In SOLR, I've found myself writing such tools and manually stress-testing a lot, which can be a time sink.

  6. Joining: this is the big killer. Relational databases support methods for building and optimizing views and queries that join tuples based on simple predicates. In SOLR, there aren't any robust methods for joining data across indices.

  7. Resiliency: for high availability, SolrCloud uses a distributed file system underneath (i.e. HCFS). This model is quite different from that of a relational database, which usually achieves resiliency using masters and slaves, or RAID, and so on. So you have to be ready to provide the resiliency infrastructure SOLR requires if you want it to be cloud-scalable and resilient.
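On the batch-access caveat in item 1: later Solr releases document cursor-based deep paging through the `cursorMark` request parameter and the `nextCursorMark` response field, which gives something close to lazy iteration. The sketch below only shows the loop. `fetch_page` is a hypothetical callable standing in for whatever HTTP client sends the parameters to the `/select` handler and returns the parsed JSON, and the sort field `id` assumes that `id` is the collection's unique key, which cursor paging requires in the sort.

```python
def iterate_all(fetch_page, query, rows=500, sort="id asc"):
    """Yield every matching document, fetching one page at a time.

    fetch_page is a hypothetical callable: it takes a dict of request
    parameters and returns the parsed JSON response as a dict.
    """
    cursor = "*"  # documented starting value for cursorMark
    while True:
        params = {"q": query, "rows": rows, "sort": sort, "cursorMark": cursor}
        response = fetch_page(params)
        yield from response["response"]["docs"]
        next_cursor = response["nextCursorMark"]
        if next_cursor == cursor:  # an unchanged cursor means no more results
            break
        cursor = next_cursor
```

Because this is a generator, the caller holds only one page in memory at a time, although each page is still a separate HTTP round trip.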

That said, there are plenty of obvious advantages to SOLR for certain tasks (see http://wiki.apache.org/solr/WhyUseSolr): loose queries are much easier to run and return meaningful results. Indexing is done by default, so most arbitrary queries run pretty effectively (unlike an RDBMS, where you often have to optimize and de-normalize after the fact).

Conclusion: Even though you CAN use SOLR as an RDBMS, you may find (as I have) that there is ultimately "no free lunch": the benefits of super-cool Lucene text searches and high-performance, in-memory indexing are often paid for with less flexibility and the adoption of new data access workflows.

It's perfectly reasonable to use Solr as a database, depending on your application. In fact, that's pretty much what guardian.co.uk is doing.

It's definitely not bad practice per se. It's only bad if you use it the wrong way, just like any other tool at any level, even GOTOs.

When you say "An XML representation..." I assume you're talking about having multiple stored Solr fields and retrieving them using Solr's XML format, and not just one big XML-content field (which would be a terrible use of Solr). The fact that Solr uses XML as its default response format is largely irrelevant; you can also use a binary protocol, so it's quite comparable to traditional relational databases in that regard.

Ultimately, it's up to your application's needs. Solr is primarily a text search engine, but it can also act as a NoSQL database for many applications.

This was probably done for performance reasons; if it doesn't cause any problems, I would leave it alone. There is a big grey area between what should be in a traditional database and what should be in a Solr index. I've seen people do similar things (usually with key-value pairs or JSON instead of XML) for UI presentation, and only get the real object from the database if it is needed for updates or deletes. But all reads just go to Solr.

I've seen similar things done because it allows for very fast lookup. We're moving data out of our Lucene indexes into a fast key-value store to follow DRY principles and also decrease the size of the index. There's no hard-and-fast rule for this sort of thing.

Adding to @Jayunit100's response: using Solr as a database, you get availability and partition tolerance at the cost of some consistency. There is going to be a configurable lag between when you write something and when you can read it back.
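One place that lag shows up is the `commitWithin` parameter on update requests, which asks Solr to make the added documents searchable within the given number of milliseconds. The sketch below builds such a request without sending it. The host, the core name `demo`, and the helper name `build_update_request` are assumptions for illustration, and the commit settings in a given installation's `solrconfig.xml` also affect when writes become visible.

```python
import json
import urllib.parse
import urllib.request


def build_update_request(base_url, core, docs, commit_within_ms):
    """Build (but do not send) a JSON update request with a commitWithin bound."""
    query = urllib.parse.urlencode({"commitWithin": commit_within_ms})
    url = f"{base_url}/solr/{core}/update?{query}"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical usage: documents should become searchable within about 5 seconds.
request = build_update_request(
    "http://localhost:8983", "demo", [{"id": "1", "title": "Hello"}], 5000
)
```

Until that commit happens, a search issued right after the write may not return the new document, which is the consistency trade-off described above.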

I had a similar idea: in my case, to store some simple JSON data in Solr, using Solr as a database. However, a BIG caveat that changed my mind was the Solr upgrade process.

Please see https://issues.apache.org/jira/browse/LUCENE-9127.

Apparently, in the past (pre v6) the recommendation was to re-index documents after major version upgrades (not just run IndexUpgrader), although you did not have to do this to maintain functionality (I cannot vouch for this myself; it is from what I have read). Now, if you have upgraded two major versions but did not re-index after the first major version upgrade (actually, fully delete the docs and then the index files themselves), your core is not recognized.

Specifically, in my case I started with Solr v6. After upgrading to v7, I ran IndexUpgrader, so the index was then at v7. After upgrading to v8, the core would not load. I had no idea why: my index was at v7, so that satisfies the version-minus-1 compatibility statement from Solr, right? Well, no. Wrong.

I did an experiment. I started fresh from v6.6, created a core and added some documents. I upgraded to v7.7.3 and ran IndexUpgrader, so the index for that core was then at v7.7.3. I upgraded to v8.6.0, after which the core would not load. Then I repeated the same steps, except that after running IndexUpgrader I also re-indexed the documents. Same problem. Then I repeated everything again, except that I did not just re-index: I deleted the docs from the index, deleted the index files, and then re-indexed. This time, when I arrived at v8.6.0, my core was there and everything was OK.

So, the takeaway for the OP or anyone else contemplating this idea (using Solr as a DB) is that you must EXPECT and PLAN to re-index your documents/data from time to time, meaning you must store them somewhere else anyway (a previous poster alluded to this idea), which sort of defeats the concept of a database. The exceptions are if your Solr core/index will be short-lived (it will not last through more than one major Solr version upgrade), if you never intend to upgrade Solr by more than one version, or if the Solr devs change this upgrade limitation. So, as an index for data stored elsewhere (and readily available for re-indexing when necessary), Solr is excellent. As a database for the data itself, it strongly "depends".
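Planning for a re-index mostly means keeping a repeatable path from the system of record to a fresh, empty index. A minimal sketch of that path follows, under stated assumptions: the `record` table, its `id` and `body` columns, and the `send_batch` uploader are all hypothetical, and a real `send_batch` would post each batch to the new core's update handler.

```python
import sqlite3


def batches_for_reindex(conn, batch_size=1000):
    """Yield lists of Solr-style documents read from the system of record."""
    cursor = conn.execute("SELECT id, body FROM record ORDER BY id")
    while True:
        rows = cursor.fetchmany(batch_size)
        if not rows:
            break
        yield [{"id": str(row[0]), "body": row[1]} for row in rows]


def reindex(conn, send_batch, batch_size=1000):
    """Push every record to a fresh index; send_batch is a hypothetical uploader."""
    total = 0
    for batch in batches_for_reindex(conn, batch_size):
        send_batch(batch)
        total += len(batch)
    return total
```

With something like this in place, a major-version upgrade becomes "create an empty core on the new version, then run `reindex`", rather than relying on the old index files being readable.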
