
Solr schema design and performance

Original post: 2014-09-27 21:31:40, tagged solr

I have a books database with three entities: books, pages and titles (titles found in a page). I am confused and concerned about the performance of two approaches to the schema design:

1- Dealing with books as documents, i.e. a book field, a multiValued pages field and a multiValued titles field. In this approach all of a book's data is represented in one Solr document with very large fields.

2- Dealing with pages as documents, which leads to much smaller fields but a larger number of documents.

I looked at this official resource but was not able to find a clear answer to my question.

1 answer

Assuming you are going to take Solr results and present them through another application, I would make the smallest item, Titles, the model for documents, which makes it much easier to show where a result appears. Doing it this way minimizes the amount of application code you need to write. If your users are querying Solr directly, I might use Page as my document instead; presumably you would then use Solr's highlighting feature to help your users see how their search terms matched.

For Title documents I would model the schema as follows:

Book ID + Page Number + Title [string - unique key]
Book ID [integer]
Book Name [tokenized text field]
Page Number [TrieIntField]
Title [tokenized text field]
Content for that book/title/page combination [tokenized text field]
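
As a rough illustration of this document model, the sketch below flattens one book into one document per book/page/title combination and builds the composite unique key. The field names (`id`, `book_id`, `book_name`, `page_number`, `title`, `content`), the `-` separator and the shape of the input data are assumptions made for the example; they are not taken from the question.

```python
from typing import Dict, Iterator


def title_documents(book: Dict) -> Iterator[Dict]:
    """Yield one Solr-style document per (book, page, title) combination."""
    for page in book["pages"]:
        for title in page["titles"]:
            yield {
                # Composite unique key: book ID + page number + title.
                "id": f'{book["id"]}-{page["number"]}-{title["text"]}',
                "book_id": book["id"],
                "book_name": book["name"],
                "page_number": page["number"],
                "title": title["text"],
                "content": title["content"],
            }


# Hypothetical sample data, only to show the shape of the output.
sample_book = {
    "id": 42,
    "name": "A Sample Book",
    "pages": [
        {"number": 1, "titles": [{"text": "Introduction", "content": "First words."}]},
        {"number": 2, "titles": [
            {"text": "Background", "content": "Some history."},
            {"text": "Scope", "content": "What is covered."},
        ]},
    ],
}

docs = list(title_documents(sample_book))
for doc in docs:
    print(doc["id"], "->", doc["title"])
```

Because each hit already carries its book ID and page number, the presenting application can link straight to the right place without a second lookup.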

There may be other attributes you want to capture, such as author, publication date or publisher, but you do not say what other information you have, so I leave them out of this example.

Textual queries can then involve Book Name, Title and Content. You may want to define a single field that is indexed but not stored and that serves as the target of <copyField/> declarations in your schema.xml, which makes it easy to search all three at the same time.
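
The answer describes this in terms of schema.xml. As one possible way to express the same idea, the sketch below builds a JSON payload for Solr's Schema API (`add-field` and `add-copy-field` commands), which is only available on installations that use a managed schema. The field names `text_all`, `book_name`, `title` and `content` and the `text_general` field type are assumptions for illustration, and the payload is only printed, not sent.

```python
import json
from typing import Dict, Sequence


def catch_all_commands(
    dest: str = "text_all",
    sources: Sequence[str] = ("book_name", "title", "content"),
    field_type: str = "text_general",
) -> Dict:
    """Build Schema API commands for an indexed, non-stored catch-all field."""
    return {
        "add-field": {
            "name": dest,
            "type": field_type,
            "indexed": True,      # searchable
            "stored": False,      # not returned in results
            "multiValued": True,  # receives values from several source fields
        },
        # One copy rule per source field, all pointing at the catch-all field.
        "add-copy-field": [{"source": s, "dest": dest} for s in sources],
    }


payload = catch_all_commands()
print(json.dumps(payload, indent=2))
```

A query against the catch-all field then matches text from any of the three source fields, while the individual fields remain available for field-specific queries and display.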

For indexing, without knowing more about the data being indexed, I would use the ICU Tokenizer and the Snowball Porter Stemming Filter with a language specification on the text fields to handle non-English data, assuming all the books are in the same language. If the books are in English, use the Standard Tokenizer instead of ICU.
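
A minimal sketch of that analyzer choice, again written as a Schema API `add-field-type` payload rather than schema.xml. The type names and the lowercase filter are assumptions added for the example, and `solr.ICUTokenizerFactory` is only usable when Solr's ICU analysis module is installed.

```python
from typing import Dict


def text_field_type(name: str, language: str = "English") -> Dict:
    """Build an add-field-type command with a language-specific stemmer."""
    # Standard tokenizer for English, ICU tokenizer for everything else.
    if language == "English":
        tokenizer = "solr.StandardTokenizerFactory"
    else:
        tokenizer = "solr.ICUTokenizerFactory"
    return {
        "add-field-type": {
            "name": name,
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": tokenizer},
                "filters": [
                    # Lowercasing is an assumption, not part of the answer.
                    {"class": "solr.LowerCaseFilterFactory"},
                    {"class": "solr.SnowballPorterFilterFactory", "language": language},
                ],
            },
        }
    }


english = text_field_type("text_book_en")
german = text_field_type("text_book_de", "German")
print(english["add-field-type"]["analyzer"]["tokenizer"]["class"])
print(german["add-field-type"]["analyzer"]["tokenizer"]["class"])
```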