Solr schema design and performance

I have a books database with three entities: books, pages, and titles (titles found within a page). I am confused and concerned about the performance trade-off between two approaches to the schema design:

1- Treating books as documents, i.e. a book field, a multiValued pages field, and a multiValued titles field. In this approach all of a book's data is represented in one Solr document with very large fields.

2- Treating pages as documents, which leads to much smaller fields but a much larger number of documents.

I tried to look at this official resource, but I could not find a clear answer to my question.

Assuming you are going to take Solr results and present them through another application, I would make the smallest item, the Title, the model for documents. That makes it much easier to show where a result appears, and it minimizes the amount of application code you need to write. If your users are querying Solr directly, I might use Page as my document instead; presumably you would then be using Solr's highlighting feature to help your users see how their search terms matched.

For Title documents I would model the schema as follows:

  1. Book ID + Page Number + Title [string - unique key]
  2. Book ID [integer]
  3. Book Name [tokenized text field]
  4. Page Number [TrieIntField]
  5. Title [tokenized text field]
  6. Content for that book/title/page combination [tokenized text field]
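
As a rough illustration, the six fields above might be declared in schema.xml along these lines. This is only a sketch: the field names (`id`, `book_id`, `book_name`, `page_number`, `title`, `content`) are made up for this example, and the `string`, `int`, and `text_general` types are assumed to be defined as in the example schema that ships with Solr (where `int` is typically a `solr.TrieIntField` in older releases).

```xml
<!-- Fragment of schema.xml; all field names here are illustrative assumptions -->
<fields>
  <!-- Composite key built by the indexing application from
       book ID + page number + title, e.g. "42_17_Introduction" -->
  <field name="id"          type="string"       indexed="true" stored="true" required="true"/>
  <field name="book_id"     type="int"          indexed="true" stored="true"/>
  <field name="book_name"   type="text_general" indexed="true" stored="true"/>
  <field name="page_number" type="int"          indexed="true" stored="true"/>
  <field name="title"       type="text_general" indexed="true" stored="true"/>
  <!-- Text belonging to this book/title/page combination -->
  <field name="content"     type="text_general" indexed="true" stored="true"/>
</fields>

<uniqueKey>id</uniqueKey>
```

Storing `book_id` and `page_number` alongside the text fields lets the presenting application link a hit straight back to its book and page without a second lookup.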

There may be other attributes you want to capture, such as author, publication date, or publisher, but you do not say what other information you have, so I have left them out of this example.

Textual queries can then involve Book Name, Title, and Content. You may want to define a single field that is indexed but not stored and that serves as the target for <copyField/> declarations in your schema.xml, so that all three can be searched at the same time.
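
A minimal sketch of such a catch-all field is shown here. The names `text_all`, `book_name`, `title`, and `content` are hypothetical, and `text_general` is assumed to be a tokenized text type already defined in your schema.

```xml
<!-- Catch-all search target: indexed for querying but not stored.
     multiValued because it receives values from several source fields. -->
<field name="text_all" type="text_general" indexed="true" stored="false" multiValued="true"/>

<!-- Copy each searchable text field into the catch-all field at index time -->
<copyField source="book_name" dest="text_all"/>
<copyField source="title"     dest="text_all"/>
<copyField source="content"   dest="text_all"/>
```

Queries can then target `text_all` alone, while the individual stored fields remain available for display and highlighting.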

For indexing, without knowing more about the data being indexed, I would use the ICU Tokenizer and the Snowball Porter Stemming Filter, with a language specification, on the text fields to handle non-English data, assuming all the books are in the same language. If the books are in English, use the Standard Tokenizer instead of ICU.
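
For illustration, a field type combining those two components could look roughly like this. The type name `text_book` and the choice of French are assumptions made for the example, and the lower-casing filter is an addition that the advice above does not mention. Note that the ICU tokenizer is not part of Solr's core analysis classes, so the ICU analysis jars from the analysis-extras contrib need to be on the classpath.

```xml
<!-- Hypothetical field type for non-English book text;
     "French" is only an example value for the stemmer language -->
<fieldType name="text_book" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- For English-only data, solr.StandardTokenizerFactory could be used here instead -->
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <!-- Assumption: lower-case tokens before stemming -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>
```

Because a single `<analyzer>` element is declared, the same chain applies at both index and query time, which keeps stemmed query terms aligned with stemmed indexed terms.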
