
Solr schema design and performance

I have a books database with three entities: books, pages and titles (the titles found on a page). I am confused and concerned about performance when choosing between two approaches to the schema design:

1- Treating books as documents, i.e. a book field, a multiValued pages field and a multiValued titles field. In this approach all of a book's data is represented in one Solr document with very large fields.

2- Treating pages as documents, which leads to much smaller fields but a larger number of documents.

I tried to look at this official resource, but I was not able to find a clear answer to my question.

Assuming you are going to take Solr results and present them through another application, I would make the smallest item - Titles - the model for documents, which makes it much easier to show where a result appears. Doing it this way minimizes the amount of application code you need to write. If your users are querying Solr directly, I might use Page as my document instead - presumably you would then use Solr's highlighting feature to help your users identify how their search term(s) matched.

For Title documents I would model the schema as follows:

  1. Book ID + Page Number + Title [string - unique key]
  2. Book ID [integer]
  3. Book Name [tokenized text field]
  4. Page Number [TrieIntField]
  5. Title [tokenized text field]
  6. Content for that book/title/page combination [tokenized text field]
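As a rough illustration of the list above, this Python sketch flattens one book into title-level documents. The input layout (`pages`, `titles`, `content`), the Solr field names and the `_` separator in the key are assumptions made for the example; none of them come from the question.

```python
def title_documents(book):
    """Flatten one book into one Solr document per (page, title) pair."""
    docs = []
    for page in book["pages"]:
        for title in page["titles"]:
            docs.append({
                # Unique key: Book ID + Page Number + Title
                "id": f"{book['id']}_{page['number']}_{title['text']}",
                "book_id": book["id"],
                "book_name": book["name"],
                "page_number": page["number"],
                "title": title["text"],
                # Content belonging to this book/title/page combination
                "content": title["content"],
            })
    return docs


# Hypothetical sample data, only to show the shape of the result.
book = {
    "id": 42,
    "name": "Example Book",
    "pages": [
        {"number": 1, "titles": [{"text": "Preface", "content": "First words."}]},
        {"number": 2, "titles": [
            {"text": "Chapter One", "content": "Opening text."},
            {"text": "Background", "content": "More text."},
        ]},
    ],
}
docs = title_documents(book)
```

Each resulting dictionary can be sent to Solr as one document, so a hit already carries the book, page and title it belongs to.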

There may be other attributes you want to capture, such as author, publication date, publisher, but you do not explain above what other information you have so I leave that out of this example.

Textual queries can then involve Book Name, Title and Content. You may want to define a single field that is indexed but not stored, and that serves as the target of <copyField/> declarations in your schema.xml, so that all three can be searched at the same time.
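A minimal sketch of that catch-all field, written as the JSON commands accepted by Solr's Schema API rather than as schema.xml entries. It only builds and prints the payloads and sends nothing. The field names (`book_name`, `title`, `content`, `text_all`) and the `text_general` field type are assumptions; substitute whatever your schema defines.

```python
import json

# Assumed names of the three text fields to search together.
TEXT_SOURCES = ["book_name", "title", "content"]


def catch_all_commands(dest="text_all"):
    """Build Schema API commands for a search-only field fed by copy fields."""
    commands = [{
        "add-field": {
            "name": dest,
            "type": "text_general",  # assumed to exist in the schema
            "indexed": True,         # searchable
            "stored": False,         # never returned in results
            "multiValued": True,     # receives values from several sources
        }
    }]
    for source in TEXT_SOURCES:
        # Same role as a <copyField source="..." dest="..."/> line in schema.xml
        commands.append({"add-copy-field": {"source": source, "dest": dest}})
    return commands


for command in catch_all_commands():
    print(json.dumps(command))
```

Queries can then target `text_all` alone, while the individual fields remain available for display and for field-specific searches.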

For indexing, without knowing more about the data being indexed, I would use the ICU Tokenizer and the Snowball Porter Stemming Filter with a language specification on the text fields to handle non-English data - assuming all the books are in the same language. If the books are in English, I would use the Standard Tokenizer instead of ICU.
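The tokenizer choice above could be sketched as a Schema API `add-field-type` payload. The type name `text_books`, the added lowercase filter and the helper function itself are assumptions for illustration. As far as I know, `solr.ICUTokenizerFactory` also needs Solr's analysis-extras module to be available.

```python
import json


def text_field_type(name, language):
    """Build an add-field-type command: Standard tokenizer for English, ICU otherwise."""
    if language == "English":
        tokenizer = "solr.StandardTokenizerFactory"
    else:
        tokenizer = "solr.ICUTokenizerFactory"
    return {
        "add-field-type": {
            "name": name,
            "class": "solr.TextField",
            "analyzer": {
                "tokenizer": {"class": tokenizer},
                "filters": [
                    # Lowercasing before stemming is an assumption, not from the answer.
                    {"class": "solr.LowerCaseFilterFactory"},
                    # Snowball stemmer with an explicit language, e.g. "German".
                    {"class": "solr.SnowballPorterFilterFactory", "language": language},
                ],
            },
        }
    }


print(json.dumps(text_field_type("text_books", "German"), indent=2))
```

The same analyzer is used at index and query time here, which keeps stemmed query terms aligned with the stemmed index terms.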
