
What's the difference between the inverted index and docValues in Solr?

I have read many articles here. I have concluded that the index is for searching, and docValues are for sorting and faceting. What confuses me is whether the index and docValues are the same data structure, or at least the same idea (store a column value to get the doc ID). If they are not the same, where do they differ?

Inverted index ::

The inverted index is the concept on which the search library Lucene is built. The standard way that Solr builds its index is as an inverted index. This style builds a list of the terms found in all the documents in the index, and next to each term is a list of the documents that the term appears in (as well as how many times the term appears in each document). This makes search very fast: since users search by terms, having a ready term-to-document list speeds up query processing. It is like finding the pages of a book related to a keyword by scanning the index at the back of the book, as opposed to reading every word on every page. This type of index is called an inverted index because it inverts a page-centric data structure (page->words) into a keyword-centric data structure (word->pages). Solr stores this index in a directory called index inside the data directory.
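To make the term-to-document mapping concrete, here is a minimal sketch in Python. It is an illustration only, not how Lucene stores its index on disk; the function name `build_inverted_index`, the sample documents, and the whitespace tokenizer are all assumptions made for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Naive analysis step: lowercase and split on whitespace.
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Hypothetical documents keyed by doc ID.
docs = {
    0: "solr builds an inverted index",
    1: "lucene powers solr",
    2: "an index maps terms to documents",
}

index = build_inverted_index(docs)

# Searching for a term is a single dictionary lookup, not a scan of every document.
print(sorted(index["solr"]))
print(index["index"])
```

The lookup direction is the point: given a term, the structure answers "which documents?" directly.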

DocValues ::

For other features that we now commonly associate with search, such as sorting, faceting, and highlighting, the inverted index is not very efficient. The faceting engine, for example, must look up each term that appears in each document of the result set and pull the document IDs in order to build the facet list. In Solr, this structure is maintained in memory, and it can be slow to load (depending on the number of documents, terms, etc.).

In Lucene 4.0, a new approach was introduced. DocValues fields are column-oriented fields with a document-to-value mapping built at index time, which is the reverse direction of the inverted index's term-to-document mapping. This approach promises to relieve some of the memory requirements of the fieldCache and to make lookups for faceting, sorting, and grouping much faster.
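The document-to-value direction can be sketched the same way. This is a simplified illustration, assuming a single-valued "category" field held as a plain Python list indexed by doc ID; the names `category_doc_values`, `facet_counts`, and `sort_by_field` are hypothetical, and real docValues are an on-disk columnar format rather than an in-memory list.

```python
from collections import Counter

# Hypothetical column for a "category" field: list position = doc ID, entry = field value.
category_doc_values = ["book", "music", "book", "film", "music", "book"]


def facet_counts(doc_values, matching_doc_ids):
    """Count field values over the result set with one lookup per document."""
    return Counter(doc_values[doc_id] for doc_id in matching_doc_ids)


def sort_by_field(doc_values, matching_doc_ids):
    """Order the result set by the field value of each document."""
    return sorted(matching_doc_ids, key=lambda doc_id: doc_values[doc_id])


# Doc IDs that some search already matched (via the inverted index).
hits = [4, 3, 2, 0]

counts = facet_counts(category_doc_values, hits)
ordered = sort_by_field(category_doc_values, hits)
print(counts)
print(ordered)
```

Here the question being answered is "what value does this document have?", which is what sorting and faceting need for every hit, so no term dictionary has to be walked.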

You only need to enable docValues for the fields you will use them with. As with all schema design, you define a field type and then define fields of that type with docValues enabled. Enabling docValues for a field only requires adding docValues="true" to the field definition. DocValues are only available for specific field types.

<field name="category" type="string" indexed="false" stored="false" docValues="true" />
