
What is the difference between an Elasticsearch index and an index in a relational database?

It seems that in Elasticsearch you define an index on a collection, whereas in a relational database you define an index on a column. If the entire collection is indexed, why does the index need to be defined at all?

The word "index" is unfortunately used to mean very different things in Elasticsearch (ES) and in relational databases, because the two are optimized for different use cases.

An "index" in a relational database is a secondary data structure that makes WHERE queries and JOINs fast, and it typically stores values exactly as they appear in the table. You can still have columns that aren't indexed, but then a WHERE on such a column requires a full table scan, which is slow on large tables.
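To make the relational side concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and index name are made up for illustration, and the exact wording of the query plan depends on the SQLite version. It asks the planner how it would run the same WHERE query before and after an index exists on the column.

```python
import sqlite3

# In-memory database with a small, hypothetical "trips" table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, country TEXT, title TEXT)")
conn.executemany(
    "INSERT INTO trips (country, title) VALUES (?, ?)",
    [
        ("Holland", "Vacation at Holland 2015"),
        ("Brasil", "Carnival"),
        ("Argentina", "Hiking"),
    ],
)


def query_plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    # The human-readable detail is the last column of each row.
    return " | ".join(row[-1] for row in rows)


sql = "SELECT title FROM trips WHERE country = 'Holland'"

# No index on "country" yet: the planner has to scan the whole table.
plan_without_index = query_plan(sql)

# Add a secondary index on the column used in the WHERE clause.
conn.execute("CREATE INDEX idx_trips_country ON trips (country)")
plan_with_index = query_plan(sql)

print(plan_without_index)
print(plan_with_index)
```

The first plan should report a scan of the table and the second a search using idx_trips_country. The query returns the same rows either way; the index only changes how they are found.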

An "index" in ES is actually a collection of documents with a schema, similar to a database in the relational world. You can have different "types" of documents in ES, quite similar to tables in a database (this applies to older versions; mapping types were later deprecated and removed). ES gives you the flexibility to define, for each field of a document, whether you want to be able to retrieve it, search by it, or both. Some details on these options can be found, for example, here; they also relate to the _source field (the original JSON that was submitted to ES).

ES uses an inverted index to find matching documents efficiently, but most importantly it typically "normalizes" strings into tokens so that accurate free-text search can be performed. For example, sentences might be split into individual words and the words normalized to lower case, so that searching for "holland" matches the text "Vacation at Holland 2015".
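The normalization step can be sketched in a few lines of plain Python. This is a toy illustration and not how ES or Lucene is implemented; the function names (analyze, build_inverted_index, search) are invented for this example, and real analyzers do much more (stemming, stop words, and so on).

```python
import re
from collections import defaultdict


def analyze(text):
    """Split text into word tokens and lowercase them (a toy 'analyzer')."""
    return [token.lower() for token in re.findall(r"\w+", text)]


def build_inverted_index(docs):
    """Map each normalized token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every token of the query."""
    tokens = analyze(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {
    1: "Vacation at Holland 2015",
    2: "Business trip to Argentina",
}
index = build_inverted_index(docs)

print(search(index, "holland"))
```

Because both the stored text and the query go through the same analyze step, "holland", "Holland" and "HOLLAND" all end up as the same token and find document 1.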

If a field does not have an inverted index, you cannot search on it at all (unlike a database, which can fall back to a full table scan). Interestingly, you can also define fields that can be used for searching but cannot be retrieved; this is mainly beneficial when minimizing disk and RAM usage is important.
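As a rough sketch of what such per-field choices can look like, here is an index mapping written as a Python dict and printed as JSON. The field names are hypothetical, and it assumes a typeless mapping as used by newer ES versions; check the mapping documentation for your version before relying on it. It uses the "index" mapping parameter and the "excludes" option of _source.

```python
import json

# Hypothetical mapping body for an index; the field names are made up.
mapping = {
    "mappings": {
        # "body" is left out of the stored _source JSON, so it can be
        # searched but is not returned with the document.
        "_source": {"excludes": ["body"]},
        "properties": {
            # Searchable and retrievable (the default behaviour).
            "title": {"type": "text"},
            # Searchable only (see the _source exclusion above).
            "body": {"type": "text"},
            # Retrievable via _source, but no inverted index is built,
            # so it cannot be searched.
            "internal_note": {"type": "keyword", "index": False},
        },
    }
}

mapping_json = json.dumps(mapping, indent=2)
print(mapping_json)
```

The resulting JSON is the kind of body you would send when creating the index; how you send it (REST call or a client library) is up to you.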

Elasticsearch is by design a search engine; it is not usually preferred as primary storage in the way SQL Server or MongoDB are.

Why is the entire collection indexed?

Elasticsearch internally uses a structure called an inverted index, which stores the values of each field (column) for searching. If a field contains a string, ES tokenizes it and applies filters such as lowercasing.

In any case, you can only find data that is present in the inverted index. So by default Elasticsearch indexes all fields to make them searchable.

https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html

This is not like adding an index in a relational database. In a relational database all the data is already available to query, and you index the most frequently used columns to make lookups faster. But a relational index is much less efficient at finding all the rows that contain part of a given word (that is, searching for a word).
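The last point can be checked with a small sqlite3 sketch. The table and index names are hypothetical, and the plan wording varies between SQLite versions. Even with an index on the column, a substring match with a leading wildcard cannot seek into the index and has to walk every entry.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE INDEX idx_trips_title ON trips (title)")
conn.executemany(
    "INSERT INTO trips (title) VALUES (?)",
    [("Vacation at Holland 2015",), ("Carnival",), ("Hiking",)],
)


def query_plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


# Exact match: the planner can seek directly into the index.
exact_plan = query_plan("SELECT id FROM trips WHERE title = 'Carnival'")

# Substring match: every entry has to be examined.
substring_sql = "SELECT title FROM trips WHERE title LIKE '%holland%'"
substring_plan = query_plan(substring_sql)
matches = [row[0] for row in conn.execute(substring_sql)]

print(exact_plan)
print(substring_plan)
print(matches)
```

The substring query still returns the right row, but only by scanning. An inverted index avoids this by looking up the token "holland" directly.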

I'll refer to:

"It seems that in elastic search you would define an index on a collection"

In Elasticsearch, an index is like a database in the relational world. The index contains multiple documents, just as a relational database contains tables.

So far, this is clear.

In order to manage large amounts of data, Elasticsearch (a distributed database by nature) breaks each index into smaller chunks called shards, which are distributed across the Elasticsearch nodes.
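The idea of spreading documents over shards can be sketched as hashing a routing value (by default the document id) and taking it modulo the number of primary shards. This is a simplified illustration only: zlib.crc32 is a stand-in for the hash function ES actually uses, and the shard count and document ids are made up.

```python
import zlib
from collections import defaultdict

NUM_PRIMARY_SHARDS = 3  # hypothetical setting for this sketch


def pick_shard(routing_value, num_shards=NUM_PRIMARY_SHARDS):
    """Map a routing value (here: the document id) to a shard number."""
    # zlib.crc32 is only a stand-in hash; it is not the hash ES uses.
    return zlib.crc32(routing_value.encode("utf-8")) % num_shards


# Group some hypothetical document ids by the shard they land on.
shards = defaultdict(list)
for doc_id in ["doc_id_1", "doc_id_6", "doc_id_8"]:
    shards[pick_shard(doc_id)].append(doc_id)

print(dict(shards))
```

Since the shard is derived from the id alone, any node can work out where a document lives without a lookup table. It also hints at why the number of primary shards is not something you change casually once data has been indexed.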

The confusion starts with the fact that shards are data structures based on the Apache Lucene library.
Apache Lucene's index belongs to the family of indexes known as inverted indexes.

It is called an "inverted index" because, for each term, it lists the documents that contain it:

Term           Document                 Frequency
Brasil         doc_id_1, doc_id_8       4 (2 in doc_id_1, 2 in doc_id_8)
Argentina      doc_id_1, doc_id_6       3 (2 in doc_id_1, 1 in doc_id_6)

So, as you can see above, this structure stores statistics (frequencies) about terms in order to make term-based search more efficient.

(*) This is the inverse (Term -> Documents) of the natural relationship, in which documents list terms (Document -> Terms).
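The inversion can be sketched in Python by starting from the natural Document -> Terms form and flipping it into Term -> Documents with per-document frequencies. The data mirrors the table above; the function name invert is made up, and this is a toy model, not Lucene's actual on-disk format.

```python
from collections import Counter, defaultdict

# Natural direction: each document lists the terms it contains.
forward = {
    "doc_id_1": ["Brasil", "Brasil", "Argentina", "Argentina"],
    "doc_id_6": ["Argentina"],
    "doc_id_8": ["Brasil", "Brasil"],
}


def invert(forward_index):
    """Turn Document -> Terms into Term -> {document: frequency}."""
    postings = defaultdict(dict)
    for doc_id, terms in forward_index.items():
        for term, count in Counter(terms).items():
            postings[term][doc_id] = count
    return dict(postings)


inverted = invert(forward)

for term, posting in inverted.items():
    # Term, documents containing it, total frequency across documents.
    print(term, sorted(posting), sum(posting.values()))
```

Looking up a term is now a single dictionary access that returns exactly the documents containing it, along with the counts a search engine can use for scoring.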


Summary:

1) Elasticsearch index:
There are two different usages of the word "index".
One is quite trivial: an index is like a database.
The other is confusing: shards are based on a data structure named "inverted index".

2) Relational database index:
A structure associated with a table or view that speeds up retrieval of rows from that table or view.
