
How do columnar databases do indexing?

Original post: 2018-02-08 16:59:07 8 3 mysql / database

I understand that columnar databases store each column's data together on disk, rather than each row's. I also understand that in a traditional row-wise RDBMS, a leaf node of the B-tree index contains a pointer to the actual row.

But since columnar databases don't store rows together, and are designed specifically for columnar operations, how do their indexing techniques differ?

Do they also use B-trees?
How do they index inside whatever data structure they use?
Or is there no accepted format, and every vendor has its own indexing scheme to suit its needs?

I have been searching but have been unable to find any text on this. Every text I found is about row-wise DBMSs.

3 answers

If you understand (1) how columnar databases actually store their data and (2) how indexes work (how they store their data), then you may feel that there is no need for indexing in columnar databases.

For any kind of database, the rowid is very important; it is like the address where the data is stored. Indexing is nothing but mapping rowids to the indexed columns, kept in sorted order. Columnar databases were born from this logic. They try to store the data itself in this fashion: as serialized key-value pairs, where the actual column value is the key and the rowid where the data resides is the value. If they find duplicates for a key, they simply compress them and store them.
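The key-value layout described above can be sketched roughly as follows. This is only an illustration of the idea in this answer, not any engine's actual on-disk format; the function name `build_value_to_rowids` and the sample data are hypothetical.

```python
from collections import defaultdict


def build_value_to_rowids(column_values):
    """Map each distinct column value (key) to the rowids that hold it (value)."""
    mapping = defaultdict(list)
    for rowid, value in enumerate(column_values):
        # Duplicate values share one key; only the rowid list grows.
        mapping[value].append(rowid)
    # Keep keys in sorted order, as an index would.
    return dict(sorted(mapping.items()))


# Hypothetical column with repeated values.
city = ["Paris", "Oslo", "Paris", "Rome", "Oslo", "Paris"]
city_column = build_value_to_rowids(city)
print(city_column)
# {'Oslo': [1, 4], 'Paris': [0, 2, 5], 'Rome': [3]}
```

Each distinct value appears once, which is where the "compress the duplicates" step in the description comes from.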

So if you compare the format in which columnar databases actually store data on disk with the way row-oriented databases store indexes, the two are almost the same. They are not exactly the same: the differences are the compression and the key-value pairs being represented the other way around.

That's the reason you don't need separate indexing, and you won't find any columnar database trying to implement indexing.

There are no BTrees. (Or, if there are, they are not the main part of the design.)

Infinidb stores 64K rows per chunk. Each column in that chunk is compressed and indexed. Stored with the chunk is a list of things like min, max, avg, etc., for each column, which may or may not help in queries.

Running a SELECT first looks at that summary info for each chunk to see whether the WHERE clause might be satisfied by any of the rows in the chunk.

The chunks that pass that filtering get looked at in more detail.
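The two-step filtering above can be sketched like this. It is a simplified illustration, not Infinidb's code: the `Chunk` class, the tiny `CHUNK_ROWS` value, and the equality-only predicate are assumptions made for the example.

```python
from dataclasses import dataclass

CHUNK_ROWS = 4  # tiny for illustration; the answer cites 64K rows per chunk


@dataclass
class Chunk:
    values: list
    minimum: int
    maximum: int


def make_chunks(column, chunk_rows=CHUNK_ROWS):
    """Split a column into chunks and record min/max summary info for each."""
    chunks = []
    for start in range(0, len(column), chunk_rows):
        part = column[start:start + chunk_rows]
        chunks.append(Chunk(part, min(part), max(part)))
    return chunks


def select_equal(chunks, target):
    """Return (matching rowids, number of chunks actually scanned)."""
    rowids, scanned, offset = [], 0, 0
    for chunk in chunks:
        # Step 1: the summary info rules out chunks that cannot match.
        if chunk.minimum <= target <= chunk.maximum:
            # Step 2: only surviving chunks are looked at in detail.
            scanned += 1
            rowids.extend(offset + i for i, v in enumerate(chunk.values) if v == target)
        offset += len(chunk.values)
    return rowids, scanned


prices = [1, 3, 2, 4, 10, 12, 11, 13, 20, 22, 21, 23]
chunks = make_chunks(prices)
matches, scanned = select_equal(chunks, 12)
print(matches, scanned)  # [5] 1
```

Here only one of the three chunks is scanned; the pruning helps most when values are clustered, which is why the summary "may or may not help" for a given query.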

There is no copy of a row. Instead, if, say, you ask for SELECT a,b,c, then the compressed info for the 64K rows (in one chunk) for each of a, b, and c needs to be decompressed to further filter and deliver the rows. So it behooves you to list only the desired columns, not blindly say SELECT *.
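To make the cost of SELECT * concrete, here is a minimal sketch in which each column of a chunk is compressed separately and only the requested columns are decompressed. The use of `zlib` and JSON is purely an assumption for the example; real engines use their own encodings.

```python
import json
import zlib


def compress_column(values):
    """Compress one column's values independently of the other columns."""
    return zlib.compress(json.dumps(values).encode("utf-8"))


def decompress_column(blob):
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# One hypothetical chunk: every column is stored as its own compressed blob.
chunk = {
    "a": compress_column([1, 2, 3]),
    "b": compress_column(["x", "y", "z"]),
    "c": compress_column([0.5, 1.5, 2.5]),
    "d": compress_column(["unused"] * 3),
}


def select(chunk, columns):
    """Rebuild rows from the requested columns; return rows and columns decompressed."""
    decoded = {name: decompress_column(chunk[name]) for name in columns}
    rows = list(zip(*(decoded[name] for name in columns)))
    return rows, len(decoded)


rows, decompressed = select(chunk, ["a", "b"])
print(rows, decompressed)  # [(1, 'x'), (2, 'y'), (3, 'z')] 2
```

Asking for two columns touches two blobs; asking for every column would decompress all four.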

Since every column is separately indexed all the time, there is no need to say INDEX(a). (I don't know if INDEX(a,b) can even be specified for a columnar DB.)

Caveat: I am describing Infinidb, which is available with MariaDB. I don't know about any other columnar engines.

Columnar indexes (also known as "vertical data storage") store data in a hashed and compressed form. All columns named in the index key are indexed separately. Hashing decreases the volume of data stored. The compression method stores only one value for repeated occurrences (a dictionary, possibly partial).
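The "one value for repeated occurrences" idea is commonly called dictionary encoding. A minimal sketch, assuming a simple in-memory layout (the function names are hypothetical, and Python's `dict` stands in for the hashing step and resolves collisions internally):

```python
def dictionary_encode(values):
    """Store each distinct value once and replace the column with small integer codes."""
    dictionary = []  # code -> value
    code_of = {}     # value -> code (a hash table)
    codes = []
    for value in values:
        if value not in code_of:
            code_of[value] = len(dictionary)
            dictionary.append(value)
        codes.append(code_of[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    return [dictionary[code] for code in codes]


country = ["FR", "US", "FR", "FR", "DE", "US"]
dictionary, codes = dictionary_encode(country)
print(dictionary)  # ['FR', 'US', 'DE']
print(codes)       # [0, 1, 0, 0, 2, 1]
```

The codes vector is what gets stored per row; the fewer distinct values a column has, the more this saves.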

This technique has two major difficulties:

First, you can have collisions, because the hash result can be the same for two distinct values, so the index must manage collisions.
Second, the hashing and compression algorithms used are very heavy consumers of resources such as CPU.

These types of indexes are stored as vectors.

Ordinarily, these types of indexes are used only for read-only tables, especially for business intelligence (OLAP databases).

A columnar index can be used in a "seekable" way only for an equality predicate (COLUMN_A = OneValue). But it is also well suited to grouping or DISTINCT operations. A columnar index does not support range seeks, including LIKE 'foo%'.
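One way to see why a hash-based structure favors equality over ranges is the sketch below. It assumes a plain hash map from value to rowids, which is a simplification and not how any particular product lays out its index; all names are hypothetical.

```python
def build_hash_index(values):
    """Hash map from column value to the rowids that contain it."""
    index = {}
    for rowid, value in enumerate(values):
        index.setdefault(value, []).append(rowid)
    return index


def equality_seek(index, value):
    # A single hash probe answers COLUMN_A = value.
    return index.get(value, [])


def prefix_scan(index, prefix):
    """LIKE 'prefix%': hashing keeps no order, so every distinct key is examined."""
    examined = 0
    rowids = []
    for key, ids in index.items():
        examined += 1
        if key.startswith(prefix):
            rowids.extend(ids)
    return sorted(rowids), examined


names = ["foo", "bar", "food", "baz", "foo"]
index = build_hash_index(names)
hits = equality_seek(index, "foo")
prefix_hits, examined = prefix_scan(index, "foo")
print(hits)                   # [0, 4]
print(prefix_hits, examined)  # [0, 2, 4] 4
print(len(index))             # 4 distinct values, which is all DISTINCT needs
```

The equality lookup is one probe, while the prefix query walks all four distinct keys; the key set itself is already the answer to a DISTINCT.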

Some database vendors have gotten around the huge resources needed for inserts and updates by adding intermediate algorithms that decrease the CPU cost. This is the case for Microsoft SQL Server, which uses a delta store for newly modified rows. With this technique, the table can be used in a relational way like any classical OLTP database.
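The delta store idea can be sketched as follows. This is a toy model under stated assumptions, not SQL Server's implementation: the class name, the row-count `threshold`, and the uncompressed column lists are all hypothetical simplifications.

```python
class DeltaStoreTable:
    """Toy table: new rows go to a row-wise buffer, later moved into column segments."""

    def __init__(self, threshold=3):
        self.threshold = threshold  # hypothetical size at which the buffer is converted
        self.delta_rows = []        # cheap, row-wise inserts land here
        self.segments = []          # each segment maps column name -> list of values

    def insert(self, row):
        self.delta_rows.append(row)
        if len(self.delta_rows) >= self.threshold:
            self._move_delta_to_segment()

    def _move_delta_to_segment(self):
        # Pay the columnar conversion cost once per batch, not once per row.
        names = self.delta_rows[0].keys()
        segment = {name: [row[name] for row in self.delta_rows] for name in names}
        self.segments.append(segment)
        self.delta_rows = []

    def scan(self, column):
        """Readers must combine the column segments with the pending delta rows."""
        values = []
        for segment in self.segments:
            values.extend(segment[column])
        values.extend(row[column] for row in self.delta_rows)
        return values


table = DeltaStoreTable(threshold=3)
for i in range(1, 5):
    table.insert({"id": i, "qty": i * 10})

print(len(table.segments), len(table.delta_rows))  # 1 1
print(table.scan("id"))                            # [1, 2, 3, 4]
```

Writes stay cheap because they only append to the buffer, and queries still see every row because a scan reads both the segments and the buffer.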

For instance, Microsoft SQL Server first introduced the columnstore index in the 2012 version, but this made the table read-only. In 2014 the clustered columnstore index (in which all the columns of the table are indexed) was released, and the table was writable. And finally, in the 2016 version, the columnstore index, clustered or not, no longer requires any part of the table to be read-only. This was made possible because a particular search algorithm, named "Batch Mode", was developed by Microsoft Research; it does not work by reading the data row by row...

Further reading:

Enhancements to SQL Server Column Stores

Columnstore and B+ tree – Are Hybrid Physical Designs Important?