What exactly is a wide column store?

Googling for a definition either returns results for column-oriented databases or gives very vague definitions.

My understanding is that wide column stores consist of column families, which in turn consist of rows and columns. Each row within a family is stored together on disk. This sounds like how row-oriented databases store their data, which brings me to my first question:

How are wide column stores different from a regular relational DB table? This is the way I see it:

* column family        -> table
* column family column -> table column
* column family row    -> table row

This image from Database Internals simply looks like two regular tables:

[Image: two column families, contents and anchor]

My guess as to what is different comes from the fact that "multi-dimensional map" is mentioned alongside wide column stores. So here is my second question:

Are wide column stores sorted from left to right? Meaning, in the above example, are the rows sorted first by Row Key, then by Timestamp, and finally by Qualifier?

Let's start with the definition of a wide column database.

Its architecture uses a persistent, sparse, multi-dimensional mapping (row value, column value, and timestamp) in a tabular format meant for massive scalability (at and beyond the petabyte scale).
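
One way to picture the "multi-dimensional mapping" in that definition is as a nested map from row key to column to timestamp to value. The sketch below is only an illustration: the class and method names (`SparseTable`, `put`, `get`, `columns`) and the sample row keys and column names are made up here, and it does not claim to match how any particular store is implemented.

```python
from collections import defaultdict


class SparseTable:
    """Toy (row key, column, timestamp) -> value map. Hypothetical, for illustration."""

    def __init__(self):
        # row key -> column name -> {timestamp: value}
        self._cells = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, column, timestamp, value):
        self._cells[row_key][column][timestamp] = value

    def get(self, row_key, column, timestamp=None):
        """Return the value at `timestamp`, or the newest version if it is omitted."""
        versions = self._cells.get(row_key, {}).get(column, {})
        if not versions:
            return None  # sparse: a cell that was never written is simply absent
        if timestamp is None:
            timestamp = max(versions)
        return versions.get(timestamp)

    def columns(self, row_key):
        """Columns that actually exist for this row, in sorted order."""
        return sorted(self._cells.get(row_key, {}))


# Sample data is invented for this sketch.
table = SparseTable()
table.put("com.example", "contents:html", 1, "<html>v1</html>")
table.put("com.example", "contents:html", 2, "<html>v2</html>")
table.put("com.example", "anchor:cnn.com", 1, "Example")
table.put("org.other", "contents:html", 1, "<html>other</html>")

print(table.get("com.example", "contents:html"))
print(table.columns("org.other"))
```

Each row only stores the columns it has, which is what "sparse" means here: `org.other` has no `anchor:cnn.com` cell at all rather than a NULL.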

A relational database is designed to maintain the relationship between an entity and the columns that describe it. A good example is a Customer table. The columns hold values describing the customer's name, address, and contact information. Every customer has the same set of columns.

A wide column database is one type of NoSQL database.

Maybe this is a better image of four wide column databases.

[Image: wide column databases]

My understanding is that the first image at the top, the Column model, is what we called an entity/attribute/value table. It's an attribute/value table within a particular entity (column).

For Customer information, the first wide column database example might look like this.

Customer ID    Attribute    Value
-----------    ---------    ---------------
     100001    name         John Smith
     100001    address 1    10 Victory Lane
     100001    address 3    Pittsburgh, PA  15120

Yes, we could have modeled this for a relational database. The power of the attribute/value table comes with the more unusual attributes.

Customer ID    Attribute    Value
-----------    ---------    ---------------
     100001    fav color    blue
     100001    fav shirt    golf shirt

Any attribute that a marketer can dream up can be captured and stored in an attribute/value table. Different customers can have different attributes.

The Super Column model keeps the same information in a different format.

Customer ID: 100001
Attribute    Value
---------    --------------
fav color    blue
fav shirt    golf shirt

You can have as many Super Column models as you have entities. They can be in separate NoSQL tables or put together as a Super Column family.
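
To make the relationship between the two formats concrete, here is a small sketch that regroups attribute/value rows into one attribute map per customer, which is roughly the Super Column shape described above. The function name `group_by_entity` and the second customer are hypothetical and added only for illustration.

```python
from collections import defaultdict

# (customer id, attribute, value) rows in the flat attribute/value format.
rows = [
    (100001, "name", "John Smith"),
    (100001, "address 1", "10 Victory Lane"),
    (100001, "fav color", "blue"),
    (100001, "fav shirt", "golf shirt"),
    (100002, "name", "Jane Doe"),  # hypothetical second customer, not in the original
]


def group_by_entity(triples):
    """Regroup flat (entity, attribute, value) rows into one map per entity."""
    grouped = defaultdict(dict)
    for entity_id, attribute, value in triples:
        grouped[entity_id][attribute] = value
    return dict(grouped)


customers = group_by_entity(rows)
for customer_id, attributes in customers.items():
    print(customer_id, attributes)
```

Customer 100002 simply has no "fav color" entry, so different entities can carry different attributes without any schema change.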

The Column Family and Super Column Family simply give a row ID to the first two models in the picture for quicker retrieval of information.

Most (if not all) wide-column stores are indeed row-oriented stores, in that all parts of a record are stored together. You can see that as a 2-dimensional key-value store. The first part of the key is used to distribute the data across servers; the second part of the key lets you quickly find the data on the target server.
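
The two-part key idea can be sketched as follows. This is a toy model under stated assumptions, not any real store's design: the class name `TwoPartStore`, the fixed node count, and the use of SHA-256 modulo the number of nodes are all choices made for this example. The first key part is hashed to pick a node, and the second part is kept sorted within the partition so it can be found by binary search.

```python
import bisect
import hashlib


class TwoPartStore:
    """Toy 2-dimensional key-value store. Hypothetical, for illustration only."""

    def __init__(self, node_count=3):
        # Each node maps: partition key -> list of (clustering key, value),
        # kept sorted by clustering key.
        self.nodes = [dict() for _ in range(node_count)]

    def _node_for(self, partition_key):
        # First part of the key: decides which node holds the data.
        digest = hashlib.sha256(str(partition_key).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, partition_key, clustering_key, value):
        node = self.nodes[self._node_for(partition_key)]
        rows = node.setdefault(partition_key, [])
        keys = [k for k, _ in rows]  # rebuilt on every call; fine for a toy
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, value)  # overwrite an existing row
        else:
            rows.insert(i, (clustering_key, value))  # keep the list sorted

    def get(self, partition_key, clustering_key):
        node = self.nodes[self._node_for(partition_key)]
        rows = node.get(partition_key, [])
        keys = [k for k, _ in rows]
        # Second part of the key: binary search inside the partition.
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            return rows[i][1]
        return None


store = TwoPartStore()
store.put(1, ("US", "2020-10-01"), "a...")
store.put(1, ("JP", "2020-11-01"), "b...")
store.put(1, ("US", "2020-09-01"), "c...")
print(store.get(1, ("JP", "2020-11-01")))
```

All rows sharing the first key part end up on the same node, which is why reads for a single partition do not need to contact every server.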

Different wide-column stores have different features and behaviors. Apache Cassandra, for example, allows you to define how the data will be sorted. Take this table for example:

| id | country | timestamp  | message |
|----+---------+------------+---------|
| 1  | US      | 2020-10-01 | "a..."  |
| 1  | JP      | 2020-11-01 | "b..."  |
| 1  | US      | 2020-09-01 | "c..."  |
| 2  | CA      | 2020-10-01 | "d..."  |
| 2  | CA      | 2019-10-01 | "e..."  |
| 2  | CA      | 2020-11-01 | "f..."  |
| 3  | GB      | 2020-09-01 | "g..."  |
| 3  | GB      | 2020-09-02 | "h..."  |
|----+---------+------------+---------|

If your partitioning key is (id) and your clustering key is (country, timestamp), the data will be stored like this:

[Key 1]
1:JP,2020-11-01,"b..." | 1:US,2020-09-01,"c..." | 1:US,2020-10-01,"a..."
[Key 2]
2:CA,2019-10-01,"e..." | 2:CA,2020-10-01,"d..." | 2:CA,2020-11-01,"f..."
[Key 3]
3:GB,2020-09-01,"g..." | 3:GB,2020-09-02,"h..."

Or in table form:

| id | country | timestamp  | message |
|----+---------+------------+---------|
| 1  | JP      | 2020-11-01 | "b..."  |
| 1  | US      | 2020-09-01 | "c..."  |
| 1  | US      | 2020-10-01 | "a..."  |
| 2  | CA      | 2019-10-01 | "e..."  |
| 2  | CA      | 2020-10-01 | "d..."  |
| 2  | CA      | 2020-11-01 | "f..."  |
| 3  | GB      | 2020-09-01 | "g..."  |
| 3  | GB      | 2020-09-02 | "h..."  |
|----+---------+------------+---------|
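
The layout above can be reproduced with a few lines of Python. This is only a sketch of the grouping and sorting behaviour, not of Cassandra itself; the helper name `layout` and its `descending` parameter are invented for this example. It groups rows by partitioning key and sorts each group by the clustering key.

```python
# (id, country, timestamp, message) rows in their original insertion order.
rows = [
    (1, "US", "2020-10-01", "a..."),
    (1, "JP", "2020-11-01", "b..."),
    (1, "US", "2020-09-01", "c..."),
    (2, "CA", "2020-10-01", "d..."),
    (2, "CA", "2019-10-01", "e..."),
    (2, "CA", "2020-11-01", "f..."),
    (3, "GB", "2020-09-01", "g..."),
    (3, "GB", "2020-09-02", "h..."),
]


def layout(rows, partition_key, clustering_key, descending=False):
    """Group rows by partition key, then sort each partition by clustering key."""
    partitions = {}
    for row in rows:
        partitions.setdefault(partition_key(row), []).append(row)
    for partition in partitions.values():
        # ISO-formatted dates sort chronologically when compared as strings.
        partition.sort(key=clustering_key, reverse=descending)
    return partitions


# Partitioning key (id), clustering key (country, timestamp).
stored = layout(
    rows,
    partition_key=lambda r: r[0],
    clustering_key=lambda r: (r[1], r[2]),
)

for key, partition in stored.items():
    print(f"[Key {key}]", " | ".join(r[3] for r in partition))
```

Within partition 1 the JP row comes first because the clustering key compares country before timestamp.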

If you change the primary key (the composite of the partitioning and clustering keys) to (id, timestamp) WITH CLUSTERING ORDER BY (timestamp DESC) (id is the partitioning key, timestamp is the clustering key in descending order), the rows in each partition are ordered from newest to oldest and the result would be:

[Key 1]
1:JP,2020-11-01,"b..." | 1:US,2020-10-01,"a..." | 1:US,2020-09-01,"c..."
[Key 2]
2:CA,2020-11-01,"f..." | 2:CA,2020-10-01,"d..." | 2:CA,2019-10-01,"e..."
[Key 3]
3:GB,2020-09-02,"h..." | 3:GB,2020-09-01,"g..."

Or in table form:

| id | country | timestamp  | message |
|----+---------+------------+---------|
| 1  | JP      | 2020-11-01 | "b..."  |
| 1  | US      | 2020-10-01 | "a..."  |
| 1  | US      | 2020-09-01 | "c..."  |
| 2  | CA      | 2020-11-01 | "f..."  |
| 2  | CA      | 2020-10-01 | "d..."  |
| 2  | CA      | 2019-10-01 | "e..."  |
| 3  | GB      | 2020-09-02 | "h..."  |
| 3  | GB      | 2020-09-01 | "g..."  |
|----+---------+------------+---------|
