Is a column family placed one next to the other on disk in HBase? In other words, is HBase column-oriented?
I'm trying to understand whether HBase is a column-oriented DB. I understand the structure of a single HBase row: it is divided into column families (which are static and don't change), and each column family can have a dynamic number of columns:
row: row-key1, familyA:a1 familyA:a2... familyB:b1,familyB:b2,familyB:b3
Now, it is stated that a column family is stored together on disk, so the familyA:a1 and familyA:a2 columns of row-key1 will be stored together on disk.
But what about the familyA:a1 and familyA:a2 values in two different rows? Are they also stored one after the other? That would mean that HBase is column-oriented.
Everywhere I look I see that HBase is a wide-column store. Is that the same as column-oriented?
Before answering the question, I want to point out one thing about the HBase use case that makes the HFile layout easier to understand. From a read-workload perspective, HBase is optimized for random key-value lookups in really long and wide tables (trillions of rows and millions of columns). It works reasonably well for rowkey-prefix-based scans too, but it's not built for large single-column scans.
That said, HBase is not a truly columnar database; being a wide-column store is not the same thing. HBase stores all columns for the same row key and the same column family together. However, different column families are stored in different files, which gives HBase a columnar nature in the sense that you can control configs for each column family independently, and you can scan a single column family without paying read costs for columns in other families.
This is what a single HFile looks like (note that a column is called a qualifier in HBase, and that Type can be Put or Delete):
RowKey1:Family1:Qualifier1:Timestamp1:Type:Value
RowKey1:Family1:Qualifier1:Timestamp2:Type:Value
RowKey1:Family1:Qualifier2:Timestamp0:Type:Value
RowKey1:Family1:Qualifier3:Timestamp2:Type:Value
RowKey2:Family1:Qualifier1:Timestamp0:Type:Value
RowKey2:Family1:Qualifier2:Timestamp2:Type:Value
Notice that the Qualifier1 cells of RowKey1 and RowKey2 are not adjacent. Instead, all columns for the same row key (e.g. RowKey1) are adjacent.
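To make the layout concrete, here is a minimal Python sketch (not HBase code) that groups a handful of made-up cells into one "file" per column family and sorts each file by row key, family, qualifier and timestamp. The cell tuples, `sort_key` and `build_files` are all hypothetical names for illustration. One assumption worth flagging: the sketch puts newer timestamps first, which is how HBase orders versions as far as I know, and it ignores the ordering of Type entirely.

```python
from collections import defaultdict

# Hypothetical cells as (row, family, qualifier, timestamp, type, value) tuples,
# deliberately listed out of order.
cells = [
    ("RowKey2", "Family1", "Qualifier1", 0, "Put", "v5"),
    ("RowKey1", "Family1", "Qualifier2", 0, "Put", "v3"),
    ("RowKey1", "Family2", "Qualifier1", 1, "Put", "v7"),
    ("RowKey1", "Family1", "Qualifier1", 2, "Put", "v2"),
    ("RowKey2", "Family1", "Qualifier2", 2, "Put", "v6"),
    ("RowKey1", "Family1", "Qualifier1", 1, "Put", "v1"),
    ("RowKey1", "Family1", "Qualifier3", 2, "Put", "v4"),
]


def sort_key(cell):
    row, family, qualifier, timestamp, _type, _value = cell
    # Assumption: newest timestamp first; Type ordering is left out of this sketch.
    return (row, family, qualifier, -timestamp)


def build_files(cells):
    """Group cells by column family, then sort each group like one HFile."""
    files = defaultdict(list)
    for cell in cells:
        files[cell[1]].append(cell)
    return {family: sorted(group, key=sort_key) for family, group in files.items()}


files = build_files(cells)
family1 = files["Family1"]
for cell in family1:
    print(":".join(str(part) for part in cell))

# Positions of the Qualifier1 cells inside the Family1 file.
q1_positions = [i for i, cell in enumerate(family1) if cell[2] == "Qualifier1"]
print(q1_positions)
```

In the Family1 file, the Qualifier1 cells of RowKey1 sit at the start, while the Qualifier1 cell of RowKey2 only appears after every other RowKey1 column. That gap is exactly what a columnar layout would avoid.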
If you stored every column in its own column family, HBase would become a truly columnar store, but then it could not support millions of columns, because of the locking strategy it uses to provide single-row, cross-column ACID semantics.
Edit
Given the above structure, the data in an HFile is actually stored sorted by the following key (note that one file can hold only one family, so storing the family name in the data itself is somewhat redundant, but it has other uses that are outside the scope of this question):
RowKey:Family:Qualifier:Timestamp:Type
This sort order, combined with block-level indexes and bloom filters on HFiles, makes HBase blazing fast at locating any random RowKey, a (RowKey, Family:Qualifier) tuple, or a (RowKey, Family:Qualifier, Timestamp) tuple.
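As a rough illustration of why a sorted file plus a block index makes such lookups cheap, here is a small Python sketch. It is not how HBase implements it: the keys, `BLOCK_SIZE` (counted in cells here, whereas real HFile blocks are sized in bytes) and the `seek` helper are assumptions made up for the example, and bloom filters are not modelled at all.

```python
from bisect import bisect_left, bisect_right

# Hypothetical sorted keys of one HFile: (row, family, qualifier, -timestamp).
keys = sorted(
    (f"row{r:03d}", "Family1", f"q{q}", -ts)
    for r in range(20)
    for q in range(3)
    for ts in (1, 2)
)

BLOCK_SIZE = 16  # cells per block in this sketch

# Split the file into blocks and remember the first key of each block.
blocks = [keys[i:i + BLOCK_SIZE] for i in range(0, len(keys), BLOCK_SIZE)]
block_index = [block[0] for block in blocks]


def seek(prefix):
    """Return the first key >= prefix, touching at most two blocks."""
    # Last block whose first key is <= prefix; the match may also be
    # the very first key of the following block.
    i = max(bisect_right(block_index, prefix) - 1, 0)
    for block in blocks[i:i + 2]:
        j = bisect_left(block, prefix)
        if j < len(block):
            return block[j]
    return None


print(seek(("row007",)))                  # lookup by RowKey
print(seek(("row007", "Family1", "q2")))  # lookup by RowKey, Family:Qualifier
print(seek(("row999",)))                  # no such row
```

The same `seek` serves all three lookup shapes because a shorter tuple sorts just before every key that starts with it, so a RowKey alone, or a RowKey plus Family:Qualifier, lands on the first matching cell.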