Is a column family placed one next to the other on disk in HBase? In other words, is HBase column-oriented?

I'm trying to understand whether HBase is a column-oriented database. I understand the structure of a single HBase row: it is divided into column families (which are static and don't change), and each column family can have a dynamic number of columns:

row: row-key1, familyA:a1 familyA:a2... familyB:b1,familyB:b2,familyB:b3

Now, it is stated that a column family is stored together on disk, so the familyA:a1 and familyA:a2 columns of row-key1 will be stored together on disk.

But what about the familyA:a1 and familyA:a2 values in two different rows? Are they also stored one after the other? That would mean that HBase is column-oriented.

Everywhere I look, I see HBase described as a wide-column store. Is that the same as column-oriented?

Before answering the question, I want to point out one thing about the HBase use case that makes the HFile layout easier to understand. From a read-workload perspective, HBase is optimized for random key-value lookups in really long and wide tables (trillions of rows and millions of columns). It works reasonably well for rowkey-prefix-based scans too, but it is not built for large single-column scans.

That said, HBase is not a truly columnar database, even though it is often described as a wide-column store. HBase stores all columns for the same row key and the same column family together. However, different column families are stored in different files, which gives HBase a columnar nature in the sense that you can control configs for each column family independently, and you can scan a single column family without worrying about read costs introduced by columns in other families. This is what a single HFile looks like (note that a column is called a qualifier in HBase; also, Type can be Put or Delete):

RowKey1:Family1:Qualifier1:Timestamp1:Type:Value
RowKey1:Family1:Qualifier1:Timestamp2:Type:Value
RowKey1:Family1:Qualifier2:Timestamp0:Type:Value
RowKey1:Family1:Qualifier3:Timestamp2:Type:Value
RowKey2:Family1:Qualifier1:Timestamp0:Type:Value
RowKey2:Family1:Qualifier2:Timestamp2:Type:Value

Notice that the Qualifier1 cells for RowKey1 and RowKey2 are not adjacent. Instead, all columns for the same row key (e.g. RowKey1) are adjacent.
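To make the layout concrete, here is a minimal Python sketch of the idea, not HBase code. The cell tuples, the family and qualifier names, and the `layout_by_family` helper are all made up for illustration. It groups cells into one sorted list per column family (standing in for one file per family) and orders each list by row key first, then qualifier.

```python
from collections import defaultdict

# Hypothetical cells: (row_key, family, qualifier, timestamp, value)
cells = [
    ("row2", "familyA", "a1", 5, "v5"),
    ("row1", "familyB", "b1", 3, "v3"),
    ("row1", "familyA", "a2", 2, "v2"),
    ("row2", "familyB", "b1", 6, "v6"),
    ("row1", "familyA", "a1", 1, "v1"),
    ("row2", "familyA", "a2", 4, "v4"),
]

def layout_by_family(cells):
    """Group cells into one sorted list per column family (one 'file' each)."""
    files = defaultdict(list)
    for cell in cells:
        files[cell[1]].append(cell)
    for family_cells in files.values():
        # Within a family, order by row key first, then by qualifier.
        family_cells.sort(key=lambda c: (c[0], c[2]))
    return dict(files)

files = layout_by_family(cells)
# (row, qualifier) pairs in the order they would sit in the familyA "file".
family_a_order = [(c[0], c[2]) for c in files["familyA"]]
print(family_a_order)
```

In the familyA list, both qualifiers of row1 come before any qualifier of row2, so the a1 values of the two rows are separated by row1's a2. A truly columnar layout would instead place all a1 values next to each other.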

If you stored every column in its own column family, HBase would become a truly columnar store, but then it would not be able to support millions of columns, because of the single-row, cross-column ACID semantics it offers and the locking strategy it uses to implement them.

Edit

Given the above structure of the HFile, the HFile data is actually stored in sorted order based on the following key (note that one file can contain only one family, so storing the family name in the data itself is somewhat redundant, but there are other uses for it that are outside the scope of this question):

RowKey:Family:Qualifier:Timestamp:Type

This sorting order, combined with block-level indexes and bloom filters on HFiles, makes HBase blazing fast at locating any random RowKey, a (RowKey, Family:Qualifier) tuple, or a (RowKey, Family:Qualifier, Timestamp) tuple.
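As a rough illustration of why a sorted key helps point lookups, the following Python sketch keeps cells in a sorted in-memory list and finds a (RowKey, Family:Qualifier) tuple with a binary search. It is a simplification under stated assumptions: the `sort_key` and `latest` helpers and the sample data are hypothetical, Type is omitted, timestamps are assumed to sort newest first, and the block indexes and bloom filters mentioned above are not modeled.

```python
import bisect

def sort_key(row, family, qualifier, timestamp):
    # Negate the timestamp so that newer versions sort first
    # (assumption for this sketch: newest-first ordering).
    return (row, family, qualifier, -timestamp)

# Hypothetical cells: (row, family, qualifier, timestamp, value)
raw = [
    ("row1", "f1", "q1", 10, "a"),
    ("row1", "f1", "q1", 20, "b"),
    ("row1", "f1", "q2", 5, "c"),
    ("row2", "f1", "q1", 7, "d"),
]
entries = sorted((sort_key(r, f, q, ts), v) for r, f, q, ts, v in raw)
keys = [k for k, _ in entries]

def latest(row, family, qualifier):
    """Return the newest value for (row, family:qualifier), or None."""
    # -inf sorts before every real (negated) timestamp for this column,
    # so bisect_left lands on the column's first, i.e. newest, cell.
    probe = (row, family, qualifier, float("-inf"))
    i = bisect.bisect_left(keys, probe)
    if i < len(keys) and keys[i][:3] == (row, family, qualifier):
        return entries[i][1]
    return None

print(latest("row1", "f1", "q1"))
print(latest("row1", "f1", "q9"))
```

Because the list is sorted by the full key, the search touches only a logarithmic number of entries, and all versions of a given cell sit next to each other starting at the position it returns.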
