HBase sorting efficiency

Originally posted 2014-07-25 03:48:17. Tags: hadoop, hbase

In my HBase table I have an employee named "Simon" at row 100, and at row 4000 I have another employee with the same name, "Simon". Now I want to get all employees named "Simon" from my Employee table. The row key is the SSN of each employee.

My question is: if I fire a query to get all employees with the name "Simon", how efficient is the search in HBase? The first "Simon" is in row 100 and the second "Simon" is in row 4000, so to get the employees named "Simon", HBase has to traverse the whole table to find this name. How can the search be efficient when we are doing a full table scan in this scenario?

1 Answer

If you have to do a full table scan (and in this case you do), that's not going to be a great solution. In fact, if you have a very large number of rows, it's going to be a terrible solution.
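To make the cost concrete, here is a minimal sketch in Python. It does not use any HBase client API: a plain dict keyed by SSN stands in for the Employee table, and the SSNs, the column names and the helper scan_by_name are all made up for illustration.

```python
# Toy model only: a dict keyed by (hypothetical) SSN stands in for the
# HBase Employee table. No real HBase API is used here.
employee_table = {
    "111-11-1111": {"name": "Simon", "dept": "D01"},
    "222-22-2222": {"name": "Alice", "dept": "D02"},
    "333-33-3333": {"name": "Simon", "dept": "D02"},
}


def scan_by_name(table, name):
    """Return (matching row keys, number of rows examined)."""
    examined = 0
    matches = []
    # Rows are visited in row-key order, as in an HBase scan.
    for row_key in sorted(table):
        examined += 1
        if table[row_key]["name"] == name:
            matches.append(row_key)
    return matches, examined


matches, examined = scan_by_name(employee_table, "Simon")
```

Because the name is not part of the row key, every row is examined no matter how few rows match, so the work grows with the size of the table rather than with the size of the result.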

What most relational database management systems (or "SQL databases") do to solve this problem is create indexes. Since you're using a "NoSQL database", it won't create indexes for you automatically; HBase only keeps rows sorted by row key.

Let's look at how to create indexes manually so that particular types of queries are accommodated efficiently.

Suppose you have a collection of entities S where each entity E in S has a unique key K(E) and an attribute value V(E). Further suppose your entities are in an HBase table, one per row, with K(E) as the row key for each entity E.

An index of S with respect to V is another table that typically comes in one of three forms.

Index Form 1

Suppose that V(E) is also unique for each entity E. Then the index of S with respect to V is a table with one entity per row, where the table has row key V(E) and a column containing K(E).

To look up an entity E by V(E), simply go to that row of the index to look up K(E), then fetch the entity from the main table by K(E).

If your attribute values V(E) are unique, use this approach.

Think of a table of Employee entities, where each employee has a unique EmployeeID within the company, K(E). The main Employee table could use the unique EmployeeID as the row key, and the Employee_SSN_Index could use the employee's SSN, V(E) (which is also unique), as its row key. This provides a fast lookup of employees by their SSN.
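As a rough sketch of Form 1, again with dicts standing in for HBase tables rather than a real client: the sample rows, build_unique_index and lookup_by_ssn are hypothetical names introduced only for this example.

```python
# Toy model only: dicts stand in for the main table and the index table.
employee_table = {
    "E001": {"name": "Simon", "ssn": "111-11-1111"},
    "E002": {"name": "Alice", "ssn": "222-22-2222"},
}


def build_unique_index(table, attribute):
    """Build an index with row key V(E) and a single value K(E)."""
    index = {}
    for row_key, row in table.items():
        value = row[attribute]
        if value in index:
            # Form 1 only works when V(E) is unique per entity.
            raise ValueError(f"duplicate {attribute} value: {value!r}")
        index[value] = row_key
    return index


employee_ssn_index = build_unique_index(employee_table, "ssn")


def lookup_by_ssn(ssn):
    """Two point reads: index row V(E) -> K(E), then main row K(E)."""
    employee_id = employee_ssn_index.get(ssn)
    if employee_id is None:
        return None
    return employee_table[employee_id]
```

The lookup costs two reads by key regardless of how many employees exist. In a real deployment the index table has to be written alongside the main table whenever an employee is added or changed.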

Index Form 2

In this form the index has one row per entity, and the row key is V(E) followed by K(E). To look up all the entities E with a given attribute value V(E), simply do a prefix scan of the rows whose keys start with that V(E).

There is a variant for the case when V(E) is not fixed width, where it may be impossible to distinguish the point at which V(E) ends and K(E) begins. A separator may be placed between V(E) and K(E) in the row key, for example V(E) ++ "|" ++ K(E). In this case, the prefix to scan is V(E) ++ "|".

An Employee_Department_Index table could use the DepartmentID an employee works in as the attribute value V(E).
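A small sketch of the separator variant follows. A sorted list of strings stands in for the index table's row keys (HBase keeps row keys in sorted byte order), and the department and employee IDs are invented for illustration. It also assumes the separator character never occurs inside a DepartmentID.

```python
from bisect import bisect_left

SEPARATOR = "|"  # assumed never to appear inside a DepartmentID

# Hypothetical data: EmployeeID -> DepartmentID. Note "D01" vs "D010".
employee_department = {
    "E001": "D01",
    "E002": "D02",
    "E003": "D01",
    "E004": "D010",
}

# Index row keys of the form V(E) ++ "|" ++ K(E), kept sorted.
index_keys = sorted(
    dept + SEPARATOR + emp for emp, dept in employee_department.items()
)


def prefix_scan(sorted_keys, prefix):
    """Seek to the first key >= prefix, then read while keys match it."""
    start = bisect_left(sorted_keys, prefix)
    result = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        result.append(key)
    return result


def employees_in_department(dept):
    prefix = dept + SEPARATOR
    return [key[len(prefix):] for key in prefix_scan(index_keys, prefix)]
```

Scanning with the prefix "D01|" returns only the employees of department D01, whereas scanning with the bare prefix "D01" would also pick up the row for department D010. That is the ambiguity the separator is there to remove.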

Index Form 3

Suppose that V(E) is potentially not unique for each entity E; that is, there may be duplicates. Then the index of S with respect to V is a table with a set of entities per row, where the table has a row key of V(E) and a column family F with one column per entity, using K(E) as the column qualifier. That is, the entities are grouped by attribute value into rows.

To look up all the entities E with a given V(E), fetch the row with key V(E), requesting all columns in the column family F. The column qualifiers in that row are the keys K(E) of the matching entities.

This approach should really be reserved for the case where the number of entities in each row of the index is small.
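Applied to the question's "Simon" example, Form 3 could be sketched like this. Nested dicts stand in for an index table with a single column family, and the SSNs and the helpers build_grouped_index and keys_for_value are hypothetical.

```python
from collections import defaultdict

# Hypothetical data: SSN (the main table's row key) -> employee name.
employee_names = {
    "111-11-1111": "Simon",
    "222-22-2222": "Alice",
    "333-33-3333": "Simon",
}


def build_grouped_index(values_by_key):
    """Index row key is V(E); each qualifier in the row is one K(E)."""
    index = defaultdict(dict)
    for key, value in values_by_key.items():
        # Only the qualifier matters; the cell value is left empty.
        index[value][key] = b""
    return dict(index)


name_index = build_grouped_index(employee_names)


def keys_for_value(index, value):
    """One point read of row V(E); return its qualifiers in sorted order."""
    return sorted(index.get(value, {}))
```

A single read of the "Simon" row yields the row keys of every Simon, which can then be fetched from the main table directly. Since all matches live in one row, a very common name would make that row large, which is why the answer limits this form to small groups.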