
HBase bulk load usage

I am trying to import some HDFS data into an already existing HBase table. The table was created with 2 column families and all the default settings HBase comes with when creating a new table. It is already filled with a large volume of data and has 98 online regions. The row keys it uses are of the form (simplified version): 2-CHARS_ID + 6-DIGIT-NUMBER + 3 X 32-CHAR-MD5-HASH.

Example key: IP281113ec46d86301568200d510f47095d6c99db18630b0a23ea873988b0fb12597e05cc6b30c479dfb9e9d627ccfc4c5dd5fef.

The data I want to import is on HDFS, and I am using a MapReduce job to read it. My mapper emits a Put object for each line read from the HDFS files. The existing data has keys that all start with "XX181113". The job is configured with:

HFileOutputFormat.configureIncrementalLoad(job, hTable)
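
For reference, a simplified sketch of what the mapper and driver look like. The class names, the table name "my_table", the column family "cf1" and the tab-separated line parsing are placeholders, not the exact code:

// Simplified sketch of the bulk-load job (HBase 0.94-era API).
// Names like HdfsToHBaseMapper, "my_table" and "cf1" are placeholders.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadJob {

  // Emits one Put per input line; the row key is assumed to be the first field.
  static class HdfsToHBaseMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split("\t");
      byte[] rowKey = Bytes.toBytes(fields[0]);
      Put put = new Put(rowKey);
      put.add(Bytes.toBytes("cf1"), Bytes.toBytes("col"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(rowKey), put);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "hdfs-to-hbase-bulk-load");
    job.setJarByClass(BulkLoadJob.class);
    job.setMapperClass(HdfsToHBaseMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(Put.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Sets up the TotalOrderPartitioner, the sort reducer, and one reducer
    // per region of the existing table.
    HTable hTable = new HTable(conf, "my_table");
    HFileOutputFormat.configureIncrementalLoad(job, hTable);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}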

Once I start the process, I see it configured with 98 reducers (equal to the number of online regions the table has), but the issue is that 4 reducers got 100% of the data split among them, while the rest did nothing. As a result, I see only 4 folder outputs, each of which is very large. Do these files correspond to 4 new regions that I can then import into the table? And if so, why only 4, when 98 reducers were created? Reading the HBase docs,

In order to function efficiently, HFileOutputFormat must be configured such that each output HFile fits within a single region. In order to do this, jobs whose output will be bulk loaded into HBase use Hadoop's TotalOrderPartitioner class to partition the map output into disjoint ranges of the key space, corresponding to the key ranges of the regions in the table.

confused me even more as to why I get this behaviour.

Thanks!

How the data gets split among the reducers doesn't depend on the number of regions you have in the table, but rather on how the data itself is distributed across those regions (each region covers a range of keys). Since you mention that all your new data starts with the same prefix, it likely fits into only a few regions. You can pre-split your table so that the new data is divided among more regions.
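
For example, a rough sketch of pre-splitting with the old Java admin API (not from the original answer): split the existing table at a few points inside the new data's key range. The table name and split points below are placeholders; real split points would have to come from the actual key distribution.

// Rough sketch: split the region(s) holding the "XX181113..." prefix so that
// the incoming bulk load spreads over more regions. "my_table" and the split
// points are placeholders.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HBaseAdmin admin = new HBaseAdmin(conf);
    try {
      // The MD5 part of the key is hex, so these three placeholder points cut
      // the "XX181113" range into four roughly equal pieces.
      String[] splitPoints = {"XX1811134", "XX1811138", "XX181113c"};
      for (String point : splitPoints) {
        // Splits the region containing the given row at that point.
        admin.split(Bytes.toBytes("my_table"), Bytes.toBytes(point));
      }
    } finally {
      admin.close();
    }
  }
}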
