
Inserting filename as rowkey using HBase MapReduce

Using the Java API, I'm trying to Put() the content of some files into HBase 1.1.x. To do so, I created a WholeFileInput class (ref: Using WholeFileInputFormat with Hadoop MapReduce still results in Mapper processing 1 line at a time) so that MapReduce reads each file in its entirety instead of one line at a time. Unfortunately, I cannot figure out how to build my rowkey from the name of the input file.

Example:

Input:

file-123.txt

file-524.txt

file-9577.txt

...

file-"anotherNumber".txt

Result on my HBase table:

Row-----------------Value

123-----------------"content of 1st file"

524-----------------"content of 2nd file"

...etc

If anyone has already faced this situation, could you help me with it?

Thanks in advance.

Your rowkey can be like this:

rowkey  = prefix + (filenamepart or full file name) + Murmurhash(fileContent)
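As a rough sketch of that formula in plain Java (no HBase classes needed), assuming the `file-<number>.txt` naming from the question: inside a Mapper the file name is usually available as `((FileSplit) context.getInputSplit()).getPath().getName()`. The class name `RowKeyBuilder`, the `-` separators, and the choice of 32-bit MurmurHash3 with seed 0 are my own assumptions for illustration; any Murmur variant (for example the one shipped with HBase or Guava) would serve the same purpose.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HBase or Hadoop.
class RowKeyBuilder {
    // Assumed naming scheme taken from the question: file-<number>.txt
    private static final Pattern FILE_NUMBER = Pattern.compile("file-(\\d+)\\.txt");

    /** Returns the numeric part of the name, or the full name if it does not match. */
    static String filenamePart(String fileName) {
        Matcher m = FILE_NUMBER.matcher(fileName);
        return m.matches() ? m.group(1) : fileName;
    }

    /** MurmurHash3, x86 32-bit variant, written out so the sketch has no dependencies. */
    static int murmur3_32(byte[] data, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h1 = seed;
        int len = data.length;
        int roundedEnd = len & ~3; // last index that starts a full 4-byte block

        for (int i = 0; i < roundedEnd; i += 4) {
            // Read one little-endian 4-byte block.
            int k1 = (data[i] & 0xff)
                    | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16)
                    | ((data[i + 3] & 0xff) << 24);
            k1 *= c1;
            k1 = Integer.rotateLeft(k1, 15);
            k1 *= c2;
            h1 ^= k1;
            h1 = Integer.rotateLeft(h1, 13);
            h1 = h1 * 5 + 0xe6546b64;
        }

        // Mix in the remaining 1 to 3 bytes (intentional fall-through).
        int k1 = 0;
        switch (len & 3) {
            case 3: k1 = (data[roundedEnd + 2] & 0xff) << 16;
            case 2: k1 |= (data[roundedEnd + 1] & 0xff) << 8;
            case 1: k1 |= (data[roundedEnd] & 0xff);
                k1 *= c1;
                k1 = Integer.rotateLeft(k1, 15);
                k1 *= c2;
                h1 ^= k1;
        }

        // Final avalanche step.
        h1 ^= len;
        h1 ^= h1 >>> 16;
        h1 *= 0x85ebca6b;
        h1 ^= h1 >>> 13;
        h1 *= 0xc2b2ae35;
        h1 ^= h1 >>> 16;
        return h1;
    }

    /** rowkey = prefix + filename part + hash of the content, joined with '-' (my choice). */
    static String buildRowKey(String prefix, String fileName, byte[] content) {
        String hash = String.format("%08x", murmur3_32(content, 0));
        return prefix + "-" + filenamePart(fileName) + "-" + hash;
    }
}
```

In the Mapper, the resulting string would be turned into bytes (for example with `Bytes.toBytes(rowKey)`) and passed to the `Put` constructor. If you only want the number as the rowkey, as in the question's expected table, `filenamePart(fileName)` alone is enough.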

where the prefix is a value that falls within the pre-split ranges you defined when creating the table.

For example:

create 'tableName', {NAME => 'colFam', VERSIONS => 2, COMPRESSION => 'SNAPPY'}, 
    {SPLITS => ['0','1','2','3','4','5','6','7']}

The prefix can be any randomly generated ID that falls within the range of the pre-splits.
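A minimal sketch of picking such a prefix, assuming the eight split points `'0'` to `'7'` from the `create` statement. The class name `SaltPrefix` is hypothetical. `randomPrefix()` is the random choice described here; `stablePrefix()` is my own variation, not something from the answer, which derives the prefix from the file name so that the same file always lands under the same prefix. With a purely random prefix, a later lookup by file name has to try all eight prefixes.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper, not part of HBase.
class SaltPrefix {
    // Assumption: matches the split points '0'..'7' used when creating the table.
    static final int BUCKETS = 8;

    /** Random prefix between "0" and "7". */
    static String randomPrefix() {
        return Integer.toString(ThreadLocalRandom.current().nextInt(BUCKETS));
    }

    /** Variation: the same file name always maps to the same prefix. */
    static String stablePrefix(String fileName) {
        // floorMod keeps the result in 0..7 even when hashCode() is negative.
        return Integer.toString(Math.floorMod(fileName.hashCode(), BUCKETS));
    }
}
```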

This kind of rowkey also avoids hot-spotting as the data grows, because the rows are spread across the region servers.
