I am trying to import a few large .csv files into HBase (totaling over 1 TB). The data looks like a dump from a relational DB, but it has no UID, and I do not want to import all of the columns. I decided to run a custom MapReduce job first to get the files into the required format (select columns + generate a UID) so that I can load them with the standard HBase importtsv bulk import.
My question: can I just create my own composite row key, say storeID:year:UID, with MapReduce and then feed it to the TSV import? Say my data looks like this:
row_key  | price | quantity | item_id
A:2012:1 | 0.99  | 1        | 001
A:2012:2 | 0.99  | 2        | 012
B:2013:1 | 0.99  | 1        | 004
From what I understand, HBase stores everything as byte arrays, except for timestamps. Will it understand that this is a composite key?
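To illustrate the point in my own question: HBase treats the row key as an opaque byte array and sorts rows lexicographically by those bytes, so a "composite" key is purely a convention in how you build and parse it. A minimal Python sketch (illustrative only, not the HBase API; the helper name is made up):

```python
# HBase sorts rows by the raw bytes of the key; a key like storeID:year:UID
# is "composite" only in the sense that we chose to build it that way.
def make_row_key(store_id: str, year: int, uid: int) -> bytes:
    # Hypothetical helper: join the parts with ':' and encode to bytes.
    return f"{store_id}:{year}:{uid}".encode("utf-8")

keys = [
    make_row_key("B", 2013, 1),
    make_row_key("A", 2012, 2),
    make_row_key("A", 2012, 1),
]

# Sorting the raw bytes reproduces the order HBase would store the rows in.
for k in sorted(keys):
    print(k.decode())  # A:2012:1, A:2012:2, B:2013:1
```

One caveat with this scheme: unpadded numbers sort lexicographically ("10" comes before "2"), so fixed-width padding of the UID is often used when scan order matters.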
Any hints are appreciated!
I asked the same question over at Cloudera, and the answer can be found here.
Basically, the answer is yes, and no separator characters are needed. I used a MapReduce job to transform the data into the following format:
A2012:1,0.99,1,001
A2012:2,0.99,2,012
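In case it helps others, the transformation itself is simple. Here is a minimal single-process Python sketch of what my MapReduce job does; the input column layout (store, year, price, quantity, item) is an assumption based on the example above:

```python
import csv
from collections import defaultdict
from io import StringIO

def transform(csv_text: str) -> str:
    """Select the wanted columns and prepend a composite row key.

    Input rows are assumed to look like: store_id,year,price,quantity,item_id
    (hypothetical layout); the UID is a running counter per (store, year).
    """
    counters = defaultdict(int)  # (store, year) -> last UID issued
    out = StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for store, year, price, qty, item in csv.reader(StringIO(csv_text)):
        counters[(store, year)] += 1
        uid = counters[(store, year)]
        # Row key: storeID + year + ':' + UID, as in the loaded data above.
        writer.writerow([f"{store}{year}:{uid}", price, qty, item])
    return out.getvalue()

sample = "A,2012,0.99,1,001\nA,2012,0.99,2,012\nB,2013,0.99,1,004\n"
print(transform(sample), end="")
```

Note that in a real distributed MapReduce job a shared in-memory counter like this does not work across mappers; UIDs have to be made unique some other way (for example, task ID plus a task-local counter). The sketch only shows the record shape.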
Using importtsv and completebulkload, the data was then loaded into the correct HBase regions. I pre-split the table on the storeID (A, B, C, ...).
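For reference, the commands looked roughly like this (table name, column family, split points, and paths are placeholders; note that importtsv defaults to tab separation, so the separator must be overridden for comma-delimited data):

```shell
# Create the table pre-split on storeID (hypothetical table/family names).
echo "create 'sales', 'cf', SPLITS => ['B', 'C', 'D']" | hbase shell

# Generate HFiles from the transformed files (paths are placeholders).
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
  -Dimporttsv.separator=, \
  -Dimporttsv.columns=HBASE_ROW_KEY,cf:price,cf:quantity,cf:item_id \
  -Dimporttsv.bulk.output=/tmp/sales-hfiles \
  sales /input/sales

# Move the generated HFiles into the regions of the pre-split table.
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles \
  /tmp/sales-hfiles sales
```

With `-Dimporttsv.bulk.output` set, importtsv writes HFiles instead of issuing Puts, which is what makes the subsequent bulk load fast.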