
How to get the values from HBase table?

I have a table in HBase with one column family called a and around 30 columns in it. Below is a sample that shows the cell values for two row keys:

ROW                                  COLUMN+CELL
 00:001000574                        column=a:aasbig, timestamp=1486493154559, value=true
 00:001000574                        column=a:aasdel, timestamp=1486493154559, value=true
 00:001000574                        column=a:aasdhq, timestamp=1486493154559, value=false
 00:001000574                        column=a:aasfsc, timestamp=1486493154559, value=true
 00:001000574                        column=a:aasgbm, timestamp=1486493154559, value=true
 00:001000574                        column=a:aasgbr, timestamp=1486493154559, value=true
 00:001000574                        column=a:aasmcu, timestamp=1486493154559, value=true
 00:001000574                        column=a:aasser, timestamp=1486493154559, value=true
 00:001000574                        column=a:aastlp, timestamp=1486493154559, value=true
 00:001000574                        column=a:aasvia, timestamp=1486493154559, value=true
 00:001000707                        column=a:aasbig, timestamp=1486493154559, value=false
 00:001000707                        column=a:aasdel, timestamp=1486493154559, value=false
 00:001000707                        column=a:aasdhq, timestamp=1486493154559, value=true
 00:001000707                        column=a:aasfsc, timestamp=1486493154559, value=false
 00:001000707                        column=a:aasgbm, timestamp=1486493154559, value=false
 00:001000707                        column=a:aasgbr, timestamp=1486493154559, value=false
 00:001000707                        column=a:aasmcu, timestamp=1486493154559, value=false
 00:001000707                        column=a:aasser, timestamp=1486493154559, value=false
 00:001000707                        column=a:aastlp, timestamp=1486493154559, value=false
 00:001000707                        column=a:aasvia, timestamp=1486493154559, value=false

Each column has a value of either true or false. These values are subject to change, and a week later they may be different. I would like to capture the old and new values and store the result in a CSV file.

My requirement is that when I run the code for the first time, OLDVALUE should be NULL and all the values from the HBase table should appear under NEWVALUE.

Below is the output I want to see in a CSV file when run for the first time.

NUM,PRODUCT,OLDVALUE,NEWVALUE
001000574,aasbig,NULL,true
001000574,aasdel,NULL,true
001000574,aasdhq,NULL,false
001000574,aasfsc,NULL,true
001000574,aasgbm,NULL,true
001000574,aasgbr,NULL,true
001000574,aasmcu,NULL,true
001000574,aasser,NULL,true
001000574,aastlp,NULL,true
001000574,aasvia,NULL,true
001000707,aasbig,NULL,false
001000707,aasdel,NULL,false
001000707,aasdhq,NULL,true
001000707,aasfsc,NULL,false
001000707,aasgbm,NULL,false
001000707,aasgbr,NULL,false
001000707,aasmcu,NULL,false
001000707,aasser,NULL,false
001000707,aastlp,NULL,false
001000707,aasvia,NULL,false

From the second run onward, all the values under NEWVALUE from the previous run should move to OLDVALUE, and NEWVALUE should hold the current values from the HBase table, as in the sample output below:

NUM,PRODUCT,OLDVALUE,NEWVALUE
001000574,aasbig,true,true
001000574,aasdel,true,true
001000574,aasdhq,false,false
001000574,aasfsc,true,true
001000574,aasgbm,true,false
001000574,aasgbr,true,true
001000574,aasmcu,true,false
001000574,aasser,true,false
001000574,aastlp,true,true
001000574,aasvia,true,true
001000707,aasbig,false,true
001000707,aasdel,false,true
001000707,aasdhq,true,true
001000707,aasfsc,false,false
001000707,aasgbm,false,false
001000707,aasgbr,false,false
001000707,aasmcu,false,true
001000707,aasser,false,true
001000707,aastlp,false,false
001000707,aasvia,false,true

What I tried: I created a Hive-on-HBase table, but when querying it I was only able to get the NUM and the value. I was unable to get the HBase column name. I also had trouble getting the old and new values without implementing some join operations.

Can we write a Pig script to achieve this?

Any help is much appreciated.

This can easily be done programmatically with the following steps:

  1. On the first run, scan the whole table and write out every cell. (There are more optimized variants of this, too.)

         Scan scan = new Scan();
         ResultScanner scanner = table.getScanner(scan);
         for (Result result = scanner.next(); result != null; result = scanner.next()) {
             // Create required format
         }

  2. On later runs, scan the table with a time range, passing a start time and an end time to the API. This fetches the records updated within that period, and the result contains the versions of each updated record. You can use the previous and the most recent version to generate the output. Note that a scan returns only the latest version of each cell unless more are requested, that the column family must be configured to keep more than one version, and that only versions with timestamps inside the time range are returned, so startTime has to be early enough to include the previous value.

         Scan scan = new Scan();
         scan.setTimeRange(startTime, endTime);
         // Ask for every stored version (scan.readAllVersions() in newer clients).
         scan.setMaxVersions();
         ResultScanner scanner = table.getScanner(scan);
         for (Result result = scanner.next(); result != null; result = scanner.next()) {
             // family -> qualifier -> timestamp -> value, for all returned versions
             NavigableMap<byte[], NavigableMap<byte[], NavigableMap<Long, byte[]>>> allVersions = result.getMap();
             // Create required format
         }

I am not sure how to do this with Hive or Pig. Hope this helps!
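As a rough illustration of the "create required format" step, here is a minimal sketch in plain Java. It does not call HBase at all; it only takes the innermost timestamp-to-value map that result.getMap() yields for one column and turns it into one CSV line. The class and method names (VersionCsvFormatter, numFromRowKey, toCsvLine) are made up for this example, and it assumes that the values are stored as the strings "true"/"false" and that NUM is the part of the row key after the "00:" prefix, as the sample in the question suggests.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.NavigableMap;

/** Hypothetical helper: formats one cell's version history as a CSV line. */
final class VersionCsvFormatter {

    static final String HEADER = "NUM,PRODUCT,OLDVALUE,NEWVALUE";

    private VersionCsvFormatter() {}

    /** Assumption: row keys look like "00:001000574"; NUM is the part after the colon. */
    static String numFromRowKey(String rowKey) {
        int colon = rowKey.indexOf(':');
        return colon < 0 ? rowKey : rowKey.substring(colon + 1);
    }

    /** Assumption: cell values are UTF-8 strings; a missing version is written as NULL. */
    static String toText(byte[] value) {
        return value == null ? "NULL" : new String(value, StandardCharsets.UTF_8);
    }

    /**
     * Picks the two most recent versions by timestamp, without relying on
     * the iteration order of the map, and formats them as OLDVALUE,NEWVALUE.
     */
    static String toCsvLine(String rowKey, String qualifier,
                            NavigableMap<Long, byte[]> versions) {
        long newestTs = Long.MIN_VALUE;
        long previousTs = Long.MIN_VALUE;
        byte[] newest = null;
        byte[] previous = null;
        for (Map.Entry<Long, byte[]> e : versions.entrySet()) {
            long ts = e.getKey();
            if (newest == null || ts > newestTs) {
                // Current newest becomes the previous version.
                previous = newest;
                previousTs = newestTs;
                newest = e.getValue();
                newestTs = ts;
            } else if (previous == null || ts > previousTs) {
                previous = e.getValue();
                previousTs = ts;
            }
        }
        return numFromRowKey(rowKey) + "," + qualifier + ","
                + toText(previous) + "," + toText(newest);
    }
}
```

Inside the scan loop, this would be called once per qualifier in the family map, with the row key and qualifier converted from bytes to strings first. On the first run each cell has only one version, so OLDVALUE comes out as NULL; on later runs the previous version fills OLDVALUE, provided the scan actually returned it.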

The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint this post, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

 